vpc-ccg / haslr

A fast tool for hybrid genome assembly of long and short reads
GNU General Public License v3.0

Missing result files #4

Open bpucker opened 4 years ago

bpucker commented 4 years ago

Hi,

Everything runs very well and no error message is given. However, the final result files are just missing. I tried it twice on an Arabidopsis thaliana data set.

PacBio reads from this study: https://doi.org/10.1371/journal.pone.0216233
Illumina reads from this study: https://doi.org/10.1371/journal.pone.0164321

There is no time given for the last step listed in "asm_contigs_k49_a3_lr25x_b500_s3_sim0.85.err":

[...]
[NOTE] cleaning small bubbles...
       removed 228 small bubbles
       elapsed time 325.79 CPU seconds (512.21 real seconds)

[NOTE] calculating long read coordinates between anchors...
       elapsed time 670.53 CPU seconds (548.19 real seconds)

[NOTE] calling consensus sequence between anchors...

Cheers, Boas

haghshenas commented 4 years ago

Hey Boas,

Thanks for trying HASLR. What you are reporting seems like a bug. In order to reproduce this, can you tell me the steps and commands you used, please? Also, can you check N50 of short read assembly (in file sr_k49_a3.contigs.nooverlap.250.fa)?

haghshenas commented 4 years ago

Meanwhile, I downloaded the datasets. On SRX1434948 and SRX1683821, Minia generates very poor short read assemblies. I tried HASLR with SRX1683594 and all PacBio reads (obtained by running bax2bam+bam2fasta) and was able to get the assembly. It seems your long reads have relatively high accuracy, so try different values for --cov-lr and --aln-sim. Increasing them might help to get a more contiguous assembly.

bpucker commented 4 years ago

Thank you very much for your quick reply. You incorporated only the two mate pair data sets in your short read assembly. I just tried the assembly with one paired-end 2x110 nt data set (https://www.ncbi.nlm.nih.gov/sra/SRX1434944). It was just to see if everything is working. There would be substantially more short reads (about 240x coverage, https://www.ncbi.nlm.nih.gov/sra/?term=Niederzenz-1).

Here are some assembly statistics for all contigs >=500bp in sr_k49_a3.contigs.nooverlap.250.fa:

number of contigs: 18496
average contig length: 5532
minimal contig length: 500
maximal contig length: 102139
total number of bases: 102334204
GC content: 0.358499793481
N25: 22833
N50: 12447
N75: 5716
N90: 2381
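For readers who want to reproduce an N50 check like the one requested above, here is a minimal sketch of how Nx statistics are commonly computed from a list of contig lengths. The function name `nx` and the example lengths are made up for illustration; they are not part of HASLR or of the statistics script used in this thread.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx statistic: the contig length L such that contigs of
    length >= L together cover at least `fraction` of all assembled bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    threshold = sum(lengths) * fraction
    running = 0
    # Walk from the longest contig down until the threshold is reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


# Made-up contig lengths, purely for illustration.
example_lengths = [100, 80, 60, 40, 20]
n50 = nx(example_lengths)        # fraction 0.5
n90 = nx(example_lengths, 0.9)   # fraction 0.9
```

With this definition N25 >= N50 >= N75 >= N90, which matches the ordering of the values reported above.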

Here are all the steps starting with installation:

conda create -n HASLR_bp -c bioconda haslr

. /etc/profile.d/conda.sh

conda activate HASLR_bp

haslr.py \
  -o /xxxx/nd1_test/ \
  -g 145g \
  -l /xxxx/subreads.fasta \
  -x pacbio \
  -s /xxxxx/AtNd1120309_ISGA2X_00051_FC64UWKAALane1-1.fastq \
  /xxxxxx/AtNd1120309_ISGA2X_00051_FC64UWKAALane1-2.fastq \
  -t 10

Thanks, Boas

haghshenas commented 4 years ago

This is strange! I just tried on SRX1434944 as well and it is working. The steps I followed:

Is there a chance that the data we are using is not exactly the same?

bpucker commented 4 years ago

I am working with the original data sets that were submitted to the SRA. Therefore, I think we can exclude that. Most steps are working fine, so the names of the reads are probably not an issue. Only the final files "asm.final.fa" and "asm.final.ann" are missing. Here is the content of the "asm_contigs_k49_a3_lr25x_b500_s3_sim0.85" folder:

backbone.01.init.gfa
backbone.01.init.stat
backbone.02.weakEdge.gfa
backbone.02.weakEdge.stat
backbone.03.tip.gfa
backbone.03.tip.log
backbone.03.tip.stat
backbone.04.simplebubble.gfa
backbone.04.simplebubble.log
backbone.04.simplebubble.stat
backbone.05.superbubble.gfa
backbone.05.superbubble.log
backbone.05.superbubble.stat
backbone.06.smallbubble.gfa
backbone.06.smallbubble.log
backbone.06.smallbubble.stat
backbone.branching.log
compact_uniq.txt
index.contig
index.longread
log_consensus.txt
log_coordinate.txt

haghshenas commented 4 years ago

I see your point about the dataset. The problem is that I was not able to reproduce the same issue you encountered with the steps I mentioned. If I can manage to reproduce the issue then I can debug and fix.

TGatter commented 4 years ago

The problem arises on some machines because SPOA is compiled for an instruction set that the CPU does not support. This is likely caused by the "-march=native" flag in the SPOA make files.
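One way to look into this hypothesis on x86 Linux is to compare the instruction-set extensions advertised by the build machine and by the machine that runs the binary. The kernel lists them on the `flags` lines of /proc/cpuinfo, where SSE4.1 appears as `sse4_1`. The sketch below only parses that text; the function names and the sample string are illustrative and have nothing to do with HASLR itself.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def read_cpu_flags(path="/proc/cpuinfo"):
    """Read the flags of the current machine (x86 Linux only)."""
    with open(path) as handle:
        return cpu_flags(handle.read())


# Made-up excerpt in the /proc/cpuinfo layout, for illustration only.
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 ssse3 sse4_1 sse4_2 avx\n"
has_sse41 = "sse4_1" in cpu_flags(sample)
```

Running `read_cpu_flags()` on both machines and taking the set difference would show which extensions exist on the build host but not on the run host.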

haghshenas commented 4 years ago

Sorry for getting back to this so late. Thanks @TGatter for your suggestion. Did you also encounter the same issue?

The option -march=native is supposed to automatically enable -msse4.1 for gcc/g++ if it is supported by the CPU. If this is causing the problem, you should encounter the same problem on every dataset.

@bpucker and @TGatter could you please download the following small Lambda phage sample dataset and try HASLR to check if you see the same issue? http://www.cbcb.umd.edu/software/PBcR/data/sampleData.tar.gz

haslr.py -o test -g 50k -l sampleData/pacbio.filtered_subreads.fasta -x pacbio -s sampleData/illumina.fastq

If everything goes well, the final assembly should be in test/asm_contigs_k49_a3_lr25x_b500_s3_sim0.85/asm.final.fa.

bpucker commented 4 years ago

Thank you for this suggestion. I tried it and the same issue appeared with this test dataset. There is no error message, but the process finished without producing the final assembly file.

haghshenas commented 4 years ago

I see. To check if the problem is really caused by -march=native please try the two tests below:

  1. Remove -march=native from SPOA's makefile at src/haslr_assemble/lib/spoa.make, make and run. This will force SPOA to be built without SIMD instructions.
  2. Replace -march=native with -msse4.1, make and test again.

Please let me know if you can find the final assembly in any of the above tests.
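Both tests above amount to editing a single compiler flag before rebuilding. A minimal sketch of that edit as a text transformation follows; the helper name and the sample makefile line are assumptions for illustration, since the actual contents of src/haslr_assemble/lib/spoa.make are not shown in this thread.

```python
import re


def swap_march_native(makefile_text, replacement=""):
    """Replace every standalone -march=native token.

    An empty replacement removes the flag (test 1); passing "-msse4.1"
    substitutes it (test 2).
    """
    pattern = r"(?<!\S)-march=native(?!\S)"
    swapped = re.sub(pattern, replacement, makefile_text)
    # Collapse runs of spaces left behind by a removal; tabs and newlines stay.
    return re.sub(r" {2,}", " ", swapped)


# Hypothetical makefile line, not copied from the real spoa.make.
line = "CXXFLAGS = -O3 -march=native -std=c++11\n"
without_simd = swap_march_native(line)
with_sse41 = swap_march_native(line, "-msse4.1")
```

After writing the modified text back to the makefile, the project still has to be rebuilt with make for the change to take effect.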

Arielyuan commented 4 years ago

Hi, I also have the same problem of missing output files. I tried with my own genome data and there is no error information during the run.

The log file:

checking /truba/home/odogan/miniconda3/bin/haslr_assemble: ok
checking /truba/home/odogan/miniconda3/bin/minia_nooverlap: ok
checking /truba/home/odogan/miniconda3/bin/fastutils: ok
checking /truba/home/odogan/miniconda3/bin/minia: ok
checking /truba/home/odogan/miniconda3/bin/minimap2: ok
number of threads: 50
output directory: /truba/home/odogan/03.Arion/04.assemble/07.haslr/hashlrAssemble
subsampling 25x long reads to /truba/home/odogan/03.Arion/04.assemble/07.haslr/hashlrAssemble/lr25x.fasta... done
assembling short reads using Minia...

My command:

haslr.py -t 50 -o hashlrAssemble -g 1.5g -l arion.nano.filtered.fa -x nanopore -s ari.scriptfilter.1.filter.cor.fq.gz ari.scriptfilter.2.filter.cor.fq.gz barcoded.fastq.gz

The output files:

lr25x.fasta
sr.fofn
sr_k49_a3.h5
sr_k49_a3.log
sr_k49_a3.unitigs.fa.doubledKmers.*
sr_k49_a3.unitigs.fa.glue*

It seems the run did not finish completely.

Best regards, Ariel

haghshenas commented 4 years ago

Hi Ariel @Arielyuan, the problem you are facing seems to be a different issue, as it is happening during the Minia assembly. Could you open a new issue and attach the sr_k49_a3.log file? It contains the log output of the Minia assembler.

AdrianMBarrera commented 4 years ago

Hi all,

I'm facing the same problem. Did you get to solve it somehow?

First, I installed HASLR on a local system, where it works perfectly.

Now I'm trying to use it on an HPC infrastructure. I installed it through conda and everything seems to run correctly, but the final files are missing.

I don't know if this is related, but I found that Minia version differs between the one installed in my local system (Minia version 3.2.1) and the one installed through conda (Minia version 3.2.0).

Best regards, Adrián

tpriest0 commented 4 years ago

Hi all,

Any further developments on this issue yet? I am having exactly the same problem as originally posted by bpucker.

Everything runs fine and the short read assembly also looks good, but the final asm files are missing.

The last line of the .err file is:

[NOTE] calling consensus sequence between anchors...

Many thanks Taylor

haghshenas commented 4 years ago

Hi @tpriest0 Are you using the version installed by Conda or the latest commit from github? If it is installed by Conda, can you please try to build from source (latest github commit) and try?

@AdrianMBarrera can you also try to build from source on your HPC and retry? The difference is that the version on github uses -msse4.1 while the version on conda uses -march=native.

If it is confirmed that -march=native is the problem I will update the conda recipe to the latest commit. I appreciate your help.

tpriest0 commented 4 years ago

Thank you for the reply. Yes I am currently using the conda installed version. I will build from source and let you know the result.

Many thanks


AdrianMBarrera commented 4 years ago

Hi @haghshenas

I built it from source, but I found another issue.

checking /opt/envhpc/haslr/0.8a1/bin/haslr_assemble: ok
checking /opt/envhpc/haslr/0.8a1/bin/minia_nooverlap: ok
checking /opt/envhpc/haslr/0.8a1/bin/fastutils: ok
checking /opt/envhpc/haslr/0.8a1/bin/minia: ok
checking /opt/envhpc/haslr/0.8a1/bin/minimap2: ok
number of threads: 16
output directory: .../assembly_haslr
[22-Apr-2020 17:24:33] subsampling 25x long reads to .../assembly_haslr/lr25x.fasta... /opt/envhpc/haslr/0.8a1/bin/fastutils: unrecognized option '--fofn'
[ERROR] run "fastutils subsample -h" to see the help

failed
ERROR: "fastutils" returned non-zero exit status

Should I open another issue?

Thank you in advance.

haghshenas commented 4 years ago

Hi @AdrianMBarrera, thanks for reporting. I had added a new feature for reading input from a FOFN (file of file names) but forgot to update the fastutils version in the Makefile. It should be fixed now. Can you try again and let me know?
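For readers unfamiliar with the term, a FOFN is a plain text file that lists input files instead of containing sequence data itself. The sketch below assumes the common layout of one path per line with blank lines ignored; the exact format accepted by fastutils is not described in this thread, and the helper name and paths are made up.

```python
def read_fofn(text):
    """Return the paths listed in FOFN text, one per line, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Made-up FOFN content, for illustration only.
example = "reads/run1.fastq\n\nreads/run2.fastq\n"
paths = read_fofn(example)
```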

tpriest0 commented 4 years ago

Hi @haghshenas

I built the haslr package from source and then tried to re-run the assembler.

It results in this line appearing:

'checking haslr_assemble: failed'

Everything appears to be compiled properly so I am not sure what the reason for this is.

Thanks


haghshenas commented 4 years ago

Hi @tpriest0 This is really strange. Did you compile the code using make as explained here? If yes, can you check if ./bin/haslr_assemble -h prints the help message?

tpriest0 commented 4 years ago

Hi @haghshenas

Yes I compiled the code using make and there were no errors during compilation.

Running haslr_assemble -h does indeed print the help menu, so it seems very strange that it won't actually run.


nbargues commented 3 years ago

Hi, I have a similar issue: everything seems to work well, but no final assembly is created. The error log:

[NOTE] number of threads: 18

[NOTE] loading contig sequences...
       processing file: /home/cxx5wc/muc1/hybrid/haslr_hybrid_v3/sr_k49_a3.contigs.nooverlap.fa... Done in 0.41 CPU seconds (0.41 real seconds)
       loaded 273585 contigs
       elapsed time 0.44 CPU seconds (0.44 real seconds)

[NOTE] calculating kmer frequency of unique contigs
       mean: 26.20
       elapsed time 0.47 CPU seconds (0.47 real seconds)

[NOTE] loading long read sequences...
       processing file: /home/cxx5wc/muc1/hybrid/haslr_hybrid_v3/lr25x.fasta... Done in 0.86 CPU seconds (0.86 real seconds)
       loaded 356001 long reads
       elapsed time 1.33 CPU seconds (1.33 real seconds)

[NOTE] loading alignment between contigs and long reads...
       processing file: /home/cxx5wc/muc1/hybrid/haslr_hybrid_v3/map_contigs_k49_a3_lr25x.paf... Done in 0.30 CPU seconds (0.30 real seconds)
       loaded 576 alignments
       elapsed time 1.68 CPU seconds (1.68 real seconds)

[NOTE] fixing overlapping alignments...
       elapsed time 1.69 CPU seconds (1.69 real seconds)

[NOTE] building compact long reads...
       elapsed time 1.76 CPU seconds (1.76 real seconds)

[NOTE] building the backbone graph...
       elapsed time 1.79 CPU seconds (1.79 real seconds)

[NOTE] cleaning weak edges...
       removed 117 edges
       elapsed time 1.80 CPU seconds (1.80 real seconds)

[NOTE] cleaning tips...
       removed 0 tips
       elapsed time 1.82 CPU seconds (1.82 real seconds)

[NOTE] cleaning simple bubbles...
       removed 0 simple bubbles
       elapsed time 1.83 CPU seconds (1.83 real seconds)

[NOTE] cleaning super bubbles...
       removed 0 super bubbles
       elapsed time 1.85 CPU seconds (1.85 real seconds)

[NOTE] cleaning small bubbles...
       removed 0 small bubbles
       elapsed time 1.86 CPU seconds (1.86 real seconds)

[NOTE] calculating long read coordinates between anchors...
       elapsed time 1.88 CPU seconds (1.87 real seconds)

[NOTE] calling consensus sequence between anchors...
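Several reports in this thread share the same signature: the log ends on a [NOTE] step that never prints an "elapsed time" line. A minimal sketch for spotting that in a .err file is shown here. It assumes, based only on the logs quoted in this thread, that a step which completes is followed by such a line; the function name is illustrative and not part of HASLR.

```python
def last_step(log_text):
    """Return (description, finished) for the final [NOTE] step in the log.

    `finished` is True only if an "elapsed time" line follows that step.
    """
    description, finished = None, False
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("[NOTE]"):
            description = line[len("[NOTE]"):].strip()
            finished = False
        elif "elapsed time" in line:
            finished = True
    return description, finished


# Tail of a log in the layout quoted in this thread.
sample_log = """\
[NOTE] calculating long read coordinates between anchors...
       elapsed time 1.88 CPU seconds (1.87 real seconds)

[NOTE] calling consensus sequence between anchors...
"""
step, done = last_step(sample_log)
```

Here `step` is the consensus-calling step and `done` is False, which is the pattern reported by several users above.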

tpshea2 commented 3 years ago

I have also encountered empty "asm.final.fa" files, with no error that I could find. This occurred in two out of three test runs on bacterial hybrid assemblies (Illumina+ONT). The three tests were on different genera. Further, the one that did generate content in the asm.final.fa file was missing roughly 25% of the genome (including a 19kb plasmid as well as a ~500Kb region of the chromosome). I am not sure whether these two issues (empty FASTA file and an incomplete genome) are related.

NTNguyen13 commented 3 years ago

I also encountered this problem when using the conda version. Everything runs fine, but there is no final output.

haslr.py -t 6 -g 3g -l /mnt/Data/GIAB/NA12878/FASTQ_Illumina/HG001.m64011_190326_191011.consensusreads.fastq.gz -x pacbio -s /mnt/Data/GIAB/NA12878/FASTQ_Illumina/U0a_CGATGT_L001_R1_001.fastq.gz /mnt/Data/GIAB/NA12878/FASTQ_Illumina/U0a_CGATGT_L001_R2_001.fastq.gz -o /mnt/Data/GIAB/NA12878/Assembly_HASLR

[NOTE] number of threads: 6

[NOTE] loading contig sequences...
       processing file: /mnt/Data/GIAB/NA12878/Assembly_HASLR/sr_k49_a3.contigs.nooverlap.fa... Done in 0.03 CPU seconds (0.03 real seconds)
       loaded 108337 contigs
       elapsed time 0.03 CPU seconds (0.03 real seconds)

[NOTE] calculating kmer frequency of unique contigs
       mean: 30.91
       elapsed time 0.04 CPU seconds (0.04 real seconds)

[NOTE] loading long read sequences...
       processing file: /mnt/Data/GIAB/NA12878/Assembly_HASLR/lr25x.fasta... Done in 36.05 CPU seconds (48.94 real seconds)
       loaded 519210 long reads
       elapsed time 36.09 CPU seconds (48.98 real seconds)

[NOTE] loading alignment between contigs and long reads...
       processing file: /mnt/Data/GIAB/NA12878/Assembly_HASLR/map_contigs_k49_a3_lr25x.paf... Done in 0.49 CPU seconds (0.49 real seconds)
       loaded 16560 alignments
       elapsed time 37.86 CPU seconds (50.74 real seconds)

[NOTE] fixing overlapping alignments...
       elapsed time 37.92 CPU seconds (50.81 real seconds)

[NOTE] building compact long reads...
       elapsed time 37.98 CPU seconds (50.86 real seconds)

[NOTE] building the backbone graph...
       elapsed time 37.99 CPU seconds (50.88 real seconds)

[NOTE] cleaning weak edges...
       removed 22 edges
       elapsed time 38.00 CPU seconds (50.89 real seconds)

[NOTE] cleaning tips...
       removed 3 tips
       elapsed time 38.00 CPU seconds (50.89 real seconds)

[NOTE] cleaning simple bubbles...
       removed 5 simple bubbles
       elapsed time 38.00 CPU seconds (50.89 real seconds)

[NOTE] cleaning super bubbles...
       removed 0 super bubbles
       elapsed time 38.01 CPU seconds (50.89 real seconds)

[NOTE] cleaning small bubbles...
       removed 1 small bubbles
       elapsed time 38.01 CPU seconds (50.90 real seconds)

[NOTE] calculating long read coordinates between anchors...
       elapsed time 38.09 CPU seconds (50.91 real seconds)

[NOTE] calling consensus sequence between anchors...

If I change to github version, then the error is:

./haslr.py -t 6 -g 3g -l /mnt/Data/GIAB/NA12878/FASTQ_Illumina/HG001.m64011_190326_191011.consensusreads.fastq.gz -x corrected -s /mnt/Data/GIAB/NA12878/FASTQ_Illumina/U0a_CGATGT_L001_R1_001.fastq.gz /mnt/Data/GIAB/NA12878/FASTQ_Illumina/U0a_CGATGT_L001_R2_001.fastq.gz -o /mnt/Data/GIAB/NA12878/Assembly_HASLR
checking /home/nguyen/Exec/haslr/bin/haslr_assemble: ok
checking /home/nguyen/Exec/haslr/bin/minia_nooverlap: ok
checking /home/nguyen/Exec/haslr/bin/fastutils: ok
checking /home/nguyen/Exec/haslr/bin/minia: ok
checking /home/nguyen/Exec/haslr/bin/minimap2: ok
number of threads: 6
output directory: /mnt/Data/GIAB/NA12878/Assembly_HASLR
[20-Oct-2020 14:13:19] subsampling 25x long reads to /mnt/Data/GIAB/NA12878/Assembly_HASLR/lr25x.fasta... done
[20-Oct-2020 14:18:43] assembling short reads using Minia... done
[20-Oct-2020 14:20:15] removing overlaps in short read assembly... done
[20-Oct-2020 14:20:15] removing short sequences in short read assembly... done
[20-Oct-2020 14:20:15] aligning long reads to short read assembly using minimap2... failed
ERROR: "minimap2" returned non-zero exit status

fancyge commented 3 years ago

Hello, I'm using the github version and first tried the E. coli example dataset from https://github.com/vpc-ccg/haslr, which worked fine. But I got this error with my own dataset.

/Softwares/haslr/bin/haslr.py -t 24 -o haslr -g 800m -l long_porechop.fastq -x nanopore -s R1.fastq.gz R2.fastq.gz
checking /Softwares/haslr/bin/haslr_assemble: ok
checking /Softwares/haslr/bin/minia_nooverlap: ok
checking /Softwares/haslr/bin/fastutils: ok
checking /Softwares/haslr/bin/minia: ok
checking /Softwares/haslr/bin/minimap2: ok
number of threads: 24
output directory: /Assembly/haslr
[20-Jan-2021 10:15:02] subsampling 25x long reads to /Assembly/haslr/lr25x.fasta... done
[20-Jan-2021 10:15:29] assembling short reads using Minia... failed
ERROR: "minia" returned non-zero exit status

So I looked at the options and tried the E. coli example dataset again to see the difference. I changed the estimated genome size to 5m (-g 5m), and strangely I got different output when running multiple times with exactly the same dataset. Specifically, the error ERROR: "haslr_assemble" returned non-zero exit status seemed to occur randomly, and when it occurred, no "asm.final.fa" file could be found. I also tried -g 3m and always got this same error and no final assembly. Is this error caused by an inaccurate estimated genome size? Can you tell me how important the estimated genome size setting is, since I'm doing de novo assembly, and what I should do about the above error for my dataset? Thanks a lot for your help.
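A "non-zero exit status" message alone does not say how the program ended. When a command is launched from Python with the standard `subprocess` module on POSIX systems, a negative return code -N means the child was terminated by signal N; a crash on an unsupported CPU instruction, as suspected earlier in this thread, would typically show up as SIGILL. The helper below is a generic sketch for decoding such a code; it is not taken from haslr.py, and its name is made up.

```python
import signal


def describe_returncode(returncode):
    """Turn a subprocess return code into a short human-readable description."""
    if returncode == 0:
        return "success"
    if returncode < 0:
        # Negative values mean "terminated by signal -returncode" (POSIX).
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exit status {returncode}"


illegal_instruction = describe_returncode(-signal.SIGILL)
plain_failure = describe_returncode(1)
```

Running haslr_assemble directly and inspecting its return code this way could help distinguish a crash from an ordinary error exit.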

cssanchez23 commented 2 years ago

I just tried this with both the git clone install and the conda install. In both instances I did not receive an error; however, asm.final.fa was empty. Has anyone found a fix?
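Because the runs described in this thread can end without an error message, it can help to check the output directory explicitly. The following minimal sketch reports, for each asm_contigs_* folder, whether asm.final.fa is missing, empty, or how many FASTA records it holds. The folder and file names follow the paths quoted in this thread; the function name is an assumption for illustration.

```python
from pathlib import Path


def final_assembly_status(out_dir):
    """Map each asm_contigs_* folder to "missing", "empty", or a record count."""
    status = {}
    for asm_dir in sorted(Path(out_dir).glob("asm_contigs_*")):
        if not asm_dir.is_dir():
            continue
        fasta = asm_dir / "asm.final.fa"
        if not fasta.exists():
            status[asm_dir.name] = "missing"
        elif fasta.stat().st_size == 0:
            status[asm_dir.name] = "empty"
        else:
            # Count FASTA header lines to get the number of sequences.
            with fasta.open() as handle:
                status[asm_dir.name] = sum(1 for line in handle if line.startswith(">"))
    return status
```

Calling `final_assembly_status("test")` after the Lambda phage example run would return a dictionary keyed by the asm_contigs_* folder name.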

wanyuac commented 1 year ago

I had the same issue of missing final FASTA outputs with the conda version.