mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs

Flye does not generate any output ("No disjointigs were assembled" message) #128

Open StefanoLonardi opened 5 years ago

StefanoLonardi commented 5 years ago

I have been trying to assemble a 10Mb genome with uncorrected nanopore data (3-4 chromosomes expected). We have a lot of data; is that the reason Flye fails at the end?

[2019-06-22 11:00:05] INFO: >>>STAGE: configure
[2019-06-22 11:00:05] INFO: Configuring run
[2019-06-22 11:00:27] INFO: Total read length: 10964270213
[2019-06-22 11:00:27] INFO: Input genome size: 10000000
[2019-06-22 11:00:27] INFO: Estimated coverage: 1096
[2019-06-22 11:00:27] WARNING: Expected read coverage is 1096, the assembly is not guaranteed to be optimal in this setting. Are you sure that the genome size was entered correctly?
[2019-06-22 11:00:27] INFO: Reads N50/N90: 29675 / 9753
[2019-06-22 11:00:27] INFO: Minimum overlap set to 5000
[2019-06-22 11:00:27] INFO: Selected k-mer size: 15
[2019-06-22 11:00:27] INFO: >>>STAGE: assembly
[2019-06-22 11:00:27] INFO: Assembling disjointigs
[2019-06-22 11:00:27] INFO: Reading sequences
[2019-06-22 11:01:01] INFO: Generating solid k-mer index
[2019-06-22 11:01:17] INFO: Counting k-mers (1/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2019-06-22 11:02:49] INFO: Counting k-mers (2/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2019-06-22 11:08:39] INFO: Filling index table
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2019-06-22 11:13:50] INFO: Extending reads
[2019-06-22 12:54:29] INFO: Overlap-based coverage: 1177
[2019-06-22 12:54:29] INFO: Median overlap divergence: 0.119637
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2019-06-23 17:20:11] INFO: Assembled 0 disjointigs
[2019-06-23 17:20:23] INFO: Generating sequence
[2019-06-23 17:22:11] ERROR: No disjointigs were assembled - please check if the read type and genome size parameters are correct

flye --nano-raw one.fastq --out-dir flye --genome-size 10m --threads 20

mikolmogorov commented 5 years ago

Interesting, it looks like a lot of overlaps were indeed found, but no disjointigs were assembled. Could you send me the full flye.log? I also suggest trying --meta mode - it is more robust in solid k-mer selection in case there is any contamination or instrument-derived artificial sequence.

StefanoLonardi commented 5 years ago

[2019-06-22 11:00:05] root: INFO: Starting Flye 2.4.2-release
[2019-06-22 11:00:05] root: DEBUG: Cmd: /home/stelo/miniconda2/bin/flye --nano-raw Bduncani_06182019_pass.fastq --out-dir babesia_flye --genome-size 10m --threads 20
[2019-06-22 11:00:05] root: INFO: >>>STAGE: configure
[2019-06-22 11:00:05] root: INFO: Configuring run
[2019-06-22 11:00:27] root: INFO: Total read length: 10964270213
[2019-06-22 11:00:27] root: INFO: Input genome size: 10000000
[2019-06-22 11:00:27] root: INFO: Estimated coverage: 1096
[2019-06-22 11:00:27] root: WARNING: Expected read coverage is 1096, the assembly is not guaranteed to be optimal in this setting. Are you sure that the genome size was entered correctly?
[2019-06-22 11:00:27] root: INFO: Reads N50/N90: 29675 / 9753
[2019-06-22 11:00:27] root: INFO: Minimum overlap set to 5000
[2019-06-22 11:00:27] root: INFO: Selected k-mer size: 15
[2019-06-22 11:00:27] root: INFO: >>>STAGE: assembly
[2019-06-22 11:00:27] root: INFO: Assembling disjointigs
[2019-06-22 11:00:27] root: DEBUG: -----Begin assembly log------
[2019-06-22 11:00:27] root: DEBUG: Running: flye-assemble -l /24-2/home/stelo/babesia/babesia_flye/flye.log -t 20 -v 5000 -k 15 Bduncani_06182019_pass.fastq /24-2/home/stelo/babesia/babesia_flye/00-assembly/draft_assembly.fasta 10000000 /home/stelo/miniconda2/lib/python2.7/site-packages/flye/config/bin_cfg/asm_raw_reads.cfg
[2019-06-22 11:00:27] DEBUG: Build date: Apr 7 2019 02:34:37
[2019-06-22 11:00:27] DEBUG: Total RAM: 251 Gb
[2019-06-22 11:00:27] DEBUG: Available RAM: 245 Gb
[2019-06-22 11:00:27] DEBUG: Total CPUs: 40
[2019-06-22 11:00:27] DEBUG: Parameters:
[2019-06-22 11:00:27] DEBUG: big_genome_threshold=29000000
[2019-06-22 11:00:27] DEBUG: low_cutoff_warning=1
[2019-06-22 11:00:27] DEBUG: hard_min_coverage_rate=10
[2019-06-22 11:00:27] DEBUG: assemble_kmer_sample=1
[2019-06-22 11:00:27] DEBUG: repeat_graph_kmer_sample=1
[2019-06-22 11:00:27] DEBUG: read_align_kmer_sample=1
[2019-06-22 11:00:27] DEBUG: maximum_jump=1500
[2019-06-22 11:00:27] DEBUG: maximum_overhang=1500
[2019-06-22 11:00:27] DEBUG: repeat_kmer_rate=100
[2019-06-22 11:00:27] DEBUG: assemble_ovlp_divergence=0.30
[2019-06-22 11:00:27] DEBUG: repeat_graph_ovlp_divergence=0.15
[2019-06-22 11:00:27] DEBUG: repeat_graph_ovlp_end_adjust=0.00
[2019-06-22 11:00:27] DEBUG: read_align_ovlp_divergence=0.25
[2019-06-22 11:00:27] DEBUG: max_coverage_drop_rate=5
[2019-06-22 11:00:27] DEBUG: chimera_window=100
[2019-06-22 11:00:27] DEBUG: min_reads_in_disjointig=4
[2019-06-22 11:00:27] DEBUG: max_inner_reads=10
[2019-06-22 11:00:27] DEBUG: max_inner_fraction=0.25
[2019-06-22 11:00:27] DEBUG: add_unassembled_reads=0
[2019-06-22 11:00:27] DEBUG: max_separation=500
[2019-06-22 11:00:27] DEBUG: tip_length_threshold=100000
[2019-06-22 11:00:27] DEBUG: unique_edge_length=50000
[2019-06-22 11:00:27] DEBUG: min_repeat_res_support=0.51
[2019-06-22 11:00:27] DEBUG: out_paths_ratio=5
[2019-06-22 11:00:27] DEBUG: graph_cov_drop_rate=10
[2019-06-22 11:00:27] DEBUG: coverage_estimate_window=100
[2019-06-22 11:00:27] DEBUG: extend_contigs_with_repeats=1
[2019-06-22 11:00:27] DEBUG: Running with k-mer size: 15
[2019-06-22 11:00:27] DEBUG: Running with minimum overlap 5000
[2019-06-22 11:00:27] DEBUG: Metagenome mode: N
[2019-06-22 11:00:27] INFO: Reading sequences
[2019-06-22 11:01:01] DEBUG: Building positional index
[2019-06-22 11:01:01] DEBUG: Total sequence: 10964270213 bp
[2019-06-22 11:01:01] DEBUG: Expected read coverage: 1096
[2019-06-22 11:01:01] INFO: Generating solid k-mer index
[2019-06-22 11:01:01] DEBUG: Hard threshold set to 5
[2019-06-22 11:01:01] DEBUG: Started k-mer counting
[2019-06-22 11:01:17] INFO: Counting k-mers (1/2):
[2019-06-22 11:02:49] INFO: Counting k-mers (2/2):
[2019-06-22 11:08:39] DEBUG: Estimated minimum kmer coverage: 155
[2019-06-22 11:08:39] DEBUG: Filtered 301351751 erroneous k-mers
[2019-06-22 11:08:39] DEBUG: Repetitive k-mer frequency: 55681
[2019-06-22 11:08:39] DEBUG: Filtered 897 repetitive k-mers (8.98678e-05)
[2019-06-22 11:08:39] INFO: Filling index table
[2019-06-22 11:08:44] DEBUG: Sampling rate: 1
[2019-06-22 11:08:44] DEBUG: Solid k-mers: 9980428
[2019-06-22 11:08:44] DEBUG: K-mer index size: 5380562281
[2019-06-22 11:08:44] DEBUG: Mean k-mer frequency: 539.111
[2019-06-22 11:12:31] DEBUG: Sorting k-mer index
[2019-06-22 11:13:50] DEBUG: Peak RAM usage: 28 Gb
[2019-06-22 11:13:50] INFO: Extending reads
[2019-06-22 11:13:50] DEBUG: Estimating overlap coverage
[2019-06-22 12:54:29] INFO: Overlap-based coverage: 1177
[2019-06-22 12:54:29] INFO: Median overlap divergence: 0.119637
[2019-06-22 12:54:29] DEBUG: Sequence divergence distribution:

|                      *
|                      *
|                    * *
|                   ** **
|                   *****
|                   ******
|                   ********
|                   ********
|                  *********
|                  *********
|                  ***********
|                 ************
|                 ************* *
|                 ************* *
|                 ************* *
|                *****************  *
|                *********************
|                **********************
|               *************************
|             **************************************** * *     ** *
----------------------------------------------------------------------------------------------------
0%        5%        10%       15%       20%       25%       30%       35%       40%       45%

Q25 = 0.1, Q50 = 0.12, Q75 = 0.14

[2019-06-23 17:20:11] INFO: Assembled 0 disjointigs
[2019-06-23 17:20:23] INFO: Generating sequence
[2019-06-23 17:20:23] DEBUG: Writing FASTA
[2019-06-23 17:20:23] DEBUG: Peak RAM usage: 78 Gb
-----------End assembly log------------
[2019-06-23 17:22:11] root: ERROR: No disjointigs were assembled - please check if the read type and genome size parameters are correct

mikolmogorov commented 5 years ago

Thank you, indeed looks strange. Maybe high coverage confuses Flye, but I also suspect there might be some non-target reads in the sample.

I suggest trying two more runs: (i) metagenome mode, and (ii) normal mode with --asm-coverage 50, which uses the longest 50x of reads for disjointig assembly. Please post the corresponding logs as well.
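
The idea of keeping only the longest reads up to a target coverage can be sketched as follows. This is an illustration of the concept only, not Flye's actual code; the function name `select_longest_reads` and the toy read lengths are hypothetical.

```python
def select_longest_reads(read_lengths, genome_size, target_coverage):
    """Pick the longest reads until their combined length reaches
    target_coverage * genome_size. Returns the chosen read lengths."""
    budget = target_coverage * genome_size
    chosen, total = [], 0
    for length in sorted(read_lengths, reverse=True):
        if total >= budget:
            break
        chosen.append(length)
        total += length
    return chosen


# Toy example: a 1,000 bp "genome" and a 3x target (a 3,000 bp budget).
lengths = [900, 200, 1500, 400, 1200, 100]
subset = select_longest_reads(lengths, genome_size=1000, target_coverage=3)
print(subset, sum(subset))
```

For the run in this thread, 50x of a 10 Mb genome is about 500 Mb, which is under 5% of the ~10.96 Gb of reads reported in the log.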

StefanoLonardi commented 4 years ago

I just finished running Flye with the two runs that you suggested. Both of them completed, but the assembly with "--asm-coverage 50" seems better (in terms of N50, total size, etc.). Thank you

mikolmogorov commented 4 years ago

Glad that it helped!

dgiguer commented 4 years ago

The solution of normal mode with --asm-coverage 50 has helped in a similar case where lots of overlap is found but no disjointigs are assembled for a plasmid!

ptrebert commented 4 years ago

@fenderglass Could you please take a quick look at the log output for the sample where flye fails to assemble disjointigs: gist.github.com/ptrebert/3964d66cd60af3e7a19d95d166707ed2

Since I am running flye with --asm-coverage 50 by default, I am a bit unsure how to proceed with this sample.

mikolmogorov commented 4 years ago

@ptrebert Seems strange. My only guess would be that PacBio reads might not be properly split into subreads (we had a couple cases like that before). Try to process the reads with https://github.com/fenderglass/pbclip - it should tell you if there is a significant amount of "chimeric" subreads.

Alternatively, you can also try to run with --meta option if the reads turn out ok.

ptrebert commented 4 years ago

@fenderglass Ok, thanks for pointing out your tool, I'll check that and get back to you.

ptrebert commented 4 years ago

ping: testing Flye 2.7b-b1562 on sample with no disjointigs assembled - still running...

ptrebert commented 4 years ago

@fenderglass For my problematic sample, flye 2.7b did not solve the issue (same "no disjointigs assembled"). I followed your suggestion and used your pbclip tool, which finished and reported the following:

Good: 15725667 chopped: 409754 bad: 662955

Could you help with interpreting these numbers (I may want to get in touch with the seq lab about this sample)? I'll try to assemble the output FASTA now with flye v2.7b, let's see what happens.

mikolmogorov commented 4 years ago

@ptrebert

pbclip finds PacBio reads that were not properly split into subreads. Depending on the DNA library, polymerase might make multiple passes over the fragment (which is used to produce high quality CCS reads). However, fragments in CLR libraries (at least from the assembly perspective) are not expected to be read multiple times to produce longer reads. When multiple passes do happen, such reads should be split into subreads (each subread is a single polymerase pass). Typically this is handled by the PacBio software at the FASTQ generation stage.

The numbers suggest that ~40% of your reads have multiple polymerase passes. This is a lot (typical value could be 1-2%) and suggests that there is indeed an issue with subread splitting. The chopped reads are those that pbclip was able to split into parts successfully. The bad reads are those with the same pattern that pbclip was not able to recover.

Feel free to run the latest Flye version on the output produced by pbclip - I think it should work now. You can also double check with the lab if they performed subread splitting or have raw PacBio files to regenerate valid FASTQs.

ptrebert commented 4 years ago

@fenderglass Thanks a lot for your detailed explanation. I am not sure, however, I can follow your argument about the 40% "bad" reads:

Total: 16798376
Bad = chopped + bad = 409754 + 662955 = 1072709
% bad = 1072709 / 16798376 ~ 6.4%

Am I missing something, or did you just misread the "bad" number as 6 million instead of 600k? In either case, thanks again for all your input, that is very valuable. I'll update this issue as soon as I have the 2.7b results for the corrected reads.
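
The arithmetic in this comment can be re-checked with a few lines of Python. The three counts are the ones pbclip reported earlier in the thread; the variable names are only for illustration.

```python
# Counts reported by pbclip earlier in this thread.
good, chopped, bad = 15725667, 409754, 662955

total = good + chopped + bad
affected = chopped + bad  # reads showing the multi-pass pattern

print(f"total reads:       {total}")
print(f"chopped + bad:     {affected}")
print(f"chopped fraction:  {chopped / total:.1%}")
print(f"bad fraction:      {bad / total:.1%}")
print(f"affected fraction: {affected / total:.1%}")
```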

ptrebert commented 4 years ago

probably last comment regarding this: even with the corrected reads (FASTA input now), flye 2.7b fails to assemble disjointigs. Seems like there is something else off about this data...

mikolmogorov commented 4 years ago

@ptrebert I see - this could be tricky sometimes. Did you have any luck with other assemblers? Wtdbg2 might be a fast way to check.

ptrebert commented 4 years ago

@fenderglass If I find the time, I'll try another assembler. For now, I asked the sequencing centre to double-check everything about this particular sample, let's see if they find something...

ptrebert commented 4 years ago

@fenderglass A postdoc in the sequencing center that produced the problematic data in the first place ran a couple of tests with different input combinations, and also with wtdbg2 as a comparison. Since none of those test runs produced an assembly, it seems fairly clear that the problem is the data. Just out of curiosity, since we have all the flye logs for the different runs, is there any statistic in those log files that could tell us anything about the problem(s) in the data? To me, they all look pretty similar (well, they all failed), so just being thorough here...

mikolmogorov commented 4 years ago

@ptrebert good to know, thanks for the update! At this early stage of assembly, not much could be inferred from the logs, I think. I guess if the log shows that "Overlap-based coverage" is reasonable (let's say, >10), but no disjointigs are produced, then there is a problem somewhere.
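
A small helper for pulling these two numbers out of a flye.log could look like the sketch below. It relies only on the "Overlap-based coverage" and "Assembled N disjointigs" lines as they appear in the logs posted in this thread; the function names, the returned messages, and the default threshold of 10 (taken from the "let's say, >10" above) are illustrative, not part of Flye.

```python
import re


def summarize_flye_log(log_text):
    """Return (overlap_coverage, disjointigs) parsed from Flye log text.
    Either value is None if the corresponding line is missing."""
    cov = re.search(r"Overlap-based coverage:\s*(\d+)", log_text)
    dis = re.search(r"Assembled\s+(\d+)\s+disjointigs", log_text)
    return (int(cov.group(1)) if cov else None,
            int(dis.group(1)) if dis else None)


def diagnose(overlap_coverage, disjointigs, min_coverage=10):
    """Rough rule of thumb only; the messages are hypothetical labels."""
    if overlap_coverage is None or disjointigs is None:
        return "incomplete log"
    if disjointigs > 0:
        return "disjointigs assembled"
    if overlap_coverage > min_coverage:
        return "overlaps found but nothing assembled"
    return "little or no overlap between reads"


# Two lines copied from the first log in this thread.
sample = (
    "[2019-06-22 12:54:29] INFO: Overlap-based coverage: 1177\n"
    "[2019-06-23 17:20:11] INFO: Assembled 0 disjointigs\n"
)
coverage, disjointigs = summarize_flye_log(sample)
print(coverage, disjointigs, "->", diagnose(coverage, disjointigs))
```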

ptrebert commented 4 years ago

No, they all show a zero for the "overlap-based coverage". Whatever the problem is, it's in the data then... thanks for all your support!

vappiah commented 4 years ago

Hello All, I am working on a Mycobacterium ulcerans genome which was sequenced with Oxford Nanopore technology. I am trying to do de novo assembly with flye, but I run into a warning and the pipeline stops. The command I used is
flye --nano-raw filename.fa -o outdir -g 0.05m -t 34 -i 2

I get this message below

WARNING: Expected read coverage is 4744, the assembly is not guaranteed to be optimal in this setting. Are you sure that the genome size was entered correctly?
Pipeline aborted

mikolmogorov commented 4 years ago

@jotes35 your expected genome size is 50kb (0.05 Mb). It needs to be "5m", not "0.05m" (assuming you are aiming for 5 Mb genome).
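
To see how much the suffix matters, here is a small sketch that converts a genome size string such as "0.05m" or "5m" into base pairs and recomputes the expected coverage. The parser is a hypothetical helper written for this example, not Flye's own, and the read total is back-calculated from the 4744x warning, so it is approximate.

```python
SUFFIXES = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}


def parse_genome_size(text):
    """Convert strings like '6k', '0.05m' or '5m' to a number of base pairs."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIXES:
        return round(float(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)


def expected_coverage(total_read_bases, genome_size):
    """Expected coverage = total read length / genome size."""
    return total_read_bases / genome_size


# Read total implied by the warning: roughly 4744x at a 50 kb genome size.
approx_read_bases = 4744 * parse_genome_size("0.05m")
for size in ("0.05m", "5m"):
    cov = expected_coverage(approx_read_bases, parse_genome_size(size))
    print(f"-g {size}: ~{cov:.0f}x")
```

The same formula reproduces the first log in this thread: 10964270213 bp of reads over a 10m genome gives the reported estimated coverage of 1096.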

vappiah commented 4 years ago

Please, is there a way to know the expected genome size beforehand?

vappiah commented 4 years ago

@fenderglass is there a way to know the expected genome size before starting the assembly?

mikolmogorov commented 4 years ago

@jotes35 Please check the FAQ - it provides some answers to your question. Let me know if anything is unclear.

eyayd commented 4 years ago

Hello, I have the same problem "No disjointigs were assembled". Expected genome size is 110M and my expected coverage is about 49. I tried --meta and different --asm-coverage values (since my overall coverage is smaller than 50x) but it didn't solve the issue. My N50 is quite high, would that be the reason I am getting the error? P40.pdf

vappiah commented 4 years ago

@eyayd This is what worked for me: I looked up the genome size of my organism (in my case 6.5 Mb). When I gave that size to Flye, it still raised the flag. I reduced it to 5M and the message did not come up again.

mikolmogorov commented 4 years ago

@eyayd could you post the log of the run with --meta option?

eyayd commented 4 years ago

Thank you very much for your prompt reply!

I am afraid I don't have it anymore. I am re-running it now, will post it asap. I am also currently trying the 2.7.1 version.

I have another nanopore run of the same genome which has less coverage and a slightly smaller N50. Flye finds fewer overlaps and runs with no error. I am posting the log of that sample, in case it is helpful. N6_G344.pdf

eyayd commented 4 years ago

@eyayd could you post the log of the run with --meta option?

P40.pdf

mikolmogorov commented 4 years ago

@eyayd Something might not be right with your sample. Your expected genome size is 100m, and the coverage should be roughly 50x. But based on overlaps, the coverage is 600x - so this does not add up. No disjointigs were assembled, which means that even though there was sufficient coverage, there were no reads that could be joined into contiguous fragments. For example, this is what you might see from amplicon sequencing or PCR-based selection.

If you could share more details about your sample, I might have more insights.

eyayd commented 4 years ago

@fenderglass I am using uncorrected nanopore data; the library was prepped with a ligation kit, no PCR amplification. It is a previously sequenced haploid genome of 110 Mb, with 17 chromosomes. I have sequenced more than one sample of the same organism, and the 4 earlier samples were assembled with Flye with no problem. The difference, as far as I can tell, between the earlier samples and this one is that the starting DNA sample had higher molecular weight (longer pieces), and I have more nanopore data and higher N50/N90 values. I hope this info is helpful, I am happy to share more if you have specific questions.

eyayd commented 4 years ago

@fenderglass Hello, after your comment, I mapped the reads on the reference genome and indeed there was something wrong with my samples, contamination. I used only the reads that mapped to the reference genome and ran a Flye assembly. It finished with no error. Thank you very much for your time and comments.

mikolmogorov commented 4 years ago

@eyayd glad that it worked, thanks!

ChristopherRichie commented 3 years ago

I was getting this "No Disjointigs were assembled" message on reads from small plasmids (~6k). Adding these arguments seemed to work to get a reasonable first draft: -m 1000 --genome-size 6k --asm-coverage 50.

Thanks!

pbuendia commented 3 years ago

I am also getting "ERROR: No disjointigs were assembled". My reads are from RNA-Seq with MinION. I removed human reads by classifying with kraken first. I assumed --meta will assemble the reads into different genomes but it seems to assemble all into one. How do I achieve my objective? Is there another program I have to use first? The bacterial genome sizes are all over the place and there are plasmid reads too. My call was
flye --meta --nano-raw barcode02_nanofilt.fastq --out-dir out_SISPA2-b2 --threads 16

mikolmogorov commented 3 years ago

@pbuendia could you provide more information about the dataset? How much read sequence are you using to assemble? Is this human or bacterial RNA? metaFlye was not designed for RNA assembly, and there could definitely be some additional challenges (e.g. alternative splicing in case of human). Your command line is correct though - so if the assembly is not produced it is possible that there is nothing to assemble (for example, if the coverage is not sufficient).

pbuendia commented 3 years ago

@fenderglass : My initial run had only 100k reads so definitely coverage was not sufficient. I retried it now with a 3.5 million reads data set. I got Final assembly: assembly.fasta with a sequence of length 142. Not what I expected. The draft_assembly.fasta has one disjointig with 7460 bp. Can Flye assemble microbial genomes if the reads are from a microbiome containing thousands of species? Or does Flye only work for reads from a few species? To be clear I used Kraken2 to classify the reads, then removed the reads that were classified as homo sapiens. I can attach the log if needed.

mikolmogorov commented 3 years ago

@pbuendia Flye can assemble many species from a metagenome if there is sufficient read coverage (e.g. 10x+ per genome at minimum). It is likely that in your case you don't have sufficient coverage, which is not uncommon for large metagenomes.

If you can give more details about your dataset (e.g. see my previous questions) - I might have more insights. You mentioned RNA-seq previously, is it a different dataset now?

pbuendia commented 3 years ago

@fenderglass : Yes, of course, that explains it! Coverage is still insufficient. Thanks! And it is a different dataset, from RNA-Seq of 1 sample. Also, the reads are shorter than the usual nanopore length (average 450bp). If it requires 10x coverage, 3.5 million reads is obviously not enough. There are for example 464,076 reads classified as Rothia mucilaginosa whose genome is of length 2.26-Mb.

stanislasmorand commented 3 years ago

Dear Mikhail & Flye users, I confirm that adding --meta in the command line has solved my previous "No disjointigs were assembled" message (obtained with 10Gbp through 1.577M nanopore reads from a Micrococcus strain gDNA).
flye --nano-raw ../nanopore.fastq.gz --meta --genome-size 2.5m --threads 20 --out-dir ./
Cheers, Stan

vappiah commented 3 years ago

Hello all, I have a similar issue. The flye assembly worked on the original data, but after running the reads through kraken and removing some sequences I got this error. The message below was also displayed:
please check if the read type and genome size parameters are correct

Adding the --meta flag did not solve the problem. Please advise. Thank you

Here is the full command
flye --nano-raw ZP45.cleaned.fa --meta -o assemblydir/ZP45 --plasmids -g 5.7m -t 15 -i 10

fiddlinwill commented 3 years ago

Hi, similar problem.
I'm practicing assembly on a previously published pacbio dataset: SRX3461807, Pacbio RS2 (https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=421950)

I have attempted the following scripts, among others based on suggestions in this chat:
--pacbio-raw bmcatpac.fq --out-dir flye5 -g 80m --asm-coverage 10 --threads 16
--pacbio-raw bmcatpac.fq --out-dir flye4 -g 80m --asm-coverage 50 --threads 16
--pacbio-raw bmcatpac.fq --out-dir flye3 --meta --threads 16

Each time I get something like the below error:

[2021-01-27 22:16:27] DEBUG: Estimating overlap coverage
[2021-01-27 22:18:44] INFO: Overlap-based coverage: 0
[2021-01-27 22:18:44] INFO: Median overlap divergence: 0.264542
[2021-01-27 22:18:44] DEBUG: Sequence divergence distribution:

|                                                *                        |                          
|                                                *                        |                          
|                                                *                        |                          
|                                                *                        |                          
|                                          ***   *                 *      |                          
|                                          ****  **          *     *      |                          
|                                        * ****  **  *       *     *      |                          
|                                        * ***** ** **      **    ***     |                          
|                                      * * ******** **   *  ***   *** **  |                          
|                                      * * ******** **   *  ***   *** **  |                          
|                                      * * ***********   ** *** ***** ** *|                          
|                                      * * ************* ** *** ***** ** *|                          
|                                      * * ************* ** *** ***** ** *|                          
|                                    *** *************** ** *** ***** ** *| *                        
|                                    ******************* ****** ***** ** *| **                       
|                                    ******************* ****** ******** *|***                       
|                                  * ******************************************                      
|                               ** * ******************************************                      
|                         *     **** ******************************************                      
|                         *     **** ******************************************                      
----------------------------------------------------------------------------------------------------
0%        5%        10%       15%       20%       25%       30%       35%       40%       45%       

Q25 = 0.23, Q50 = 0.26, Q75 = 0.32

[2021-01-27 22:35:34] INFO: Assembled 0 disjointigs
[2021-01-27 22:35:38] INFO: Generating sequence
[2021-01-27 22:35:38] DEBUG: Writing FASTA
[2021-01-27 22:35:42] DEBUG: Peak RAM usage: 40 Gb
-----------End assembly log------------
[2021-01-27 22:35:43] root: ERROR: No disjointigs were assembled - please check if the read type and genome size parameters are correct

My hpc cluster does not have your pbclip installed yet. Would pbclip be the next step (as you have mentioned above)? I'm fairly new to command line and so I'm not sure where to direct our managers for an install on the cluster. It's not part of the flye package, is it?

Thank you

Will

mikolmogorov commented 3 years ago

@vappiah if Flye worked on the data before filtration, it is very likely that the remaining reads after filtration don't have sufficient coverage for assembly.

mikolmogorov commented 3 years ago

@fiddlinwill if you downloaded this data using fastq-dump, it is likely that it indeed has the issue of subreads not being separated. This was a common issue for many old (2 years+) PacBio submissions to NCBI. The best way to process would be to download the original h5 PacBio files and extract reads manually using DEXTRACTOR (https://github.com/thegenemyers/DEXTRACTOR).

Alternatively, pbclip might help too - but it has not been extensively tested. It is not a part of the Flye package and needs to be compiled from source. Please post an issue at the pbclip repository if you have any issues with installation.

fiddlinwill commented 3 years ago

Thanks for the suggestion! Will try this today

caballero commented 3 years ago

Hi, I am also having this issue with a public metagenomic dataset (in particular this one: https://www.ebi.ac.uk/ena/browser/view/SRR10431755), I am using Flye 2.8.3, with --meta and --plasmids params activated, the error log is:

[2021-04-13 18:51:21] INFO: Starting Flye 2.8.3-b1695
[2021-04-13 18:51:21] INFO: >>>STAGE: configure
[2021-04-13 18:51:21] INFO: Configuring run
[2021-04-13 18:51:22] INFO: Total read length: 3632951
[2021-04-13 18:51:22] INFO: Reads N50/N90: 1009 / 312
[2021-04-13 18:51:22] INFO: Minimum overlap set to 1000
[2021-04-13 18:51:22] INFO: >>>STAGE: assembly
[2021-04-13 18:51:22] INFO: Assembling disjointigs
[2021-04-13 18:51:22] INFO: Reading sequences
[2021-04-13 18:51:28] INFO: Counting k-mers:
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2021-04-13 18:52:30] INFO: Filling index table (1/2)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2021-04-13 18:52:30] INFO: Filling index table (2/2)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2021-04-13 18:52:32] INFO: Extending reads
[2021-04-13 18:52:32] INFO: Overlap-based coverage: 1
[2021-04-13 18:52:32] INFO: Median overlap divergence: 0.128596
0% 100% 
[2021-04-13 18:52:32] INFO: Assembled 0 disjointigs
[2021-04-13 18:52:32] INFO: Generating sequence
[2021-04-13 18:52:32] ERROR: No disjointigs were assembled - please check if the read type and genome size parameters are correct
[2021-04-13 18:52:32] ERROR: Pipeline aborted

I checked the reads with pbclip (even when it is MinION):

./pbclip SRR10431755_1.fastq.gz > clip.fasta
Sequences loaded
Good: 82026 chopped: 374 bad: 12

mikolmogorov commented 3 years ago

@caballero not enough coverage to assemble. metaFlye needs at least 5-10x coverage for a bacterial genome to assemble. The total size of your reads is ~3.6Mb, so even if there was a single bacterium, it would only have ~1x read coverage. Read length is also very short.
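
As a rough illustration of the gap, the sketch below compares the "Total read length" from the log with what 5x and 10x would require. The 3.6 Mb genome size is a hypothetical stand-in for a single bacterial genome (an assumption made for this example, not something stated about this dataset), and `bases_needed` is an illustrative helper.

```python
total_read_bases = 3_632_951     # "Total read length" from the log above
assumed_genome_size = 3_600_000  # hypothetical single bacterial genome (assumption)


def bases_needed(genome_size, target_coverage):
    """Total read bases required to reach target_coverage on one genome."""
    return genome_size * target_coverage


coverage = total_read_bases / assumed_genome_size
print(f"current coverage: ~{coverage:.1f}x")
for target in (5, 10):
    needed = bases_needed(assumed_genome_size, target)
    print(f"{target}x needs {needed} bp, "
          f"about {needed / total_read_bases:.1f} times the current data")
```

In a real metagenome the reads are spread over many genomes, so the per-genome coverage would be lower still.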

caballero commented 3 years ago

@fenderglass thanks for checking

326reborn commented 3 years ago

Hi, I met a similar problem. At first, I assembled a metagenome using HiFi reads (ccs) with the command:
flye --pacbio-hifi HiFi.fa -t 10 --meta --plasmids -o metaflye_result
The flye.log (flye1.log) showed that the overlap-based coverage is 0, and I got the same problem "No disjointigs were assembled". Then, reads were extracted by mapping to the reference genome (minimap2). Reads whose mapping identity was lower than 97% were filtered out. I assembled two species with the following commands:
command1: flye --pacbio-hifi HiFi.fa -t 10 --meta --plasmids -o metaflye_result
command2: flye --pacbio-hifi HiFi.fa -t 10 -g 100m --asm-coverage 40 -o metaflye_result2
Interestingly, species 1 was successfully assembled with 3.2G of HiFi reads, while species 2 gave the same problem "No disjointigs were assembled" with 6.6G of HiFi reads. flye_fail2_specie2.log flye_fail1_specie2.log
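
The identity filter described here might look something like the sketch below if the minimap2 alignments are in PAF format. The comment does not say how identity was computed, so this assumes one common definition: residue matches (PAF column 10) divided by alignment block length (column 11). The function names and the two example records are made up for illustration.

```python
def paf_identity(line):
    """Return (read_name, identity) for one PAF record, where identity is
    residue matches (column 10) over alignment block length (column 11)."""
    fields = line.rstrip("\n").split("\t")
    matches, block_len = int(fields[9]), int(fields[10])
    return fields[0], matches / block_len


def reads_passing(paf_lines, min_identity=0.97):
    """Names of reads with at least one alignment at or above min_identity."""
    keep = set()
    for line in paf_lines:
        if not line.strip():
            continue
        name, identity = paf_identity(line)
        if identity >= min_identity:
            keep.add(name)
    return keep


# Two made-up PAF records: read1 at 99% identity, read2 at 90%.
example = [
    "read1\t1000\t0\t1000\t+\tchr1\t50000\t100\t1100\t990\t1000\t60",
    "read2\t1000\t0\t1000\t-\tchr1\t50000\t5000\t6000\t900\t1000\t60",
]
print(sorted(reads_passing(example)))
```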

mikolmogorov commented 2 years ago

@326reborn sorry for the late response! I suggest trying the latest code from github (you will need to compile it and run locally). It contains some updates that may have fixed the issue.