Open devonorourke opened 6 years ago
Two guesses:
I'm happy to dig deeper myself. Assuming the data isn't too sensitive, you could send me some relevant files (the overlaps_removed gfa file from the short-read-only assembly and the final long-read-only assembly gfa) and I'll take a look.
Ryan
Hi Ryan,
I've tried a few new things with this one weird sample and have one simple conclusion: I think I had too much Nanopore data.
original.fastq
file had:
original.fastq
to include only reads > 10,000 bp (min10kb.fastq
), but there was still over 80,000 reads. I assembled that genome and it was sitting in digital purgatory at that same stage as before, and ultimately after 2.5 days the compute cluster quit the job as it ran out of memory (~128 Gb RAM). original.fastq
to include only reads > 50,000 bp (min50kb.fastq
), and this time there was a substantial drop in read number (but still ~ 3x coverage of the genome):
The min50kb.fastq
didn't stall - it whipped through the entire assembly without issue. However because there were so few total reads the biggest of the two chromosomes wasn't closed up (4 segments).
I performed one other filtering strategy as a sort of middle-ground by keeping only reads > 25 kb (min25kb.fastq
), and that hybrid assembly finished about as quick as any of the other assemblies (under 2 hours using -t 24
. In that fastq file there were:
This all is making me wonder something about the message you post in each .log
file about the necessary throughput during the miniasm
stage of the hybrid assembly:
Assembling contigs and long reads with miniasm (2018-07-01 11:26:18)
--------------------------------------------------------------------
Unicycler uses miniasm to construct a string graph assembly using both the short read contigs and the long reads. It will then use the resulting string graph to pr
oduce bridges between contigs. This method requires decent coverage of long reads and therefore may not be fruitful if long reads are sparse. However, it does not rely
on the short read assembly graph having good connectivity and is able to bridge an assembly graph even when it contains many dead ends.
I would think that for speed of assembly sake, the fewer reads the better, with the tradeoff being that you need some long reads to make bridging effective. But given your message above there is probably also a minimum depth of coverage that we'd like to target. Perhaps this is explicitly mentioned somewhere in the user guide and I'm just missing it - what value should we be shooting for?
In this rare case where I have tons and tons of reads, where does the fulcrum rest in the balancing of speed efficiency and bridging success? I can filter out most reads and keep only (few) ultra long reads; I can filter out small and long reads and keep only moderate reads a a higher throughput; I can wait for days to get a super high memory node on my compute cluster and just use all of them (though that option seems silly to me).
Thanks for your help!
Unicycler builds bridges with long reads in two ways:
--no_miniasm
For some rough guesses, I'd say at least 20x long read depth of coverage or more is ideal, and you may continue to see gains up to about 50-100x depth. But if your Illumina reads are really good, then you can get away with less because Unicycler can rely more heavily on option 2. I've seen assemblies finish with as little as 4x depth, but that takes a fair bit of luck. If I have more than 100x depth, I subsample for length and quality (I use Filtlong, but I'm biased on the matter).
Your read files are large, but not insanely large. I wouldn't expect Unicycler to choke on them the way your describe, and it definitely shouldn't eat up 128 GB of RAM! So I still suspect something is going wrong.
I'd suggest trying a long-read-only assembly, possibly with Canu. Then compare it to your min25kb.fastq hybrid assembly. If they are in general agreement, then you can be confident that Unicycler didn't totally botch it. But if they disagree significantly, more investigation is warranted. In particular, I'd fear that your Illumina reads and Nanopore reads are of slightly different genomes, and that's what threw off Unicycler.
Ryan
Thanks for the detailed response Ryan.
I've assembled this sample (and others) with:
Is there a specific metric (or ten) that you would advise investigating? My tasks this morning are to:
samtools flagstat
to investigate some items like properly paired reads Happy to share any info you think might be of interest.
Yes, all of those things you said could be informative. I would also try aligning the assemblies to each other: Mauve is good if you want a GUI tool, and MUMmer is a good CLI option. When the assemblies agree with each other, that gives me confidence that they are correct. If they disagree, then more investigation is warranted. It's especially worth looking closely at the specific places where they disagree.
Hi Ryan, Thanks to all your contributions to the microbial genomics community! Hopefully, you can also help me figure a weird problem I'm having that's related to this issue. I'm building a snakemake pipeline that uses unicycler as the assembler. I too am running into a similar issue where it seems to repeatedly get stuck on Racon bridges step. I don't think my computer is frozen since htop is active but unicycler log has yet to update in several hours. I don't think it's anything with the snakemake syntax as some samples have ran without any problems, which leaves me a little stuck on where it might be getting stuck. The dataset I'm building it off of is actually from one of your papers (https://github.com/rrwick/Bacterial-genome-assemblies-with-multiplex-MinION-sequencing). Is there anything I should look into in particular?
Hey Ryan, I am having a similar issue with a Snakemake pipeline that aims to assembly mitochondrial genomes using long and short reads, except my Unicycler run is freezing on the "Building long read bridges" step. I will try to filter down the long reads to reduce coverage (which sits at about 3800x; while short reads sit at about 2500x). Perhaps there is just too much data? Thanks, Aaron :)
Hi Ryan, Things continue to work exceptionally well with Unicycler - thank you for such an easy yet powerful program. I'm getting hung up trying to complete one last sample (which is both pretty good luck yet incredibly frustrating to be so close to the finish!). I'm wondering if you've ever seen this phenomenon before?
hybrid
assembly on a single sample (Illumina + Nanopore reads) which fully completes spades assembly (001_best_spades_graph.gfa
), as well as the overlapping (002_overlaps_removed.gfa
) and long read assembly (003_long_read_assembly.gfa
) sections.Creating miniasm/Racon bridges
- but gets no where. No new information is written into the log file, and no newer files are ever created (ie. no004_final_clean.gfa
or005
orassembly.gfa/fasta
).Here's the really strange part. This only happens for one specific sample.
Here's the tail of the log file for the hybrid assembly, which I've reproduced a number of times now and always looks exactly the same:
Thanks for any advice you can offer.