Open jelber2 opened 1 month ago
I agree with your diagnosis that insufficient coverage is the main culprit here. Read accuracy may also be a factor, but it is hard to tell given the low coverage and without knowing more about the reads. Did you have a chance to map your reads to the reference haplotypes to get an idea of their accuracy?
Regarding coverage, this assembly is using 72 Gb of coverage, which corresponds to 23x for a nominal genome size of 3.1 Gb. The two assemblies described in the presentation that accompanies Shasta 0.12.0 used coverage 38x and 58x. We have not yet had the time to study how low in coverage you can go and still get a useful assembly.
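For anyone who wants to check the arithmetic, fold coverage is just total sequenced bases divided by genome size. A minimal sketch, assuming the nominal 3.1 Gb genome size quoted above (the function name is made up for illustration and is not part of Shasta):

```python
def coverage(total_bases: float, genome_size: float = 3.1e9) -> float:
    """Return fold coverage: sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# 72 Gb of reads against a nominal 3.1 Gb genome.
print(f"{coverage(72e9):.1f}x")  # prints 23.2x
```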
In addition, the reads we used, which included the original ONT reads from their December 2023 data release plus some reads with similar characteristics generated at U. C. Santa Cruz, had a read N50 around 100 Kb after discarding the ones shorter than 10 Kb. That is much longer than the 21 Kb read N50 reported in the log you attached. Do you know why your sequencing pipeline is generating reads so short? The "Ultra-Long" (UL) aspect of the ONT December 2023 release is an important one.
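In case it helps when comparing read sets, here is a rough sketch of how a read N50 after a length cutoff can be computed from a list of read lengths. The function name and the `min_length` parameter are assumptions for illustration, not Shasta options, and Shasta's own log may define things slightly differently:

```python
def read_n50(lengths, min_length=0):
    """N50 of the reads that are at least min_length long.

    The N50 is the length L such that reads of length >= L contain
    at least half of the retained bases.
    """
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    half = sum(kept) / 2
    running = 0
    for n in kept:
        running += n
        if running >= half:
            return n
    return 0  # no reads survived the cutoff


# Toy lengths in bases; real input would come from the reads themselves.
toy = [100_000, 60_000, 30_000, 8_000, 5_000]
print(read_n50(toy, min_length=10_000))
```

Passing `min_length=10_000` mimics discarding reads shorter than 10 Kb before computing the N50.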
As a result of low coverage and short reads, the read graph only includes 5.6 million alignments for 4.7 million reads, a ratio barely above 1. In a healthy assembly, that ratio should be at least 4 or preferably 6. With such a small number of alignments, the read graph is extremely fragmented and essentially no assembly is taking place.
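To make that check concrete, here is a small sketch that computes the ratio and compares it with the rule of thumb above (at least 4, preferably 6). The function names and the labels are hypothetical; the inputs are the two counts quoted from the log:

```python
def alignments_per_read(n_alignments: int, n_reads: int) -> float:
    """Average number of read graph alignments per read."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return n_alignments / n_reads


def read_graph_health(ratio: float, minimum: float = 4.0, preferred: float = 6.0) -> str:
    """Label a ratio using the rule-of-thumb thresholds from this thread."""
    if ratio >= preferred:
        return "good"
    if ratio >= minimum:
        return "acceptable"
    return "too few alignments"


ratio = alignments_per_read(5_600_000, 4_700_000)
print(f"{ratio:.2f}", read_graph_health(ratio))
```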
Even under these conditions, the assembly should not assert, and I will look at fixing that. But that will simply result in a more graceful termination, still without a useful assembly.
I don't think switching to the default 10 Kb read length cutoff will help you, because based on the log you attached only 283 Mb of coverage in reads shorter than 4 Kb were discarded. So increasing the read length cutoff from 4 Kb to 10 Kb will not add much coverage.
If you can confirm by mapping that the accuracy of your reads is comparable to the accuracy of the ONT December 2023 release, then if you are able to double the amount of available coverage you would have similar coverage to our 38x assembly, and it would be interesting to see if you can still get useful assembly results despite the shorter reads.
Thank you for reporting this. We decided to release Shasta 0.12.0 even though the code has not completely stabilized, precisely to encourage the kind of experimentation you are doing.
The "bug" label refers to the assertion, which should not happen even under insufficient coverage conditions.
One more comment. We also experimented with a variety of error correction schemes, and we found that all the ones we tried resulted in lower quality of the resulting Shasta assembly. Our working hypothesis is that the error correction schemes are not sufficiently haplotype aware and therefore end up confusing the phasing process, rather than helping it. Assuming your read accuracy is comparable to the one in the ONT data release, I recommend that you use uncorrected reads. That assembly configuration and the computational methods used by Shasta are perfectly capable of dealing with the error rates present in reads of that accuracy.
Wow, thank you very much for your fast and very thorough analysis. Based on aligning these reads to both haplotypes of hg002v1.0.1.fasta (yes, I suppose it is possible for split alignments of one read to hit both haplotypes), I got the following alignment identities from BEST (https://github.com/google/best):
The reads were just one of our higher outputs from the Ligation Sequencing kit runs on a single PromethION flow cell (not using the Ultra-long sequencing ligation kit for library prep). The reads I used for Shasta are the bottom reads in the table.
| reads | total_alns | primary_alns | identity | identity_qv | gap_compressed_identity | matches_per_kbp | mismatches_per_kbp | non_hp_ins_per_kbp | non_hp_del_per_kbp | hp_ins_per_kbp | hp_del_per_kbp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Illumina2x250bp HG002 | 6805132 | 6751494 | 0.994539 | 22.627025 | 0.994579 | 994.562660 | 5.376677 | 0.016667 | 0.029777 | 0.007443 | 0.030885 |
| our_ONT_HG002_SUP_Herro | 10116920 | 4863378 | 0.999642 | 34.455572 | 0.999765 | 999.730783 | 0.049120 | 0.024981 | 0.049600 | 0.064296 | 0.170496 |
| our_ONT_SUP_Herro_br_pg_asm2x | 9874513 | 4780152 | 0.999748 | 35.989081 | 0.999845 | 999.808384 | 0.025497 | 0.015097 | 0.028875 | 0.045124 | 0.137243 |
| ONT_SUP_Herro_br_pg_asm_dechat | 4786342 | 4780152 | 0.999758 | 36.153250 | 0.999852 | 999.816397 | 0.022937 | 0.014384 | 0.026442 | 0.044507 | 0.134224 |
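As a sanity check on the table, the identity_qv values appear consistent with the usual Phred-style conversion QV = -10 * log10(1 - identity). I have not checked how BEST computes the column internally, so treat this as a sketch; small differences are expected because the identity column is rounded to six decimals:

```python
import math


def identity_to_qv(identity: float) -> float:
    """Convert a fractional alignment identity to a Phred-scaled QV."""
    if not 0.0 <= identity < 1.0:
        raise ValueError("identity must be in [0, 1)")
    return -10.0 * math.log10(1.0 - identity)


# (identity, reported identity_qv) pairs copied from the table above.
rows = [
    (0.994539, 22.627025),
    (0.999642, 34.455572),
    (0.999748, 35.989081),
    (0.999758, 36.153250),
]
for identity, reported in rows:
    print(f"{identity:.6f}  computed={identity_to_qv(identity):.2f}  reported={reported:.2f}")
```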
That is a very nice overall accuracy improvement, and confirms that the fragmentation in the read graph is caused by low coverage. But those numbers by themselves don't tell us how faithful to haplotypes the error correction process is. I say this because in our experimentation with various error correction schemes we found that all of them were resulting in a worse assembly, presumably because of the process being not sufficiently haplotype aware.
Shasta uses computational techniques designed for the low accuracy reads of a few years ago, and so it is perfectly capable of dealing with errors in the reads. If error correction is used, any errors during that phase are baked in, cannot be recovered during assembly, and can confuse the phasing process. That is why I recommend using uncorrected reads as input to Shasta.
We hope to generate, in a future release, an assembly configuration for Mode 3 assembly designed for regular ONT R10 reads.
In any event, I suggest generating more coverage if possible, and then trying again.
Sorry for the delay in response. We might be able to generate more coverage (or already have more reads at a slightly longer N50), but this might take some time. I really appreciate your thoughtful and quick responses.
I am reopening this issue only because I have a method for estimating read level phase-switching that (if really properly working) seems to suggest there may be very low phase-switching at the read level for Nanopore r10.4.1 SUP reads corrected with Herro (I have not yet tested the further correction methods of brutal rewrite, peregrine-2021, and dechat - these are the reads that I reported in this thread). https://gist.github.com/jelber2/45225fa0937b59f72bbdddf8d694eb70 If you have a chance, I would appreciate your thoughts or if you would like more explanations of the workflow. The first part of the gist is the analysis script and then later a diagnostic histogram.
Since the recent ONT announcement we have been experimenting with the R10.4.1 reads they released, both in raw and error corrected versions (with Herro). We have indeed seen preliminary indications that error correction does improve things and appears to be more haplotype aware than we have seen in the past. If this is confirmed, it is likely that we will release very soon a new Shasta version with two new assembly configurations for R10.4.1 reads, one for raw reads, and one for reads corrected with Herro (as incorporated by the ONT basecaller/workflow process).
Please don't close this issue. I added the "Bug" label because of the assertion you encountered in your assembly. That needs to be fixed. At the very least the code should end with a proper error message.
I am just reporting that after adding more sequences to the original reads reported previously (the newly added reads have an N50 of 25 Kbp, though), there were:
Total number of raw bases is 137407791376.
~ 44x coverage
and there was an assembly produced, so I am presuming it was as you suggest, a coverage issue, that 23x coverage was not enough with that preset.
Thank you for reporting this.
At this point we are focusing on the new Nanopore r10.4.1 SUP reads, also error corrected with Herro, which as an added benefit are also much longer. We are working on optimizing the assembly configuration for those reads.
@jelber2 I always aim for 40X+ coverage for good Shasta assemblies with N50 > 10 Mbp. That said, I do not monitor input coverage that closely, but that is what I expect our providers to deliver. With 40X+ coverage I continually get very good assemblies across a range of plants.
We are starting to look at Herro/Dorado correct and this looks very positive so far, especially as coverage seems to not be adversely impacted.
Yes, I agree with 40X+ from what I have seen so far with Shasta. But with these latest reads with much higher quality it might be possible to lower coverage and still get reasonably good assemblies. We have not looked at this yet, but at some point we plan to experiment with downsampling and understand how much assembly quality degrades at low coverage, and to what extent you can get away with low coverage.
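For what it is worth, a downsampling experiment like that could start from something as simple as the sketch below, which picks reads at random until a target coverage is reached. This is only an illustration under stated assumptions (the function name, the 3.1 Gb default genome size, and the fixed seed are all mine), not how Shasta or any particular tool does it:

```python
import random


def downsample_to_coverage(read_lengths, target_coverage, genome_size=3.1e9, seed=0):
    """Randomly pick reads until their total length reaches the target coverage.

    Returns (indices of the chosen reads, total bases chosen). If the input
    holds less than the target, every read is returned.
    """
    target_bases = target_coverage * genome_size
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)  # fixed seed keeps the subset reproducible
    chosen, total = [], 0
    for i in order:
        if total >= target_bases:
            break
        chosen.append(i)
        total += read_lengths[i]
    return chosen, total


# Toy example: 1000 reads of 20 kb against a 1 Mb "genome", downsampled to 10x.
indices, bases = downsample_to_coverage([20_000] * 1000, 10, genome_size=1e6)
print(len(indices), bases)
```

Because reads are picked uniformly at random, the read length distribution of the subset should stay roughly the same as in the full set, so only coverage changes between runs.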
Hi, I am trying some presumably accurate Nanopore reads that I have error-corrected (see https://gist.github.com/jelber2/f22c24442c34f872d8ebf073ad721476?permalink_comment_id=5058115#gistcomment-5058115) with the following command
I ended up getting the following
It looks like a straightforward error: the bubble chain length is 0, and that throws an error later.
I will report back after using the default Nanopore-ncm23-May2024 settings. The reads are from HG002, and perhaps (1) they are not accurate enough for this configuration, and/or (2) coverage is too low.
**EDIT you can get the reads here
**EDIT 2 Using default settings also resulted in