paoloshasta / shasta

De novo assembly from Oxford Nanopore reads.
https://paoloshasta.github.io/shasta/

Total non-trivial bubble chain length for this component 0, N50 0 #26

Open jelber2 opened 1 month ago

jelber2 commented 1 month ago

Hi, I am trying to assemble some presumably accurate Nanopore reads that I have error-corrected (see https://gist.github.com/jelber2/f22c24442c34f872d8ebf073ad721476?permalink_comment_id=5058115#gistcomment-5058115) with the following command:

/msc/home/jelber43/git/shasta-Linux-0.12.0 --config Nanopore-ncm23-May2024 \
--threads 238 --input WGS_HG002_EZ1_10kb_SUP_herro_brk19_pg_asm2x_dechat_recorrected.fasta \
--assemblyDirectory WGS_HG002_EZ1_10kb_SUP_herro_brk19_pg_asm2x_dechat_recorrected_Shasta --Reads.minReadLength 4000

I ended up getting the following output:

head -n 100 stdout.log

2024-May-16 11:47:55.305366 Assembly begins.
Command line:
/msc/home/jelber43/git/shasta-Linux-0.12.0 --config Nanopore-ncm23-May2024 --threads 238 --input WGS_HG002_EZ1_10kb_SUP_herro_brk19_pg_asm2x_dechat_recorrected.fasta --assemblyDirectory WGS_HG002_EZ1_10kb_SUP_herro_brk19_pg_asm2x_dechat_recorrected_Shasta --Reads.minReadLength 4000 
For options in use for this assembly, see shasta.conf in the assembly directory.
This run uses options "--memoryBacking 4K --memoryMode anonymous".
This could result in longer run time.
For faster assembly, use "--memoryBacking 2M --memoryMode filesystem"
(root privilege via sudo required).
Therefore the results of this run should not be used
for the purpose of benchmarking assembly time.
However the memory options don't affect assembly results in any way.
This assembly will use 238 threads.
Discarded read statistics for file /msc/home/jelber43/WGS_HG002_EZ1_10kb_SUP_NEW/WGS_HG002_EZ1_10kb_SUP_herro_brk19_pg_asm2x_dechat_recorrected.fasta:
    Discarded 0 reads containing invalid bases for a total 0 valid bases.
    Discarded 89321 reads shorter than 4000 bases for a total 286615717 bases.
    Discarded 0 reads containing repeat counts 256 or more for a total 0 bases.
Discarded read statistics for all input files:
    Discarded 0 reads containing invalid bases for a total 0 valid bases.
    Discarded 89321 short reads for a total 286615717 bases.
    Discarded 0 reads containing repeat counts 256 or more for a total 0 bases.
Read statistics for reads that will be used in this assembly:
    Total number of reads is 4690831.
    Total number of raw bases is 71567931555.
    Average read length is 15257 bases.
    N50 for read length is 21135 bases.
Found 0 reads with duplicate names.
Discarded from the assembly 0 reads with duplicate names.
Flagged 1676 reads as palindromic out of 4690831 total.
Palindromic fraction is 0.000357293
log2MinHashBucketCount reduced from 32 to maximum allowed value 31.
LowHash0 algorithm will use 2^31 = 2147483648 buckets. 
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 0: high frequency 62553813, total 104575961, capacity 104575961.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 1: high frequency 88779347, total 110615878, capacity 150207283.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 2: high frequency 96839779, total 113386678, capacity 162601362.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 3: high frequency 100713715, total 115258492, capacity 168919003.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 4: high frequency 103092675, total 116741184, capacity 172976286.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 5: high frequency 104739231, total 117961515, capacity 175883439.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 6: high frequency 105959554, total 119007199, capacity 178122905.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 7: high frequency 106917466, total 119958801, capacity 179981738.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 8: high frequency 107695936, total 120851283, capacity 181617776.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 9: high frequency 108344020, total 121663600, capacity 182995065.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 10: high frequency 108892502, total 122394279, capacity 184204195.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 11: high frequency 109372843, total 123120732, capacity 185335166.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 12: high frequency 109786601, total 123787250, capacity 186344668.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 41
Alignment candidates after lowhash iteration 13: high frequency 110192988, total 124613020, capacity 187506398.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 14: high frequency 110525978, total 125213304, capacity 188378446.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 15: high frequency 110837214, total 125800029, capacity 189194914.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 16: high frequency 111114486, total 126345749, capacity 189958480.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 17: high frequency 111370286, total 126872670, capacity 190686570.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 18: high frequency 111611125, total 127360904, capacity 191361142.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 19: high frequency 111833260, total 127853128, capacity 192015274.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 20: high frequency 112045361, total 128304633, capacity 192628918.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 21: high frequency 112244510, total 128754336, capacity 193224569.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 22: high frequency 112435734, total 129183788, capacity 193778168.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 23: high frequency 112616655, total 129603746, capacity 194327140.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 24: high frequency 112792808, total 130003606, capacity 194854017.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 25: high frequency 112959916, total 130393674, capacity 195363672.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 26: high frequency 113117917, total 130778400, capacity 195849107.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 27: high frequency 113271456, total 131145127, capacity 196312799.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 28: high frequency 113426625, total 131508354, capacity 196778190.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 29: high frequency 113572990, total 131861843, capacity 197221253.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 30: high frequency 113717100, total 132208425, capacity 197649053.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 31: high frequency 113853003, total 132540513, capacity 198058842.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 32: high frequency 113984721, total 132867164, capacity 198469855.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 33: high frequency 114116761, total 133171730, capacity 198852418.
Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40

tail -n +130 stdout.log | head -n 25

Automatic settings for this iteration: minBucketSize 3, maxBucketSize 40
Alignment candidates after lowhash iteration 49: high frequency 116038021, total 137217948, capacity 203859089.
Found 116038021 alignment candidates.
Average number of alignment candidates per oriented read is 24.7372.
Number of alignment candidates before suppression is 116038021
Suppressed 0 alignment candidates.
Number of alignment candidates after suppression is 116038021
2024-May-16 11:58:10.974440 Computing unique markers.
2024-May-16 11:58:21.134269 Alignment computation begins.
2024-May-16 12:00:06.398015 Alignment computation completed.
Found and stored 5577379 good alignments.
Keeping 5560007 alignments of 5577379
Flagged 17208 reads as chimeric out of 4690831 total.
Chimera rate is 0.00366843
Strand separation flagged 1532 read graph edges out of 11120014 total.
Kept 3892038438 disjoint sets with coverage in the requested range.
Found 0 disjoint sets with more than one marker on a single oriented read or with less than 0 supporting oriented reads on each strand.
Found 3897976342 edges for 3892038438 vertices.
Automatically set: minPrimaryCoverage = 6, maxPrimaryCoverage = 15
Found 207408604 primary marker graph edges out of 3897976342 total.
Of 3897976342 marker graph edges, 207408604 are primary.
2024-May-16 12:31:40.345257 Assembling connected component 0 of 16019
This connected component has 1734 reads and 55875 primary marker graph edges.
The PrimaryGraph for this connected component has 55875 vertices and 58790 edges.
Assembly statistics by P-Value for component 0:

tail -n 100 stdout.log

P-value 0: total assembled length 0, N50 0
P-value 1: total assembled length 308761, N50 176378
P-value 2: total assembled length 427116, N50 113974
Combined for this component: total assembled length 735877, N50 113974
Total non-trivial bubble chain length for this component 522319, N50 522319
2024-May-16 12:35:13.624991 Assembling connected component 835 of 16019
This connected component has 187 reads and 15142 primary marker graph edges.
The PrimaryGraph for this connected component has 15142 vertices and 15260 edges.
Assembly statistics by P-Value for component 835:
P-value 0: total assembled length 537262, N50 296326
P-value 1: total assembled length 5362, N50 5362
P-value 2: total assembled length 54512, N50 27256
Combined for this component: total assembled length 597136, N50 153466
Total non-trivial bubble chain length for this component 32618, N50 32618
2024-May-16 12:35:13.793759 Assembling connected component 836 of 16019
This connected component has 187 reads and 15237 primary marker graph edges.
The PrimaryGraph for this connected component has 15237 vertices and 15354 edges.
Assembly statistics by P-Value for component 836:
P-value 0: total assembled length 0, N50 0
P-value 1: total assembled length 387510, N50 247038
P-value 2: total assembled length 218383, N50 33691
Combined for this component: total assembled length 605893, N50 93904
Total non-trivial bubble chain length for this component 496702, N50 496702
2024-May-16 12:35:14.129531 Assembling connected component 837 of 16019
This connected component has 187 reads and 18589 primary marker graph edges.
The PrimaryGraph for this connected component has 18589 vertices and 19023 edges.
Assembly statistics by P-Value for component 837:
P-value 0: total assembled length 0, N50 0
P-value 1: total assembled length 265857, N50 48943
P-value 2: total assembled length 504982, N50 61226
Combined for this component: total assembled length 770839, N50 61226
Total non-trivial bubble chain length for this component 518348, N50 518348
2024-May-16 12:35:14.279257 Assembling connected component 838 of 16019
This connected component has 187 reads and 17538 primary marker graph edges.
The PrimaryGraph for this connected component has 17538 vertices and 17902 edges.
Assembly statistics by P-Value for component 838:
P-value 0: total assembled length 73166, N50 21512
P-value 1: total assembled length 149655, N50 33629
P-value 2: total assembled length 548605, N50 156660
Combined for this component: total assembled length 771426, N50 68460
Total non-trivial bubble chain length for this component 423958, N50 423958
2024-May-16 12:35:14.443472 Assembling connected component 839 of 16019
This connected component has 187 reads and 7610 primary marker graph edges.
The PrimaryGraph for this connected component has 7610 vertices and 7688 edges.
Assembly statistics by P-Value for component 839:
P-value 0: total assembled length 665881, N50 37228
Combined for this component: total assembled length 665881, N50 37228
Total non-trivial bubble chain length for this component 0, N50 0
2024-May-16 12:35:14.611016 Assembling connected component 840 of 16019
This connected component has 187 reads and 11742 primary marker graph edges.
The PrimaryGraph for this connected component has 11742 vertices and 11941 edges.
2024-May-16 12:35:37.492812 Assertion failed: bubble.phase == 0 at void shasta::mode3::PhasingTable::greedyPhasing() in /home/runner/work/shasta/shasta/src/mode3-PhasingTable.cpp line 918

It looks like a straightforward error: the bubble chain length for this component is 0, and that then throws an error later.

I will report back after using the default Nanopore-ncm23-May2024 settings. The reads are from HG002, and perhaps (1) they are not accurate enough for this configuration, and/or (2) coverage is too low.

**EDIT:** you can get the reads here

# cram file
wget 'https://www.dropbox.com/scl/fi/d7f6gbr27atde95hdkibw/WGS_HG002_EZ1_10kb_SUP_herro_brk19_pg_asm2x_dechat_recorrected_Q30_sam1.4_SoftClip_against_hg002v1.0.1.fasta.cram?rlkey=3nv4y3b4ve1sgt1usmsnsztva&st=1rv7504y' -O WGS_HG002_EZ1_10kb_SUP_herro_brk19_pg_asm2x_dechat_recorrected_Q30_sam1.4_SoftClip_against_hg002v1.0.1.fasta.cram

# cram index
wget 'https://www.dropbox.com/scl/fi/hks0dzzn9pguo1gmgkmnz/WGS_HG002_EZ1_10kb_SUP_herro_brk19_pg_asm2x_dechat_recorrected_Q30_sam1.4_SoftClip_against_hg002v1.0.1.fasta.cram.crai?rlkey=xy15lumub7399gyxynvj8h3qy&st=pjxujgmh' -O WGS_HG002_EZ1_10kb_SUP_herro_brk19_pg_asm2x_dechat_recorrected_Q30_sam1.4_SoftClip_against_hg002v1.0.1.fasta.cram.crai

# reference is hg002v1.0.1.fasta
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/HG002/assemblies/hg002v1.0.1.fasta.gz
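
If you want the plain FASTA back out of the CRAM, something like the following should work (a sketch assuming samtools is available; the output file name is just a placeholder):

# decompress the reference, then extract the read sequences from the CRAM
gunzip -k hg002v1.0.1.fasta.gz
samtools fasta --reference hg002v1.0.1.fasta \
    WGS_HG002_EZ1_10kb_SUP_herro_brk19_pg_asm2x_dechat_recorrected_Q30_sam1.4_SoftClip_against_hg002v1.0.1.fasta.cram \
    > reads.fasta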

**EDIT 2:** Using default settings also resulted in

2024-May-16 12:35:37.492812 Assertion failed: bubble.phase == 0 at void shasta::mode3::PhasingTable::greedyPhasing() in /home/runner/work/shasta/shasta/src/mode3-PhasingTable.cpp line 918
paoloshasta commented 1 month ago

I agree with your diagnosis that insufficient coverage is the main culprit here. Read accuracy may also be a factor, but it is hard to tell given the low coverage and without knowing more about the reads. Did you have a chance to map your reads to the reference haplotypes to get an idea of their accuracy?
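
For reference, a rough mapping-based check along these lines would be enough (file names and thread counts are placeholders, and the exact BEST invocation is an assumption, so check each tool's documentation):

# map the reads to the diploid reference, sort, and index
minimap2 -ax map-ont -t 32 hg002v1.0.1.fasta reads.fasta | samtools sort -@ 8 -o reads_vs_hg002v1.0.1.bam
samtools index reads_vs_hg002v1.0.1.bam
# summarize per-read identity / QV with BEST (https://github.com/google/best)
best reads_vs_hg002v1.0.1.bam hg002v1.0.1.fasta best_stats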

Regarding coverage, this assembly is using 72 Gb of coverage, which corresponds to 23x for a nominal genome size of 3.1 Gb. The two assemblies described in the presentation that accompanies Shasta 0.12.0 used coverage 38x and 58x. We have not yet had the time to study how low in coverage you can go and still get a useful assembly.

In addition, the reads we used, which included the original ONT reads from their December 2023 data release plus some reads with similar characteristics generated at U. C. Santa Cruz, had a read N50 around 100 Kb after discarding the ones shorter than 10 Kb. That is much longer than the 21 Kb read N50 reported in the log you attached. Do you know why your sequencing pipeline is generating reads so short? The "Ultra-Long" (UL) aspect of the ONT December 2023 release is an important one.

As a result of low coverage and short reads, the read graph only includes 5.6 million alignments for 4.7 million reads, a ratio barely above 1. In a healthy assembly, that ratio should be at least 4 or preferably 6. With such a small number of alignments, the read graph is extremely fragmented and essentially no assembly is taking place.
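
As a quick sanity check, both numbers follow directly from the log (assuming a nominal 3.1 Gb genome):

# back-of-the-envelope checks from the stdout.log numbers
echo "71567931555 / 3100000000" | bc -l   # ≈ 23.1x coverage
echo "5560007 / 4690831" | bc -l          # ≈ 1.19 alignments kept per read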

Even under these conditions, the assembly should not assert, and I will look at fixing that. But that will simply result in a more graceful termination, still without a useful assembly.

I don't think switching to the default 10 Kb read length cutoff will help you: based on the log you attached, only ~287 Mb of coverage was discarded in reads shorter than 4 Kb, so the read length cutoff is not where your coverage is going, and raising it from 4 Kb to 10 Kb certainly won't add any.

If you can confirm by mapping that the accuracy of your reads is comparable to the accuracy of the ONT December 2023 release, and if you are able to double the amount of available coverage, you would have coverage similar to our 38x assembly. It would then be interesting to see whether you can still get useful assembly results despite the shorter reads.

Thank you for reporting this. We decided to release Shasta 0.12.0 even though the code has not completely stabilized, precisely to encourage the kind of experimentation you are doing.

paoloshasta commented 1 month ago

The "bug" label refers to the assertion, which should not happen even under insufficient coverage conditions.

paoloshasta commented 1 month ago

One more comment. We also experimented with a variety of error correction schemes, and we found that all the ones we tried resulted in a lower quality Shasta assembly. Our working hypothesis is that the error correction schemes are not sufficiently haplotype aware and therefore end up confusing the phasing process rather than helping it. Assuming your read accuracy is comparable to the one in the ONT data release, I recommend that you use uncorrected reads. That assembly configuration and the computational methods used by Shasta are perfectly capable of dealing with the error rates present in reads of that accuracy.

jelber2 commented 1 month ago

Wow, thank you very much for your fast and very thorough analysis. Based on aligning these reads to both haplotypes of hg002v1.0.1.fasta (yes, I suppose split alignments of one read against both haplotypes are possible), I got the following alignment identities from BEST (https://github.com/google/best):

The reads were from just one of our higher-output runs of the Ligation Sequencing Kit on a single PromethION flow cell (not using the Ultra-Long ligation kit for library prep). The reads I used for Shasta are in the bottom row of the table.

| reads | total_alns | primary_alns | identity | identity_qv | gap_compressed_identity | matches_per_kbp | mismatches_per_kbp | non_hp_ins_per_kbp | non_hp_del_per_kbp | hp_ins_per_kbp | hp_del_per_kbp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Illumina2x250bp HG002 | 6805132 | 6751494 | 0.994539 | 22.627025 | 0.994579 | 994.562660 | 5.376677 | 0.016667 | 0.029777 | 0.007443 | 0.030885 |
| our_ONT_HG002_SUP_Herro | 10116920 | 4863378 | 0.999642 | 34.455572 | 0.999765 | 999.730783 | 0.049120 | 0.024981 | 0.049600 | 0.064296 | 0.170496 |
| our_ONT_SUP_Herro_br_pg_asm2x | 9874513 | 4780152 | 0.999748 | 35.989081 | 0.999845 | 999.808384 | 0.025497 | 0.015097 | 0.028875 | 0.045124 | 0.137243 |
| ONT_SUP_Herro_br_pg_asm_dechat | 4786342 | 4780152 | 0.999758 | 36.153250 | 0.999852 | 999.816397 | 0.022937 | 0.014384 | 0.026442 | 0.044507 | 0.134224 |
paoloshasta commented 1 month ago

That is a very nice overall accuracy improvement, and it confirms that the fragmentation in the read graph is caused by low coverage. But those numbers by themselves don't tell us how faithful to the haplotypes the error correction process is. I say this because in our experimentation with various error correction schemes we found that all of them resulted in a worse assembly, presumably because the correction process is not sufficiently haplotype aware.

Shasta uses computational techniques designed for the low-accuracy reads of a few years ago, so it is perfectly capable of dealing with errors in the reads. If error correction is used, any errors introduced during that phase are baked in, cannot be recovered during assembly, and can confuse the phasing process. That is why I recommend using uncorrected reads as input to Shasta.

We hope to generate, in a future release, an assembly configuration for Mode 3 assembly designed for regular ONT R10 reads.

In any event, I suggest generating more coverage if possible, and then trying again.

jelber2 commented 1 month ago

Sorry for the delay in response. We might be able to generate more coverage (or already have more reads at a slightly longer N50), but this might take some time. I really appreciate your thoughtful and quick responses.

jelber2 commented 1 month ago

I am reopening this issue only because I have a method for estimating read-level phase switching that (if it is really working properly) seems to suggest there may be very low phase switching at the read level for Nanopore R10.4.1 SUP reads corrected with Herro (I have not yet tested the reads further corrected with brutal rewrite, peregrine-2021, and dechat, which are the reads I reported in this thread). https://gist.github.com/jelber2/45225fa0937b59f72bbdddf8d694eb70 If you have a chance, I would appreciate your thoughts, or let me know if you would like a more detailed explanation of the workflow. The first part of the gist is the analysis script, followed by a diagnostic histogram.

paoloshasta commented 1 month ago

Since the recent ONT announcement we have been experimenting with the R10.4.1 reads they released, both raw and error-corrected (with Herro). We have indeed seen preliminary indications that error correction does improve things and appears to be more haplotype aware than what we have seen in the past. If this is confirmed, it is likely that we will soon release a new Shasta version with two new assembly configurations for R10.4.1 reads: one for raw reads, and one for reads corrected with Herro (as integrated into the ONT basecaller/workflow process).

paoloshasta commented 1 month ago

Please don't close this issue. I added the "Bug" label because of the assertion you encountered in your assembly. That needs to be fixed. At the very least the code should end with a proper error message.

jelber2 commented 1 month ago

I am just reporting that after adding more sequences to the original reads reported previously (though the new, added reads have an N50 of only 25 Kbp), the run reported:

Total number of raw bases is 137407791376. (~44x coverage)

and an assembly was produced, so I presume it was, as you suggested, a coverage issue: 23x coverage was not enough with that preset.

paoloshasta commented 1 month ago

Thank you for reporting this.

At this point we are focusing on the new Nanopore R10.4.1 SUP reads, also error-corrected with Herro, which as an added benefit are much longer. We are working on optimizing the assembly configuration for those reads.

colindaven commented 1 month ago

@jelber2 I always aim for 40x+ coverage for good Shasta assemblies with N50 > 10 Mbp. That said, I do not monitor input coverage that closely, but that is what I expect our providers to deliver. With 40x+ coverage I consistently get very good assemblies across a range of plants.

We are starting to look at Herro/Dorado correct, and this looks very positive so far, especially as coverage does not seem to be adversely impacted.
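
For anyone else wanting to try it, the HERRO correction bundled in recent dorado versions is a single command along these lines (read file names are placeholders):

# dorado correct runs HERRO-based correction and writes the corrected reads as FASTA to stdout
dorado correct reads.fastq > corrected_reads.fasta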

paoloshasta commented 1 month ago

Yes, I agree with 40x+ from what I have seen so far with Shasta. But with these latest, much higher quality reads it might be possible to lower coverage and still get reasonably good assemblies. We have not looked at this yet, but at some point we plan to experiment with downsampling to understand how much assembly quality degrades as coverage decreases, and to what extent you can get away with less.