paoloshasta / shasta

De novo assembly from Oxford Nanopore reads.
https://paoloshasta.github.io/shasta/

No alignment #36

Closed by portbo 1 month ago

portbo commented 1 month ago

Hi, I have single flow cell output with a ~130 kb N50 from an ONT UL chemistry 10 run.

After initial QC, Shasta reports the following. The run finishes without errors, but there is no alignment and the assembly FASTA is empty:

Command line: ./shasta-Linux-0.12.0 --input /oxford_nanopore_UL/E13.pass.fq --config Nanopore-ncm23-May2024 --Align.alignMethod 3 --assemblyDirectory /oxford_nanopore_UL/output2
For options in use for this assembly, see shasta.conf in the assembly directory.
This run uses options "--memoryBacking 4K --memoryMode anonymous". This could result in longer run time. For faster assembly, use "--memoryBacking 2M --memoryMode filesystem" (root privilege via sudo required). Therefore the results of this run should not be used for the purpose of benchmarking assembly time. However the memory options don't affect assembly results in any way.
This assembly will use 48 threads.
Turning off --Reads.noCache for /oxford_nanopore_UL/E13.pass.fq
Discarded read statistics for file /oxford_nanopore_UL/E13.pass.fq:
Discarded 0 reads containing invalid bases for a total 0 valid bases.
Discarded 206363 reads shorter than 10000 bases for a total 748307569 bases.
Discarded 0 reads containing repeat counts 256 or more for a total 0 bases.
Discarded read statistics for all input files:
Discarded 0 reads containing invalid bases for a total 0 valid bases.
Discarded 206363 short reads for a total 748307569 bases.
Discarded 0 reads containing repeat counts 256 or more for a total 0 bases.
Read statistics for reads that will be used in this assembly:
Total number of reads is 307254.
Total number of raw bases is 24878481361.

I can upload more files if needed. Thanks

Attachments: performance.log, slurm-15143295.out

Shasta Release 0.12.0
2024-Sep-14 11:25:54.613492 Assembly begins.
Average read length is 80970.4 bases.
N50 for read length is 128596 bases.
Found 0 reads with duplicate names.
Discarded from the assembly 0 reads with duplicate names.
Flagged 945 reads as palindromic out of 307254 total. Palindromic fraction is 0.00307563
LowHash0 algorithm will use 2^30 = 1073741824 buckets.
Automatic settings for this iteration: minBucketSize 4, maxBucketSize 6
Alignment candidates after lowhash iteration 0: high frequency 608132, total 1425710, capacity 1425710.
Automatic settings for this iteration: minBucketSize 4, maxBucketSize 6

paoloshasta commented 1 month ago

The assembly configuration you are using, Nanopore-ncm23-May2024, only applies to experimental data from a December 2023 ONT data release. To my knowledge, ONT never supported these reads for customers. It is likely that your reads are more standard R10 reads, for which a different assembly configuration is required, mostly due to their lower accuracy.

The table in this documentation page summarizes the most useful Shasta assembly configurations.

If your reads use the latest ONT chemistry announced in May 2024 (r10.4.1_e8.2-400bps_sup, see https://labs.epi2me.io/lc2024_t2t/), you can use assembly configuration Nanopore-r10.4.1_e8.2-400bps_sup-Herro-Sep2024 if your reads have been error corrected with Herro, or Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024 if your reads were not error corrected. These two assembly configurations are only supported by Shasta 0.13.0, released last week.

If your reads are more standard R10 reads, I suggest Nanopore-R10-Fast-Nov2022 for haploid assembly, or Nanopore-Phased-R10-Fast-Nov2022 for phased assembly.

portbo commented 1 month ago

Hi Paolo, thanks for such a quick response. My UL reads are r10.4.1_e8.2_400bps_sup@v4.3.0, so the latest, I presume. I shall download Shasta 0.13.0 and rerun with the phased assembly config. I can open a new issue if I have problems with that, so I shall close this one. BW.

paoloshasta commented 1 month ago

You should get a good assembly with these reads, provided you have sufficient coverage. Make sure to select the appropriate configuration depending on whether or not you ran Herro error correction on your reads.

I see from the output you attached that you only have 24 Gb of coverage, and if this is for a human genome it is probably insufficient coverage to generate a useful assembly. If it is a smaller genome you may be ok.

See the slides that accompany the Shasta 0.13.0 release for a summary of the results we obtained with these configurations on the reads released by ONT.

portbo commented 1 month ago

Hi, thanks for the heads up! I didn't run Herro correction, hence choosing the raw config.

It is a human genome, de novo.

I had about 300Gb of raw data from a single flowcell run, as we had issues with pores getting blocked due to the UL DNA, and further washes were not possible. Perhaps in the future I will need to run further flowcell washes to bulk up the coverage.

As per the slides you had 45X coverage, so in the future a full three washes may obtain the needed coverage. The read N50 looks good at 128 kb, hopefully. Thanks again, running it with version 0.13.0 now. BW

paoloshasta commented 1 month ago

I am not an expert in sequencing protocols, but ONT says they only used two flowcells to get 45x. The details of the protocol they used are in their announcement, and I am sure they would help if you are getting lower yield than they did. But your read N50 is excellent.

With your existing 25 Gb of coverage, you are at around 8X and that will not be sufficient to get anything close to a usable assembly. At best, it will be very fragmented and highly incomplete.
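For reference, the 8X figure follows directly from the log numbers. A quick sketch (the ~3.1 Gb human genome size is an assumption, not something from this thread):

```python
# Coverage estimate from the Shasta log above.
total_bases = 24_878_481_361   # "Total number of raw bases" in the log
genome_size = 3.1e9            # approximate human genome size (assumption)

coverage = total_bases / genome_size
print(f"{coverage:.1f}x total, {coverage / 2:.1f}x per haplotype")
# prints: 8.0x total, 4.0x per haplotype
```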

portbo commented 1 month ago

Still no luck even with version 0.13.0 and the new config file; the assembly FASTA is empty.

Shasta Release 0.13.0
2024-Sep-18 00:45:00.895842 Assembly begins.
Command line: ./shasta-Linux-0.13.0 --input /data/rds/DGE/DUDGE/MOPOPGEN/studies/Glioma_GSC/oxford_nanopore_UL/E13.pass.fq --config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024 --Align.alignMethod 3 --assemblyDirectory /data/rds/DGE/DUDGE/MOPOPGEN/studies/Glioma_GSC/oxford_nanopore_UL/output
For options in use for this assembly, see shasta.conf in the assembly directory.
This run uses options "--memoryBacking 4K --memoryMode anonymous". This could result in longer run time. For faster assembly, use "--memoryBacking 2M --memoryMode filesystem" (root privilege via sudo required). Therefore the results of this run should not be used for the purpose of benchmarking assembly time. However the memory options don't affect assembly results in any way.
This assembly will use 48 threads.
Turning off --Reads.noCache for /oxford_nanopore_UL/E13.pass.fq
Discarded read statistics for file /E13.pass.fq:
Discarded 0 reads containing invalid bases for a total 0 valid bases.
Discarded 206363 reads shorter than 10000 bases for a total 748307569 bases.
Discarded 0 reads containing repeat counts 256 or more for a total 0 bases.
Discarded read statistics for all input files:
Discarded 0 reads containing invalid bases for a total 0 valid bases.
Discarded 206363 short reads for a total 748307569 bases.
Discarded 0 reads containing repeat counts 256 or more for a total 0 bases.
Read statistics for reads that will be used in this assembly:
Total number of reads is 307254.
Total number of raw bases is 24878481361.
Average read length is 80970.4 bases.
N50 for read length is 128596 bases.
Found 0 reads with duplicate names.
Discarded from the assembly 0 reads with duplicate names.
Flagged 945 reads as palindromic out of 307254 total. Palindromic fraction is 0.00307563
LowHash0 algorithm will use 2^30 = 1073741824 buckets.
Automatic settings for this iteration: minBucketSize 27, maxBucketSize 27
Alignment candidates after lowhash iteration 0: high frequency 11, total 57758, capacity 57758. Automatic settings for this iteration: minBucketSize 32, maxBucketSize 32
Alignment candidates after lowhash iteration 1: high frequency 23, total 95724, capacity 102685. Automatic settings for this iteration: minBucketSize 39, maxBucketSize 40
Alignment candidates after lowhash iteration 2: high frequency 62, total 152446, capacity 166465. Automatic settings for this iteration: minBucketSize 33, maxBucketSize 33
Alignment candidates after lowhash iteration 3: high frequency 79, total 187801, capacity 214834. Automatic settings for this iteration: minBucketSize 34, maxBucketSize 35
Alignment candidates after lowhash iteration 4: high frequency 125, total 246849, capacity 287577. Automatic settings for this iteration: minBucketSize 30, maxBucketSize 30
Alignment candidates after lowhash iteration 5: high frequency 170, total 280049, capacity 330254. Automatic settings for this iteration: minBucketSize 34, maxBucketSize 34
Alignment candidates after lowhash iteration 6: high frequency 226, total 303768, capacity 360912. Automatic settings for this iteration: minBucketSize 31, maxBucketSize 31
Alignment candidates after lowhash iteration 7: high frequency 275, total 329046, capacity 394637. Automatic settings for this iteration: minBucketSize 32, maxBucketSize 32
Alignment candidates after lowhash iteration 8: high frequency 345, total 350459, capacity 421848. Automatic settings for this iteration: minBucketSize 29, maxBucketSize 29
Alignment candidates after lowhash iteration 9: high frequency 405, total 378770, capacity 458210. Automatic settings for this iteration: minBucketSize 33, maxBucketSize 33
Alignment candidates after lowhash iteration 10: high frequency 445, total 403984, capacity 489608. Automatic settings for this iteration: minBucketSize 27, maxBucketSize 27
Alignment candidates after lowhash iteration 11: high frequency 495, total 431903, capacity 525071. Automatic settings for this iteration: minBucketSize 28, maxBucketSize 28
Alignment candidates after lowhash iteration 12: high frequency 566, total 456849, capacity 557561. Automatic settings for this iteration: minBucketSize 33, maxBucketSize 33
Alignment candidates after lowhash iteration 13: high frequency 667, total 477142, capacity 581574. Automatic settings for this iteration: minBucketSize 30, maxBucketSize 30
Alignment candidates after lowhash iteration 14: high frequency 748, total 500363, capacity 611012. Automatic settings for this iteration: minBucketSize 25, maxBucketSize 25
Alignment candidates after lowhash iteration 15: high frequency 861, total 531391, capacity 651258. Automatic settings for this iteration: minBucketSize 28, maxBucketSize 28
Alignment candidates after lowhash iteration 16: high frequency 932, total 556588, capacity 682055. Automatic settings for this iteration: minBucketSize 28, maxBucketSize 28
Alignment candidates after lowhash iteration 17: high frequency 1051, total 578183, capacity 710258. Automatic settings for this iteration: minBucketSize 33, maxBucketSize 33
Alignment candidates after lowhash iteration 18: high frequency 1175, total 592425, capacity 728032. Automatic settings for this iteration: minBucketSize 31, maxBucketSize 31
Alignment candidates after lowhash iteration 19: high frequency 1319, total 613936, capacity 754247. Automatic settings for this iteration: minBucketSize 31, maxBucketSize 31
Alignment candidates after lowhash iteration 20: high frequency 1458, total 630289, capacity 774910. Automatic settings for this iteration: minBucketSize 27, maxBucketSize 27
Alignment candidates after lowhash iteration 21: high frequency 1575, total 648961, capacity 798485. Automatic settings for this iteration: minBucketSize 32, maxBucketSize 32
Alignment candidates after lowhash iteration 22: high frequency 1712, total 668124, capacity 821331. Automatic settings for this iteration: minBucketSize 28, maxBucketSize 28
Alignment candidates after lowhash iteration 23: high frequency 1844, total 686101, capacity 845137. Automatic settings for this iteration: minBucketSize 28, maxBucketSize 28
Alignment candidates after lowhash iteration 24: high frequency 2001, total 702085, capacity 865725. Automatic settings for this iteration: minBucketSize 28, maxBucketSize 28
Alignment candidates after lowhash iteration 25: high frequency 2147, total 719183, capacity 889125. Automatic settings for this iteration: minBucketSize 27, maxBucketSize 27
Alignment candidates after lowhash iteration 26: high frequency 2298, total 736827, capacity 912630. Automatic settings for this iteration: minBucketSize 33, maxBucketSize 33
Alignment candidates after lowhash iteration 27: high frequency 2477, total 751083, capacity 929978. Automatic settings for this iteration: minBucketSize 32, maxBucketSize 32
Alignment candidates after lowhash iteration 28: high frequency 2649, total 766918, capacity 949014. Automatic settings for this iteration: minBucketSize 28, maxBucketSize 28
Alignment candidates after lowhash iteration 29: high frequency 2824, total 781106, capacity 968540. Automatic settings for this iteration: minBucketSize 27, maxBucketSize 27
Alignment candidates after lowhash iteration 30: high frequency 2988, total 796264, capacity 988047. Automatic settings for this iteration: minBucketSize 31, maxBucketSize 31
Alignment candidates after lowhash iteration 31: high frequency 3181, total 808886, capacity 1003715. Automatic settings for this iteration: minBucketSize 28, maxBucketSize 28
Alignment candidates after lowhash iteration 32: high frequency 3399, total 820873, capacity 1018417. Automatic settings for this iteration: minBucketSize 35, maxBucketSize 35
Alignment candidates after lowhash iteration 33: high frequency 3575, total 837126, capacity 1038857. Automatic settings for this iteration: minBucketSize 29, maxBucketSize 29
Alignment candidates after lowhash iteration 34: high frequency 3782, total 853855, capacity 1061500. Automatic settings for this iteration: minBucketSize 36, maxBucketSize 36
Alignment candidates after lowhash iteration 35: high frequency 3971, total 868967, capacity 1079097. Automatic settings for this iteration: minBucketSize 28, maxBucketSize 28
Alignment candidates after lowhash iteration 36: high frequency 4200, total 879743, capacity 1093445. Automatic settings for this iteration: minBucketSize 38, maxBucketSize 38
Alignment candidates after lowhash iteration 37: high frequency 4321, total 895440, capacity 1111685. Automatic settings for this iteration: minBucketSize 25, maxBucketSize 25
Alignment candidates after lowhash iteration 38: high frequency 4523, total 916825, capacity 1140937. Automatic settings for this iteration: minBucketSize 30, maxBucketSize 30
Alignment candidates after lowhash iteration 39: high frequency 4730, total 930419, capacity 1159301. Automatic settings for this iteration: minBucketSize 33, maxBucketSize 33
Alignment candidates after lowhash iteration 40: high frequency 4921, total 943673, capacity 1175647. Automatic settings for this iteration: minBucketSize 34, maxBucketSize 34
Alignment candidates after lowhash iteration 41: high frequency 5140, total 955242, capacity 1189389. Automatic settings for this iteration: minBucketSize 31, maxBucketSize 31
Alignment candidates after lowhash iteration 42: high frequency 5362, total 966986, capacity 1203239. Automatic settings for this iteration: minBucketSize 27, maxBucketSize 27
Alignment candidates after lowhash iteration 43: high frequency 5607, total 981753, capacity 1221418. Automatic settings for this iteration: minBucketSize 27, maxBucketSize 27
Alignment candidates after lowhash iteration 44: high frequency 5875, total 993465, capacity 1236749. Automatic settings for this iteration: minBucketSize 31, maxBucketSize 31
Alignment candidates after lowhash iteration 45: high frequency 6127, total 1005958, capacity 1251957. Automatic settings for this iteration: minBucketSize 36, maxBucketSize 36
Alignment candidates after lowhash iteration 46: high frequency 6386, total 1018765, capacity 1267085. Automatic settings for this iteration: minBucketSize 35, maxBucketSize 36
Alignment candidates after lowhash iteration 47: high frequency 6856, total 1045405, capacity 1299703. Automatic settings for this iteration: minBucketSize 33, maxBucketSize 33
Alignment candidates after lowhash iteration 48: high frequency 7085, total 1057113, capacity 1313181. Automatic settings for this iteration: minBucketSize 30, maxBucketSize 30
Alignment candidates after lowhash iteration 49: high frequency 7317, total 1068013, capacity 1326806.
Found 7317 alignment candidates.
Average number of alignment candidates per oriented read is 0.0238142.
Number of alignment candidates before suppression is 7317
Suppressed 0 alignment candidates.
Number of alignment candidates after suppression is 7317
2024-Sep-18 00:54:59.534624 Alignment computation begins.
2024-Sep-18 00:55:02.079941 Alignment computation completed.
Found and stored 2485 good alignments.
Keeping 2485 alignments of 2485
Flagged 48 reads as chimeric out of 307254 total. Chimera rate is 0.000156223
Strand separation flagged 8 read graph edges out of 4970 total.
Kept 2473374700 disjoint sets with coverage in the requested range.
Found 0 disjoint sets with more than one marker on a single oriented read or with less than 0 supporting oriented reads on each strand.
Automatically set: minPrimaryCoverage = 13, maxPrimaryCoverage = 15
Found 0 primary marker graph edges.
Global assembly statistics by P-Value:
Global assembly statistics, combined for all P-values: total assembled length 0, N50 0
Total non-trivial bubble chain length 0, N50 0
This run used options "--memoryBacking 4K --memoryMode anonymous". This could result in longer run time. For faster assembly, use "--memoryBacking 2M --memoryMode filesystem" (root privilege via sudo required). Therefore the results of this run should not be used for the purpose of benchmarking assembly time. However the memory options don't affect assembly results in any way.
Shasta Release 0.13.0
2024-Sep-18 00:57:55.350214 Assembly ends.

portbo commented 1 month ago

Attachments: Assembly.csv, AssemblySummary.json, performance.log, stdout.log

colindaven commented 1 month ago

In my hands Shasta performs very reliably and well for plant genomes sequenced at 40-60X coverage (N50s of 20 to 45 Mb) with one of the current configs (sometimes you have to test a few).

I have had suboptimal, fragmented assemblies with <0.5 Mb N50 for large 4 Gb genomes with only 8-10X coverage. 8-10X coverage is not really sufficient for a good assembly, whatever the tool. I did have slightly better results with Flye, though, for very low coverage datasets. Canu might be another option (but it is slow).

You could also try correcting your reads with dorado correct, and then retrying shasta/flye/hifiasm.
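A minimal sketch of that pipeline, under the assumption that the input/output paths are placeholders and that the Shasta config name from earlier in this thread is the one you want (check `dorado correct --help` for the options available in your Dorado version):

```shell
# Error-correct the raw reads with dorado correct (HERRO-based),
# then assemble the corrected reads with the matching Shasta config.
# File names here are illustrative, not from this thread's logs.
dorado correct reads.fastq > corrected.fasta

./shasta-Linux-0.13.0 \
    --input corrected.fasta \
    --config Nanopore-r10.4.1_e8.2-400bps_sup-Herro-Sep2024 \
    --assemblyDirectory output_herro
```

Note the switch from the -Raw- to the -Herro- configuration once the reads are error corrected; that said, none of this will compensate for insufficient coverage.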

portbo commented 1 month ago

Thanks Colin. My reads have an N50 of 130 kb, but yes, I shall correct my reads and experiment with various configs and aligners. BW.

paoloshasta commented 1 month ago

This is an attempt at assembling a human genome using 25 Gb of coverage, equivalent to about 8X coverage, or 4X per haplotype (this assembly configuration is for phased assembly). As I predicted in my post yesterday, this is insufficient coverage for a useful assembly. No changes of options and parameters, or error correction, will be able to make up for this.

We tested the assembly configuration you are using at 45X and got a good assembly, as described in the slides that accompany Shasta Release 0.13.0. I expect that you might still be able to get a usable assembly at around 25X to 30X, but it will be fragmented and not very complete. Your current coverage is 3 to 4 times below that, so I think it is hopeless and your only recourse will be to do more sequencing.