Closed — anandravi95 closed this issue 1 year ago.
For an assembly to run at full speed, you typically need between 6 and 8 bytes per base.
In the wording below, I use the suffix B for bytes and the suffix b for bases.
If your 2 TB of input are in a fastq file, you have about 1 Tb, and so you would need 6 to 8 TB of memory. If your 2 TB of input are in a fasta file, you have about 2 Tb, and so you would need 12 to 16 TB of memory. So it is not surprising that 2.8 TB was not sufficient. However, there are ways to reduce these requirements.
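The arithmetic above can be sketched in a few lines. The 0.5 bases-per-byte factor for fastq (sequence plus quality lines and headers) and the 6-8 bytes-per-base rule of thumb are the assumptions stated in this thread, not exact figures:

```python
# Rough memory estimate for a Shasta run, per the rule of thumb above:
# 6-8 bytes of memory per input base. Illustrative assumptions only.

def shasta_memory_tb(file_size_tb: float, fmt: str, bytes_per_base=(6, 8)):
    """Return a (low, high) memory estimate in TB for a given input file size.

    Assumes a fastq file holds ~0.5 bases per file byte (sequence plus
    quality lines and headers) and a fasta file ~1 base per file byte.
    """
    bases_tb = file_size_tb * (0.5 if fmt == "fastq" else 1.0)
    return tuple(bases_tb * bpb for bpb in bytes_per_base)

print(shasta_memory_tb(2, "fastq"))  # ~1 Tb of bases -> (6.0, 8.0) TB
print(shasta_memory_tb(2, "fasta"))  # ~2 Tb of bases -> (12.0, 16.0) TB
```

Either way the estimate comfortably exceeds the 2.8 TB that was available.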
First of all, the Nanopore-May2022 assembly configuration uses a read length cutoff of 10 Kb. Shorter reads will be discarded, and this will reduce the effective size of your input. For the sake of the discussion below, I will make the assumption that your 2 TB of input is a fastq file, which means that you have about 1 Tb of input. I will further assume that of those about 800 Gb are in reads longer than 10 Kb.
Under these assumptions, you would need about 5 to 7 TB of memory. If you have access to a machine of that size, I suggest running with --memoryMode filesystem --memoryBacking 2M for best performance (see here for more information). However, those options require root access. If you have access to a machine of that size, but without root access, your assembly will still run without specifying those memory options, but at reduced speed.
If you don't have access to a machine of that size, there are still a couple of things you can try.
The first option is to tell Shasta to use memory mapped to disk. You do this via --memoryMode filesystem --memoryBacking disk. See here for more information on that, but in summary this is only practical if your storage system uses SSDs (not spinning disks), and in addition the volume that contains your assembly directory must have at least 5 to 7 TB of available space. This option will slow down your assembly considerably, but hopefully to a tolerable extent. The slowdown will depend on the amount of available physical memory.
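A sketch of what such an invocation might look like. The binary name, input file, and assembly directory are placeholders; the --memoryMode/--memoryBacking flags are the documented Shasta options discussed above:

```shell
# Hypothetical invocation using disk-backed memory (no root access needed).
# Paths and file names are placeholders for your own setup.
./shasta-Linux-0.10.0 \
    --input reads.fastq \
    --config Nanopore-May2022 \
    --memoryMode filesystem \
    --memoryBacking disk \
    --assemblyDirectory ShastaRun
```

The assembly directory must sit on the SSD-backed volume with the 5 to 7 TB of free space mentioned above.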
The second option is to reduce marker density. This is the fraction of k-mers that are used as markers by Shasta. It is controlled by command line option --Kmers.probability and set to 0.1 in the assembly configuration you are using. When that is reduced, memory requirements are reduced by almost the same factor. So for example, if you reduced that to 0.05, your memory requirement would be somewhere around 3 TB. This usually has only a small effect on assembly contiguity and accuracy - however, some tweaks of other assembly parameters may be required. If you run an assembly with this option, please post AssemblySummary.html and I can give suggestions.
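The "almost the same factor" scaling above can be made concrete. This is a sketch under the thread's own assumption of approximately linear scaling of memory with marker density, starting from the 5-7 TB estimate at the default density of 0.1:

```python
# Memory scales roughly linearly with marker density (--Kmers.probability),
# per the comment above. Baseline: an assumed 5-7 TB at density 0.1.

def scaled_memory_tb(base_low_tb, base_high_tb, base_density, new_density):
    factor = new_density / base_density
    return base_low_tb * factor, base_high_tb * factor

low, high = scaled_memory_tb(5, 7, 0.10, 0.05)
print(low, high)  # halving marker density roughly halves memory: 2.5 - 3.5 TB
```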
Either way, de novo assembly of such a large genome is challenging. If successful, this will be the largest genome ever assembled with Shasta to my knowledge, and I will be happy to help out in the process. Best of luck and let me know how it goes!
Thank you for such an informative reply!! Will try with the settings you suggested and keep you updated with the results :)
Hi @paoloshasta
The second option is to reduce marker density. This is the fraction of k-mers that are used as markers by Shasta. It is controlled by command line option --Kmers.probability and set to 0.1 in the assembly configuration you are using. When that is reduced, memory requirements are reduced by almost the same factor. So for example if you reduced that to 0.05 your memory requirement would be somewhere around 3 TB. This has usually only a small effect on assembly contiguity and accuracy - however some tweaks of other assembly parameters may be required. If you run an assembly with this option, please post AssemblySummary.html and I can give suggestions.
I tried with --Kmers.probability and, as you had mentioned, the memory usage was lower compared to the previous run.
I used two config files, Nanopore-Phased-May2022 and Nanopore-Phased-Jan2022, so as to run with diploid characteristics and avoid squashing all 6 chromosome copies into 1.
Below are the best assembly stats I have obtained so far, from the Assembly-Phased.fasta files, after tweaking some parameters. I have uploaded the AssemblySummary.html file as well, as you had requested.
Assembly.stats
config : Nanopore-Phased-Jan2022
Parameters:
--Kmers.probability 0.05
--Reads.minReadLength 40000
Max memory used: 1.1 TB
A C G T N IUPAC Other GC GC_stdev
0.2748 0.2275 0.2264 0.2714 0.0000 0.0000 0.0000 0.4539 0.0963
Main genome scaffold total: 247684
Main genome contig total: 247684
Main genome scaffold sequence total: 8078.831 MB
Main genome contig sequence total: 8078.831 MB 0.000% gap
Main genome scaffold N/L50: 20912/117.136 KB
Main genome contig N/L50: 20912/117.136 KB
Main genome scaffold N/L90: 7863/185.348 KB
Main genome contig N/L90: 7863/185.348 KB
Max scaffold length: 1.477 MB
Max contig length: 1.477 MB
Number of scaffolds > 50 KB: 59773
% main genome in scaffolds > 50 KB: 87.67%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 247,684 247,684 8,078,830,637 8,078,830,637 100.00%
50 224,710 224,710 8,078,266,209 8,078,266,209 100.00%
100 204,897 204,897 8,076,812,209 8,076,812,209 100.00%
250 167,706 167,706 8,070,636,880 8,070,636,880 100.00%
500 138,193 138,193 8,060,075,754 8,060,075,754 100.00%
1 KB 114,383 114,383 8,043,215,902 8,043,215,902 100.00%
2.5 KB 96,572 96,572 8,015,982,493 8,015,982,493 100.00%
5 KB 90,670 90,670 7,995,681,285 7,995,681,285 100.00%
10 KB 87,521 87,521 7,972,851,586 7,972,851,586 100.00%
25 KB 79,076 79,076 7,822,568,184 7,822,568,184 100.00%
50 KB 59,774 59,774 7,082,885,303 7,082,885,303 100.00%
100 KB 27,366 27,366 4,737,674,491 4,737,674,491 100.00%
250 KB 3,489 3,489 1,218,331,080 1,218,331,080 100.00%
500 KB 328 328 206,215,929 206,215,929 100.00%
1 MB 6 6 6,972,560 6,972,560 100.00%
#############################################################
config: Nanopore-Phased-May2022
Parameters:
--Reads.minReadLength 40000
--Kmers.probability 0.07
Max memory used: 1.2 TB
A C G T N IUPAC Other GC GC_stdev
0.2743 0.2267 0.2264 0.2726 0.0000 0.0000 0.0000 0.4531 0.0936
Main genome scaffold total: 243310
Main genome contig total: 243310
Main genome scaffold sequence total: 8036.524 MB
Main genome contig sequence total: 8036.524 MB 0.000% gap
Main genome scaffold N/L50: 20044/121.025 KB
Main genome contig N/L50: 20044/121.025 KB
Main genome scaffold N/L90: 7568/192.327 KB
Main genome contig N/L90: 7568/192.327 KB
Max scaffold length: 1.6 MB
Max contig length: 1.6 MB
Number of scaffolds > 50 KB: 57926
% main genome in scaffolds > 50 KB: 87.48%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 243,310 243,310 8,036,523,884 8,036,523,884 100.00%
50 221,540 221,540 8,035,987,591 8,035,987,591 100.00%
100 202,328 202,328 8,034,574,930 8,034,574,930 100.00%
250 164,529 164,529 8,028,277,340 8,028,277,340 100.00%
500 134,315 134,315 8,017,465,251 8,017,465,251 100.00%
1 KB 111,044 111,044 8,001,097,477 8,001,097,477 100.00%
2.5 KB 93,711 93,711 7,974,493,325 7,974,493,325 100.00%
5 KB 88,154 88,154 7,955,258,637 7,955,258,637 100.00%
10 KB 85,196 85,196 7,934,263,112 7,934,263,112 100.00%
25 KB 78,171 78,171 7,806,953,475 7,806,953,475 100.00%
50 KB 57,926 57,926 7,030,355,797 7,030,355,797 100.00%
100 KB 27,372 27,372 4,824,745,073 4,824,745,073 100.00%
250 KB 3,730 3,730 1,314,586,799 1,314,586,799 100.00%
500 KB 345 345 217,658,228 217,658,228 100.00%
1 MB 14 14 17,096,969 17,096,969 100.00%
##########################################################################
config: Nanopore-Phased-May2022
Parameters:
--Kmers.probability 0.04
Max memory used: 1.6 TB
A C G T N IUPAC Other GC GC_stdev
0.2767 0.2238 0.2236 0.2758 0.0000 0.0000 0.0000 0.4475 0.0818
Main genome scaffold total: 118128
Main genome contig total: 118128
Main genome scaffold sequence total: 2129.694 MB
Main genome contig sequence total: 2129.694 MB 0.000% gap
Main genome scaffold N/L50: 10496/55.68 KB
Main genome contig N/L50: 10496/55.68 KB
Main genome scaffold N/L90: 39582/14.151 KB
Main genome contig N/L90: 39582/14.151 KB
Max scaffold length: 1.211 MB
Max contig length: 1.211 MB
Number of scaffolds > 50 KB: 12336
% main genome in scaffolds > 50 KB: 54.55%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 118,128 118,128 2,129,694,011 2,129,694,011 100.00%
50 113,769 113,769 2,129,585,360 2,129,585,360 100.00%
100 109,321 109,321 2,129,256,993 2,129,256,993 100.00%
250 98,395 98,395 2,127,402,626 2,127,402,626 100.00%
500 87,250 87,250 2,123,356,559 2,123,356,559 100.00%
1 KB 76,141 76,141 2,115,404,211 2,115,404,211 100.00%
2.5 KB 65,058 65,058 2,097,935,854 2,097,935,854 100.00%
5 KB 55,009 55,009 2,059,729,745 2,059,729,745 100.00%
10 KB 46,189 46,189 1,996,533,289 1,996,533,289 100.00%
25 KB 26,492 26,492 1,667,739,007 1,667,739,007 100.00%
50 KB 12,336 12,336 1,161,854,381 1,161,854,381 100.00%
100 KB 3,420 3,420 544,985,033 544,985,033 100.00%
250 KB 307 307 103,757,568 103,757,568 100.00%
500 KB 20 20 12,301,665 12,301,665 100.00%
1 MB 1 1 1,211,020 1,211,020 100.00%
##########################################################################
Setting a read length cutoff and the Kmers value helps, but setting the read length cutoff above 50 Kb reduces the assembly size.
Overall, the desired assembly size would be around 12-15 Gb, but it will be very hard to get there.
From a biology point of view, wheat, being hexaploid, has 3 ancestral diploid genomes. Does Shasta support assembling only one ancestor? (Correct me if I am wrong.)
I hope these results give some idea, and it would be good to know how to improve them.
Please let me know if there is any other information I can provide, or if I missed something.
Under these assumptions, you would need about 5 to 7 TB of memory. If you have access to a machine of that size, I suggest running with --memoryMode filesystem --memoryBacking 2M for best performance (see here for more information). However those options require root access. If you have access to a machine of that size, but without root access, your assembly will still run without specifying those memory options, but at reduced speed.
Unfortunately, I don't have root access, nor do I have access to a machine of that size.
The first two assemblies, with a 40 Kb read length cutoff, use 308 Gb of input reads. For a 16 Gb genome, this is about 19x coverage. This is generally too low for a good assembly. As a result the assembly summary shows that the read graph only uses about 8 million alignments for 6 million reads, and that almost half of the read bases are isolated in the read graph - that is, they don't participate in the assembly.
The third assembly does not have this problem, and yet it assembled a lot less sequence, probably due to the lower effective read length.
I think one key problem here is the use of phased assembly. Shasta makes a hard assumption of a diploid assembly, which works well for a human genome, but here you really have ploidy 6. As a result, in all three assemblies most of the sequence is assembled "outside bubble chains", which means that the assembler was able to do almost no phasing. This is not surprising as the code is only prepared to deal with two sequence copies, and here you have 6.
Given this, I expect that you will have better luck with a haploid assembly. A haploid assembly with the standard 10 Kb read length cutoff will give you the same coverage as your third assembly, 795 Gb or 50x for a 16 Gb genome, which is a good amount of coverage for Shasta. If you run it with marker density 0.04 like your third assembly you should be again able to assemble in a few hours, because the memory usage of haploid and diploid assembly is comparable.
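The coverage arithmetic in the last few paragraphs can be verified in a couple of lines; the input sizes (308 Gb and 795 Gb) and the ~16 Gb genome size are the figures quoted above:

```python
# Coverage = total input bases / genome size (both in Gb here).

def coverage(input_gb: float, genome_gb: float) -> float:
    return input_gb / genome_gb

print(round(coverage(308, 16)))  # 40 Kb cutoff: ~19x, too low for a good assembly
print(round(coverage(795, 16)))  # 10 Kb cutoff: ~50x, a good amount for Shasta
```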
In summary, I suggest the following for your next attempt:
--config Nanopore-May2022 --Kmers.probability 0.04 (this will use the 10 Kb read length cutoff from Nanopore-May2022).
If you try this, let me know how it goes, again posting AssemblySummary.html. It is possible/likely that additional tweaks will be needed, but I think this is a logical next step. The good news is that you are able to run these assemblies in a few hours on machines that are available to you.
Depending on the similarity between the 6 copies, in a haploid assembly some sequence will likely be collapsed (two or three copies assembled as one), and so I think it is unlikely that you will get the full expected genome size. I am working on improvements in Shasta that should allow it to do a better job at separating multiple copies of similar sequence. This is mostly motivated by segmental duplications in human assemblies, but should also help with high ploidy genomes.
As an alternative, using Ultra-Long (UL) reads with the existing Shasta haploid assembly also has the potential for doing a better job at separating sequence copies. But I recognize that getting a sufficient amount of UL reads for such a large genome is a large undertaking.
Your question on assembling only one of the three ancestors is related to my above comment about collapsed copies. In a Shasta haploid assembly, it is often possible to increase the amount of collapsing by loosening alignment criteria. Key parameters affecting this are the similarity between the three ancestors, and the heterozygosity rate within each of the three. Do you have any estimates for those?
I will post the results for --config Nanopore-May2022 --Kmers.probability 0.04 soon.
Your question on assembling only one of the three ancestors is related to my above comment about collapsed copies. In a Shasta haploid assembly, it is often possible to increase the amount of collapsing by loosening alignment criteria. Key parameters affecting this are the similarity between the three ancestors, and the heterozygosity rate within each of the three. Do you have any estimates for those?
I don't have an estimate for the heterozygosity rates. By alignment criteria, do you mean the minAlignedFraction value?
Mostly that, but also minAlignedMarkerCount and, to a lesser extent, maxSkip, maxDrift, and maxTrim. If you experiment with these 5 alignment criteria, keep in mind they are all specified in marker space.
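"Marker space" means the criteria count markers, not bases. With marker density d (--Kmers.probability), a span of L bases carries about d * L markers on average; a small sketch, using the density 0.04 from the earlier runs as the example value:

```python
# Alignment criteria such as minAlignedMarkerCount are expressed in markers.
# With marker density d, a span of L bases carries about d * L markers.

def bases_to_markers(length_bases: int, marker_density: float) -> int:
    return round(length_bases * marker_density)

# At density 0.04, minAlignedMarkerCount 100 corresponds to an aligned
# span of roughly 2500 bases:
print(bases_to_markers(2500, 0.04))  # 100
```

So when you change --Kmers.probability, remember that all five marker-space thresholds implicitly change their meaning in base space.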
Also keep in mind --config Nanopore-May2022 uses adaptive selection of alignment criteria (via --ReadGraph.creationMethod 2). For the alignment criteria you specify to become effective you also need to use --ReadGraph.creationMethod 0. If you experiment with alignment criteria, you can use as starting points and for guidance the values adaptively selected when using --ReadGraph.creationMethod 2, and reported in AssemblySummary.html under the heading Alignment criteria actually used for creation of the read graph.
If you switch to --ReadGraph.creationMethod 0, I also suggest increasing the number of MinHash iterations, for example --MinHash.minHashIterationCount 100. More MinHash iterations will be beneficial when using --ReadGraph.creationMethod 0, but somehow interfere with adaptive selection of alignment criteria. You could also consider adjusting --MinHash.minBucketSize and --MinHash.maxBucketSize using the information in LowHashBucketHistogram.csv for guidance - let me know if you need more information on that. There was some discussion on that in another Shasta issue (possibly in the old Shasta repository https://github.com/chanzuckerberg/shasta).
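One possible way to turn LowHashBucketHistogram.csv into bucket-size choices is sketched below. The column names ("BucketSize", "FeatureCount") and the simple thresholding rule are assumptions for illustration; check the actual header of the file produced by your run, and the discussion referenced above, before relying on this:

```python
import csv
from collections import OrderedDict

# Hypothetical helper: pick --MinHash.minBucketSize / --MinHash.maxBucketSize
# from a bucket-size histogram, discarding the low-size noise and the
# high-size repeat-driven tail. Column names are assumed, not documented.

def load_histogram(path):
    hist = OrderedDict()
    with open(path) as f:
        for row in csv.DictReader(f):
            hist[int(row["BucketSize"])] = int(row["FeatureCount"])
    return hist

def pick_bucket_range(hist, noise_cut=0.01):
    """Keep bucket sizes whose counts exceed noise_cut of the peak count."""
    peak = max(hist.values())
    kept = [size for size, count in hist.items() if count >= noise_cut * peak]
    return min(kept), max(kept)
```

Eyeballing the histogram and choosing the range around its main peak by hand works just as well; this only automates that inspection.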
I tried with the following parameters:
config: Nanopore-May2022
--Reads.minReadLength 40000
--Kmers.probability 0.04
--ReadGraph.creationMethod 0
--MinHash.minHashIterationCount 100
--Align.minAlignedMarkerCount 100
Memory used: 2.2 TB
Run time: ~8.5 hrs
Assembly stats
A C G T N IUPAC Other GC GC_stdev
0.2733 0.2290 0.2275 0.2702 0.0000 0.0000 0.0000 0.4565 0.0383
Main genome scaffold total: 69772
Main genome contig total: 69772
Main genome scaffold sequence total: 12420.223 MB
Main genome contig sequence total: 12420.223 MB 0.000% gap
Main genome scaffold N/L50: 11653/322.849 KB
Main genome contig N/L50: 11653/322.849 KB
Main genome scaffold N/L90: 2435/643.919 KB
Main genome contig N/L90: 2435/643.919 KB
Max scaffold length: 3.391 MB
Max contig length: 3.391 MB
Number of scaffolds > 50 KB: 50478
% main genome in scaffolds > 50 KB: 97.32%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 69,772 69,772 12,420,223,033 12,420,223,033 100.00%
50 69,434 69,434 12,420,212,702 12,420,212,702 100.00%
100 69,262 69,262 12,420,199,692 12,420,199,692 100.00%
250 68,918 68,918 12,420,142,287 12,420,142,287 100.00%
500 68,482 68,482 12,419,979,256 12,419,979,256 100.00%
1 KB 67,787 67,787 12,419,467,225 12,419,467,225 100.00%
2.5 KB 66,171 66,171 12,416,657,071 12,416,657,071 100.00%
5 KB 64,055 64,055 12,408,841,116 12,408,841,116 100.00%
10 KB 60,960 60,960 12,386,230,638 12,386,230,638 100.00%
25 KB 56,314 56,314 12,309,701,693 12,309,701,693 100.00%
50 KB 50,479 50,479 12,087,143,584 12,087,143,584 100.00%
100 KB 38,222 38,222 11,174,544,399 11,174,544,399 100.00%
250 KB 16,802 16,802 7,669,493,356 7,669,493,356 100.00%
500 KB 4,838 4,838 3,499,777,076 3,499,777,076 100.00%
1 MB 504 504 652,356,031 652,356,031 100.00%
2.5 MB 5 5 14,719,002 14,719,002 100.00%
Depending on the similarity between the 6 copies, in a haploid assembly some sequence will likely be collapsed (two or three copies assembled as one), and so I think it is unlikely that you will get the full expected genome size. I am working on improvements in Shasta that should allow it to do a better job at separating multiple copies of similar sequence. This is mostly motivated by segmental duplications in human assemblies, but should also help with high ploidy genomes.
The genome size has improved after the haploid assembly and tweaking the parameters you suggested. Might this be because of the collapsing of sequences?
Keeping the standard 10 Kb read length cutoff was giving memory issues. However, an assembly is currently running with the 10 Kb default value. Hopefully it doesn't crash again. I will post the results of that soon once it is done.
I am very curious to know in detail what these parameters do, and some more information about LowHashBucketHistogram.csv would be really helpful, as I couldn't find an issue related to it.
At 12.4 Gb assembled you are getting in your expected range 12-15 Gb. The big increase in assembled sequence is due to switching from phased diploid assembly to haploid assembly. Phased diploid assembly works well for a genome that is actually diploid, but your situation is very different, and the assumptions made during diploid assembly are not helpful.
But the assembly is still highly fragmented, with an N50 of only 323 Kb. This is not surprising because of the low amount of coverage. With the 40 Kb read length cutoff, the assembly only used 308 Gb as input, which is about 20x for a 15 Gb genome. High fragmentation (low N50) is the most common symptom of low coverage. So I think you need to increase coverage by reducing the read length cutoff. Perhaps not necessarily down to 10 Kb, but possibly somewhere in between. I would want to make sure to have at least 40x coverage, which for a 15 Gb genome means 600 Gb of input. If this gives you memory problems, reduce the marker density (--Kmers.probability) further, as that does not seem to be causing issues.
And you should definitely optimize the choices of the minimum/maximum bucket sizes for the MinHash algorithm. You can find here the discussion I was referring to. Let me know if additional clarification on that is needed.
@anandravi95 have a look at #11. TL;DR: Give Nanopore-Plants-Apr2021.conf a try. It might work better for your data.
@anandravi95 have a look at #11. TL;DR: Give Nanopore-Plants-Apr2021.conf a try. It might work better for your data.
Thank you for the suggestion. I will try and post the results later.
This was a trial dataset with public, older 9.4.1 data. We recently got excellent brand-new ONT 10.4.1 Kit14 data from the same species.
Again, this is a ca. 17 Gb plant genome.
Read accuracy after alignment to a close reference was modal 98.5% (vs. a modal 88% in the first 9.4.1 dataset analyzed by Anandh). Both analyses were performed with cramino.
On these modal 98.5% accuracy data, Shasta performed excellently with the May 2022 config (I tried other configs, but all were significantly worse).
Stats: 19.3 Mb N50, ca. 2800 contigs. Assembly size about 13.6 Gbp, a bit on the small side.
Time/resources: 12 hours on an 80-core, 3 TB RAM (2.7 TB used), 7+ year old machine.
Data prep: The main trick used was to use filtlong to keep the 70% longest reads, which subsampled to about 40x data, so that this would fit into the 3 TB of RAM available.
Thanks again for this awesome assembler!
Hope that helps, Colin
Nice and thank you for the good words! A few comments/questions follow.
On these modal 98.5% accuracy data, Shasta performed excellently with the May 2022 config (I tried other configs, but all were significantly worse).
In your experiments with other configs, did you try the Nanopore-R10-Fast-Nov2022 assembly configuration? This was optimized especially for R10 reads at 35x coverage - but for human genomes. It is available in Shasta 0.11.1.
ca. 2800 contigs.
In Shasta I made a conscious decision to write out all assembled contigs regardless of length. This is to avoid loss of information, and because there may be cases where some of the short ones are interesting for one reason or another. And it is easy to filter them out based on length. For this reason, the number of assembled contigs is not a significant metric.
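Filtering by length really is easy; a minimal stdlib-only sketch (the 50 Kb threshold is just an example, and a dedicated tool such as seqkit would be faster on a multi-Gb assembly):

```python
# Minimal FASTA length filter: keep only contigs of at least min_len bases.

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as f:
        for line in f:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

def filter_fasta(records, min_len=50_000):
    """records: iterable of (header, sequence) pairs."""
    return [(h, s) for h, s in records if len(s) >= min_len]
```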
Data prep: The main trick used was to use filtlong to keep the 70% longest reads, which subsampled to about 40x data, so that this would fit into the 3 TB of RAM available.
You can also use Shasta command line options --Reads.desiredCoverage or --Reads.minReadLength to achieve the same result without an extra step. The filtering is done on the fly while the reads are loaded. But filtering in advance can be useful when doing multiple assemblies on the same reads, so you don't have to read the complete dataset each time.
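A sketch of the on-the-fly equivalent of the filtlong step. The binary name and file names are placeholders, and the value format for --Reads.desiredCoverage is an assumption; check shasta --help for the exact syntax accepted by your version:

```shell
# Hypothetical on-the-fly alternative to pre-filtering with filtlong:
# Shasta keeps the longest reads up to the requested total number of bases.
./shasta-Linux-0.11.1 \
    --input reads.fastq \
    --config Nanopore-May2022 \
    --Reads.desiredCoverage 600Gb \
    --assemblyDirectory ShastaRun
```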
To increase the amount of assembled sequence, you could consider increasing coverage used for assembly to around 60x. If this makes the run unmanageable for your 3 TB machine, you could reduce marker density (--Kmers.probability) a bit to reduce memory requirements.
I am closing this due to lack of discussion. Feel free to reopen it or create a new issue if additional topics emerge.
Ditto
Hi @paoloshasta, I am trying to do a wheat assembly with a file size of 2 TB. The config used was Nanopore-May2022 and the memory given for LSF was 2.8 TB. The assembly ran for ~14 hrs and then crashed because of a memory issue.
How much memory do you think would be needed for a file of this size?