paoloshasta / shasta

De novo assembly from Oxford Nanopore reads.
https://paoloshasta.github.io/shasta/

Memory for Wheat Assembly #9

Closed: anandravi95 closed this issue 1 year ago

anandravi95 commented 1 year ago

Hi @paoloshasta, I am trying to do a wheat assembly with an input file size of 2 TB. The config used was Nanopore-May2022 and the memory given to LSF was 2.8 TB. The assembly ran for ~14 hrs and then crashed because of a memory issue.

How much memory do you think would be needed for a file of this size?

paoloshasta commented 1 year ago

For an assembly to run at full speed, you typically need between 6 and 8 bytes per base.

In the wording below, I use the suffix B for bytes and the suffix b for bases.

If your 2 TB of input are in a fastq file, you have about 1 Tb (fastq stores roughly two bytes per base: one for the base call and one for its quality score), and so you would need 6 to 8 TB of memory. If your 2 TB of input are in a fasta file, you have about 2 Tb, and so you would need 12 to 16 TB of memory. So it is not surprising that 2.8 TB was not sufficient. However, there are ways to reduce these requirements.

First of all, the Nanopore-May2022 assembly configuration uses a read length cutoff of 10 Kb. Shorter reads will be discarded, and this will reduce the effective size of your input. For the discussion below, I will assume that your 2 TB of input is a fastq file, which means that you have about 1 Tb of input, and further that about 800 Gb of those are in reads longer than 10 Kb.

Under these assumptions, you would need about 5 to 7 TB of memory. If you have access to a machine of that size, I suggest running with --memoryMode filesystem --memoryBacking 2M for best performance (see here for more information). However, those options require root access. If you have access to a machine of that size, but without root access, your assembly will still run without specifying those memory options, but at reduced speed.
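
As a sketch, a full-speed run would look something like the following. The input file name and output directory are placeholders, not from this thread, and sudo is shown because the 2M huge-pages backing needs root:

```
# Sketch of a full-speed run using 2 MB huge pages (requires root).
# reads.fastq and ShastaRun are placeholder names.
sudo shasta \
    --input reads.fastq \
    --config Nanopore-May2022 \
    --memoryMode filesystem \
    --memoryBacking 2M \
    --assemblyDirectory ShastaRun
```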

If you don't have access to a machine of that size, there are still a couple of things you can try.

The first option is to tell Shasta to use memory mapped to disk. You do this via --memoryMode filesystem --memoryBacking disk. See here for more information on that, but in summary this is only practical if your storage system uses SSDs (not spinning disks), and in addition the volume that contains your assembly directory must have at least 5 to 7 TB of available space. This option will slow down your assembly considerably, but hopefully to a tolerable extent. The slowdown will depend on the amount of available physical memory.
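
The disk-backed variant of the same sketch only changes the memory options (placeholder file names again):

```
# Disk-backed memory mapping (no root needed); practical only on SSDs.
# The volume holding the assembly directory needs 5-7 TB free.
shasta \
    --input reads.fastq \
    --config Nanopore-May2022 \
    --memoryMode filesystem \
    --memoryBacking disk
```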

The second option is to reduce marker density. This is the fraction of k-mers that are used as markers by Shasta. It is controlled by command line option --Kmers.probability and set to 0.1 in the assembly configuration you are using. When that is reduced, memory requirements are reduced almost by the same factor. So for example if you reduced it to 0.05 your memory requirement would be somewhere around 3 TB. This usually has only a small effect on assembly contiguity and accuracy; however, some tweaks of other assembly parameters may be required. If you run an assembly with this option, please post AssemblySummary.html and I can give suggestions.
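
The arithmetic behind that estimate, as a sketch (the input file name is a placeholder):

```
# Rough memory estimate at marker density 0.05, under the assumptions above:
#   ~800 Gb of reads > 10 Kb  x  6-8 B/b  x  (0.05 / 0.1)  ~=  2.4-3.2 TB
shasta \
    --input reads.fastq \
    --config Nanopore-May2022 \
    --Kmers.probability 0.05
```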

Either way, de novo assembly of such a large genome is challenging. If successful, this will be the largest genome ever assembled with Shasta to my knowledge, and I will be happy to help out in the process. Best of luck and let me know how it goes!

anandravi95 commented 1 year ago

Thank you for such an informative reply!! I will try the settings you suggested and keep you updated with the results :)

anandravi95 commented 1 year ago

Hi @paoloshasta

The second option is to reduce marker density. This is the fraction of k-mers that are used as markers by Shasta. It is controlled by command line option --Kmers.probability and set to 0.1 in the assembly configuration you are using. When that is reduced, memory requirements are reduced almost by the same factor. So for example if you reduced it to 0.05 your memory requirement would be somewhere around 3 TB. This usually has only a small effect on assembly contiguity and accuracy; however, some tweaks of other assembly parameters may be required. If you run an assembly with this option, please post AssemblySummary.html and I can give suggestions.

I tried reducing --Kmers.probability and, as you had mentioned, the memory usage was lower compared to the previous run.

I used two config files, Nanopore-Phased-May2022 and Nanopore-Phased-Jan2022, to run with diploid characteristics and avoid squashing all 6 chromosome copies into 1.

Below are the best assembly stats I have gotten so far from the Assembly-Phased.fasta files, after tweaking some parameters. I have also uploaded the AssemblySummary.html files, as you requested.

Assembly.stats

config: Nanopore-Phased-Jan2022

Parameters:

--Kmers.probability 0.05

--Reads.minReadLength 40000

Max memory used: 1.1 TB

A   C   G   T   N   IUPAC   Other   GC  GC_stdev
0.2748  0.2275  0.2264  0.2714  0.0000  0.0000  0.0000  0.4539  0.0963

Main genome scaffold total:             247684
Main genome contig total:               247684
Main genome scaffold sequence total:    8078.831 MB
Main genome contig sequence total:      8078.831 MB     0.000% gap
Main genome scaffold N/L50:             20912/117.136 KB
Main genome contig N/L50:               20912/117.136 KB
Main genome scaffold N/L90:             7863/185.348 KB
Main genome contig N/L90:               7863/185.348 KB
Max scaffold length:                    1.477 MB
Max contig length:                      1.477 MB
Number of scaffolds > 50 KB:            59773
% main genome in scaffolds > 50 KB:     87.67%

Minimum     Number          Number          Total           Total           Scaffold
Scaffold    of              of              Scaffold        Contig          Contig  
Length      Scaffolds       Contigs         Length          Length          Coverage
--------    --------------  --------------  --------------  --------------  --------
    All            247,684         247,684   8,078,830,637   8,078,830,637   100.00%
     50            224,710         224,710   8,078,266,209   8,078,266,209   100.00%
    100            204,897         204,897   8,076,812,209   8,076,812,209   100.00%
    250            167,706         167,706   8,070,636,880   8,070,636,880   100.00%
    500            138,193         138,193   8,060,075,754   8,060,075,754   100.00%
   1 KB            114,383         114,383   8,043,215,902   8,043,215,902   100.00%
 2.5 KB             96,572          96,572   8,015,982,493   8,015,982,493   100.00%
   5 KB             90,670          90,670   7,995,681,285   7,995,681,285   100.00%
  10 KB             87,521          87,521   7,972,851,586   7,972,851,586   100.00%
  25 KB             79,076          79,076   7,822,568,184   7,822,568,184   100.00%
  50 KB             59,774          59,774   7,082,885,303   7,082,885,303   100.00%
 100 KB             27,366          27,366   4,737,674,491   4,737,674,491   100.00%
 250 KB              3,489           3,489   1,218,331,080   1,218,331,080   100.00%
 500 KB                328             328     206,215,929     206,215,929   100.00%
   1 MB                  6               6       6,972,560       6,972,560   100.00%

AssemblySummary.html.gz

#############################################################

config: Nanopore-Phased-May2022

Parameters:

--Reads.minReadLength 40000 

--Kmers.probability 0.07

Max memory used: 1.2 TB

A   C   G   T   N   IUPAC   Other   GC  GC_stdev
0.2743  0.2267  0.2264  0.2726  0.0000  0.0000  0.0000  0.4531  0.0936

Main genome scaffold total:             243310
Main genome contig total:               243310
Main genome scaffold sequence total:    8036.524 MB
Main genome contig sequence total:      8036.524 MB     0.000% gap
Main genome scaffold N/L50:             20044/121.025 KB
Main genome contig N/L50:               20044/121.025 KB
Main genome scaffold N/L90:             7568/192.327 KB
Main genome contig N/L90:               7568/192.327 KB
Max scaffold length:                    1.6 MB
Max contig length:                      1.6 MB
Number of scaffolds > 50 KB:            57926
% main genome in scaffolds > 50 KB:     87.48%

Minimum     Number          Number          Total           Total           Scaffold
Scaffold    of              of              Scaffold        Contig          Contig  
Length      Scaffolds       Contigs         Length          Length          Coverage
--------    --------------  --------------  --------------  --------------  --------
    All            243,310         243,310   8,036,523,884   8,036,523,884   100.00%
     50            221,540         221,540   8,035,987,591   8,035,987,591   100.00%
    100            202,328         202,328   8,034,574,930   8,034,574,930   100.00%
    250            164,529         164,529   8,028,277,340   8,028,277,340   100.00%
    500            134,315         134,315   8,017,465,251   8,017,465,251   100.00%
   1 KB            111,044         111,044   8,001,097,477   8,001,097,477   100.00%
 2.5 KB             93,711          93,711   7,974,493,325   7,974,493,325   100.00%
   5 KB             88,154          88,154   7,955,258,637   7,955,258,637   100.00%
  10 KB             85,196          85,196   7,934,263,112   7,934,263,112   100.00%
  25 KB             78,171          78,171   7,806,953,475   7,806,953,475   100.00%
  50 KB             57,926          57,926   7,030,355,797   7,030,355,797   100.00%
 100 KB             27,372          27,372   4,824,745,073   4,824,745,073   100.00%
 250 KB              3,730           3,730   1,314,586,799   1,314,586,799   100.00%
 500 KB                345             345     217,658,228     217,658,228   100.00%
   1 MB                 14              14      17,096,969      17,096,969   100.00%

AssemblySummary.html.gz

##########################################################################

config: Nanopore-Phased-May2022

parameter: --Kmers.probability 0.04

Max memory used: 1.6 TB

A   C   G   T   N   IUPAC   Other   GC  GC_stdev
0.2767  0.2238  0.2236  0.2758  0.0000  0.0000  0.0000  0.4475  0.0818

Main genome scaffold total:             118128
Main genome contig total:               118128
Main genome scaffold sequence total:    2129.694 MB
Main genome contig sequence total:      2129.694 MB     0.000% gap
Main genome scaffold N/L50:             10496/55.68 KB
Main genome contig N/L50:               10496/55.68 KB
Main genome scaffold N/L90:             39582/14.151 KB
Main genome contig N/L90:               39582/14.151 KB
Max scaffold length:                    1.211 MB
Max contig length:                      1.211 MB
Number of scaffolds > 50 KB:            12336
% main genome in scaffolds > 50 KB:     54.55%

Minimum     Number          Number          Total           Total           Scaffold
Scaffold    of              of              Scaffold        Contig          Contig  
Length      Scaffolds       Contigs         Length          Length          Coverage
--------    --------------  --------------  --------------  --------------  --------
    All            118,128         118,128   2,129,694,011   2,129,694,011   100.00%
     50            113,769         113,769   2,129,585,360   2,129,585,360   100.00%
    100            109,321         109,321   2,129,256,993   2,129,256,993   100.00%
    250             98,395          98,395   2,127,402,626   2,127,402,626   100.00%
    500             87,250          87,250   2,123,356,559   2,123,356,559   100.00%
   1 KB             76,141          76,141   2,115,404,211   2,115,404,211   100.00%
 2.5 KB             65,058          65,058   2,097,935,854   2,097,935,854   100.00%
   5 KB             55,009          55,009   2,059,729,745   2,059,729,745   100.00%
  10 KB             46,189          46,189   1,996,533,289   1,996,533,289   100.00%
  25 KB             26,492          26,492   1,667,739,007   1,667,739,007   100.00%
  50 KB             12,336          12,336   1,161,854,381   1,161,854,381   100.00%
 100 KB              3,420           3,420     544,985,033     544,985,033   100.00%
 250 KB                307             307     103,757,568     103,757,568   100.00%
 500 KB                 20              20      12,301,665      12,301,665   100.00%
   1 MB                  1               1       1,211,020       1,211,020   100.00%

AssemblySummary.html.gz

##########################################################################

Adjusting the read length cutoff and the k-mer probability helps, but setting the read length cutoff above 50 Kb reduces the assembly size.

Overall, the desired assembly size would be around 12-15 Gb, but it will be very hard to get there.

From a biology point of view, wheat, being hexaploid, has 3 ancestral diploid genomes. Does Shasta support assembling only one ancestor? (Correct me if I am wrong.)

I hope the results give some idea of where things stand; it would be good to know how to improve them.

Please let me know if there is any other information I can provide, or if I missed something.

Under these assumptions, you would need about 5 to 7 TB of memory. If you have access to a machine of that size, I suggest running with --memoryMode filesystem --memoryBacking 2M for best performance (see here for more information). However, those options require root access. If you have access to a machine of that size, but without root access, your assembly will still run without specifying those memory options, but at reduced speed.

Unfortunately, I don't have root access, nor do I have access to a machine of that size.

paoloshasta commented 1 year ago

The first two assemblies, with a 40 Kb read length cutoff, use 308 Gb of input reads. For a 16 Gb genome, this is about 19x coverage. This is generally too low for a good assembly. As a result the assembly summary shows that the read graph only uses about 8 million alignments for 6 million reads, and that almost half of the read bases are isolated in the read graph - that is, they don't participate in the assembly.

The third assembly does not have this problem, and yet it assembled a lot less sequence, probably due to the lower effective read length.

I think one key problem here is the use of phased assembly. Shasta makes a hard assumption of a diploid assembly, which works well for a human genome, but here you really have ploidy 6. As a result, in all three assemblies most of the sequence is assembled "outside bubble chains", which means that the assembler was able to do almost no phasing. This is not surprising as the code is only prepared to deal with two sequence copies, and here you have 6.

Given this, I expect that you will have better luck with a haploid assembly. A haploid assembly with the standard 10 Kb read length cutoff will give you the same coverage as your third assembly, 795 Gb or 50x for a 16 Gb genome, which is a good amount of coverage for Shasta. If you run it with marker density 0.04, like your third assembly, you should again be able to assemble in a few hours, because the memory usage of haploid and diploid assembly is comparable.

In summary, I suggest the following for your next attempt:

--config Nanopore-May2022 --Kmers.probability 0.04

(This will use the 10 Kb read length cutoff from Nanopore-May2022).
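
Spelled out as a full command line, as a sketch (input file name and output directory are placeholders):

```
# Haploid assembly with the 10 Kb cutoff from Nanopore-May2022 and
# reduced marker density; reads.fastq and ShastaRun are placeholders.
shasta \
    --input reads.fastq \
    --config Nanopore-May2022 \
    --Kmers.probability 0.04 \
    --assemblyDirectory ShastaRun
```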

If you try this, let me know how it goes, again posting AssemblySummary.html. It is possible/likely that additional tweaks will be needed, but I think this is a logical next step. The good news is that you are able to run these assemblies in a few hours on machines that are available to you.

Depending on the similarity between the 6 copies, in a haploid assembly some sequence will likely be collapsed (two or three copies assembled as one), and so I think it is unlikely that you will get the full expected genome size. I am working on improvements in Shasta that should allow it to do a better job at separating multiple copies of similar sequence. This is mostly motivated by segmental duplications in human assemblies, but should also help with high ploidy genomes.

As an alternative, using Ultra-Long (UL) reads with the existing Shasta haploid assembly also has the potential for doing a better job at separating sequence copies. But I recognize that getting a sufficient amount of UL reads for such a large genome is a large undertaking.

paoloshasta commented 1 year ago

Your question on assembling only one of the three ancestors is related to my above comment about collapsed copies. In a Shasta haploid assembly, it is often possible to increase the amount of collapsing by loosening alignment criteria. Key parameters affecting this are the similarity between the three ancestors, and the heterozygosity rate within each of the three. Do you have any estimates for those?

anandravi95 commented 1 year ago

I will post the results for --config Nanopore-May2022 --Kmers.probability 0.04 soon.

Your question on assembling only one of the three ancestors is related to my above comment about collapsed copies. In a Shasta haploid assembly, it is often possible to increase the amount of collapsing by loosening alignment criteria. Key parameters affecting this are the similarity between the three ancestors, and the heterozygosity rate within each of the three. Do you have any estimates for those?

I don't have an estimate for the heterozygosity rates. By alignment criteria, do you mean the minAlignedFraction value?

paoloshasta commented 1 year ago

Mostly that, but also minAlignedMarkerCount and, to a lesser extent, maxSkip, maxDrift, and maxTrim. If you experiment with these 5 alignment criteria, keep in mind they are all specified in marker space.

Also keep in mind --config Nanopore-May2022 uses adaptive selection of alignment criteria (via --ReadGraph.creationMethod 2). For the alignment criteria you specify to become effective you also need to use --ReadGraph.creationMethod 0. If you experiment with alignment criteria, you can use as starting points and for guidance the values adaptively selected when using --ReadGraph.creationMethod 2, and reported in AssemblySummary.html under the heading Alignment criteria actually used for creation of the read graph.

If you switch to --ReadGraph.creationMethod 0 I also suggest increasing the number of MinHash iterations, for example --MinHash.minHashIterationCount 100. More MinHash iterations will be beneficial when using --ReadGraph.creationMethod 0, but somehow interfere with adaptive selection of alignment criteria. You could also consider adjusting --MinHash.minBucketSize and --MinHash.maxBucketSize using the information in LowHashBucketHistogram.csv for guidance - let me know if you need more information on that. There was some discussion on that in another Shasta issue (possibly in the old Shasta repository https://github.com/chanzuckerberg/shasta).
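
Putting those pieces together, an experiment might look like the sketch below. All numeric values for the alignment criteria are placeholders; start from the adaptively selected values reported in AssemblySummary.html.

```
# Illustrative sketch only: explicit alignment criteria take effect
# with --ReadGraph.creationMethod 0. The numbers below are
# placeholders, to be replaced with the values AssemblySummary.html
# reports under "Alignment criteria actually used for creation of
# the read graph".
shasta \
    --input reads.fastq \
    --config Nanopore-May2022 \
    --ReadGraph.creationMethod 0 \
    --MinHash.minHashIterationCount 100 \
    --Align.minAlignedFraction 0.4 \
    --Align.minAlignedMarkerCount 100 \
    --Align.maxSkip 30 \
    --Align.maxDrift 15 \
    --Align.maxTrim 30
```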

anandravi95 commented 1 year ago

I tried with the following parameters:

config: Nanopore-May2022

--Reads.minReadLength 40000

--Kmers.probability 0.04

--ReadGraph.creationMethod 0 

--MinHash.minHashIterationCount 100 

--Align.minAlignedMarkerCount 100

Memory used: 2.2 TB

Run time: ~8.5hrs

Assembly stats

A   C   G   T   N   IUPAC   Other   GC  GC_stdev
0.2733  0.2290  0.2275  0.2702  0.0000  0.0000  0.0000  0.4565  0.0383

Main genome scaffold total:             69772
Main genome contig total:               69772
Main genome scaffold sequence total:    12420.223 MB
Main genome contig sequence total:      12420.223 MB    0.000% gap
Main genome scaffold N/L50:             11653/322.849 KB
Main genome contig N/L50:               11653/322.849 KB
Main genome scaffold N/L90:             2435/643.919 KB
Main genome contig N/L90:               2435/643.919 KB
Max scaffold length:                    3.391 MB
Max contig length:                      3.391 MB
Number of scaffolds > 50 KB:            50478
% main genome in scaffolds > 50 KB:     97.32%

Minimum     Number          Number          Total           Total           Scaffold
Scaffold    of              of              Scaffold        Contig          Contig  
Length      Scaffolds       Contigs         Length          Length          Coverage
--------    --------------  --------------  --------------  --------------  --------
    All             69,772          69,772  12,420,223,033  12,420,223,033   100.00%
     50             69,434          69,434  12,420,212,702  12,420,212,702   100.00%
    100             69,262          69,262  12,420,199,692  12,420,199,692   100.00%
    250             68,918          68,918  12,420,142,287  12,420,142,287   100.00%
    500             68,482          68,482  12,419,979,256  12,419,979,256   100.00%
   1 KB             67,787          67,787  12,419,467,225  12,419,467,225   100.00%
 2.5 KB             66,171          66,171  12,416,657,071  12,416,657,071   100.00%
   5 KB             64,055          64,055  12,408,841,116  12,408,841,116   100.00%
  10 KB             60,960          60,960  12,386,230,638  12,386,230,638   100.00%
  25 KB             56,314          56,314  12,309,701,693  12,309,701,693   100.00%
  50 KB             50,479          50,479  12,087,143,584  12,087,143,584   100.00%
 100 KB             38,222          38,222  11,174,544,399  11,174,544,399   100.00%
 250 KB             16,802          16,802   7,669,493,356   7,669,493,356   100.00%
 500 KB              4,838           4,838   3,499,777,076   3,499,777,076   100.00%
   1 MB                504             504     652,356,031     652,356,031   100.00%
 2.5 MB                  5               5      14,719,002      14,719,002   100.00%

AssemblySummary.html.gz

Depending on the similarity between the 6 copies, in a haploid assembly some sequence will likely be collapsed (two or three copies assembled as one), and so I think it is unlikely that you will get the full expected genome size. I am working on improvements in Shasta that should allow it to do a better job at separating multiple copies of similar sequence. This is mostly motivated by segmental duplications in human assemblies, but should also help with high ploidy genomes.

The genome size has improved after the haploid assembly and the parameter tweaks you suggested. Might this be because of the collapsing of sequences?

Keeping the standard 10 Kb read length cutoff was giving memory issues. However, an assembly is currently running with the 10 Kb default value; hopefully it doesn't crash again. I will post those results once it is done.

I am very curious to know in detail what these parameters do, and some more information about LowHashBucketHistogram.csv would be really helpful, as I couldn't find an issue related to it.

paoloshasta commented 1 year ago

At 12.4 Gb assembled you are getting into your expected range of 12-15 Gb. The big increase in assembled sequence is due to switching from phased diploid assembly to haploid assembly. Phased diploid assembly works well for a genome that is actually diploid, but your situation is very different, and the assumptions made during diploid assembly are not helpful.

But the assembly is still highly fragmented, with an N50 of only 323 Kb. This is not surprising because of the low amount of coverage. With the 40 Kb read length cutoff, the assembly only used 308 Gb as input, which is about 20x for a 15 Gb genome. High fragmentation (low N50) is the most common symptom of low coverage. So I think you need to increase coverage by reducing the read length cutoff. Perhaps not necessarily down to 10 Kb, but possibly somewhere in between. I would want to make sure to have at least 40x coverage, which for a 15 Gb genome means 600 Gb of input. If this gives you memory problems, reduce the marker density (--Kmers.probability) further, as that does not seem to be causing issues.
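
One possible shape for that next run, as a sketch (the cutoff and marker density below are placeholder values chosen to illustrate the trade-off, not tested values):

```
# Placeholder values: a cutoff between 10 and 40 Kb chosen to bring in
# at least ~600 Gb of input (~40x for a 15 Gb genome), with marker
# density lowered further if memory becomes a problem.
shasta \
    --input reads.fastq \
    --config Nanopore-May2022 \
    --Reads.minReadLength 25000 \
    --Kmers.probability 0.03
```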

And you should definitely optimize the choices of the minimum/maximum bucket sizes for the MinHash algorithm. You can find here the discussion I was referring to. Let me know if additional clarification on that is needed.

sivico26 commented 1 year ago

@anandravi95 have a look at #11. TL;DR: give Nanopore-Plants-Apr2021.conf a try. It might work better for your data.

anandravi95 commented 1 year ago

@anandravi95 have a look at #11. TL;DR: give Nanopore-Plants-Apr2021.conf a try. It might work better for your data.

Thank you for the suggestion. I will try and post the results later.

colindaven commented 1 year ago

That was a trial dataset with older public 9.4.1 data. We recently got excellent brand-new ONT 10.4.1 (Kit 14) data from the same species.

Again, this is a ca. 17 Gb plant genome.

Modal read accuracy after alignment to a close reference was 98.5% (it was 88% modal in the first 9.4.1 dataset analyzed by Anandh). Both analyses were performed with cramino.

On this data, with 98.5% modal accuracy, Shasta performed excellently with the May 2022 config (I tried other configs, but all were significantly worse).

Stats: 19.3 Mb N50, ca. 2,800 contigs, assembly size about 13.6 Gb, a bit on the small side.

Time/resources: 12 hours on an 80-core, 3 TB RAM (2.7 TB used), 7+ year old machine.

Data prep: the main trick was to use filtlong to keep the longest 70% of reads, which subsampled to about 40x coverage so the assembly would fit into the 3 TB of RAM available.
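
A sketch of such a filtering step (file names are placeholders; filtlong's --keep_percent keeps the best-scoring fraction of bases, and a high --length_weight makes read length dominate the score):

```
# Keep roughly the best 70% of bases, scored mostly by read length.
# reads.fastq.gz / reads.filtered.fastq.gz are placeholder names.
filtlong --keep_percent 70 --length_weight 10 reads.fastq.gz \
    | gzip > reads.filtered.fastq.gz
```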

Thanks again for this awesome assembler!

Hope that helps, Colin

paoloshasta commented 1 year ago

Nice, and thank you for the kind words! A few comments/questions follow.

On this data, with 98.5% modal accuracy, Shasta performed excellently with the May 2022 config (I tried other configs, but all were significantly worse).

In your experiments with other configs, did you try the Nanopore-R10-Fast-Nov2022 assembly configuration? This was optimized especially for R10 reads at 35x coverage - but for human genomes. It is available in Shasta 0.11.1.

ca. 2800 contigs.

In Shasta I made a conscious decision to write out all assembled contigs regardless of length. This is to avoid loss of information, and because there may be cases where some of the short ones are interesting for one reason or another. And it is easy to filter them out based on length. For this reason, the number of assembled contigs is not a significant metric.

Data prep: the main trick was to use filtlong to keep the longest 70% of reads, which subsampled to about 40x coverage so the assembly would fit into the 3 TB of RAM available.

You can also use Shasta command line options --Reads.desiredCoverage or --Reads.minReadLength to achieve the same result without an extra step. The filtering is done on the fly while the reads are loaded. But filtering in advance can be useful when doing multiple assemblies on the same reads, so you don't have to read the complete dataset each time.
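
As a sketch of that on-the-fly alternative (the value format shown for --Reads.desiredCoverage is an assumption on my part; check the Shasta documentation for the exact forms it accepts):

```
# Subsample to ~600 Gb of input on the fly, no separate filtering step.
# A plain base count is shown as the value; this format is an
# assumption, so consult the documentation before running.
shasta \
    --input reads.fastq \
    --config Nanopore-May2022 \
    --Reads.desiredCoverage 600000000000
```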

To increase the amount of assembled sequence, you could consider increasing coverage used for assembly to around 60x. If this makes the run unmanageable for your 3 TB machine you could reduce marker density (--Kmers.probability) a bit to reduce memory requirements.

paoloshasta commented 1 year ago

I am closing this due to lack of discussion. Feel free to reopen it or create a new issue if additional topics emerge.

paoloshasta commented 1 year ago

Ditto