Assembly configurations for ONT R10 reads

paoloshasta commented 1 year ago

As soon as ONT R10 reads are available in a stable form, new assembly configurations for those reads should be created.

This corresponds to issues 273 and 295 in the pre-fork chanzuckerberg/shasta repository @yuz682 @rob234king.

paoloshasta commented 1 year ago

An assembly configuration for haploid assembly of ONT R10 reads, fast mode, is available in shasta/conf/Nanopore-R10-Fast-Nov2022.conf (direct download here). This will become a Shasta built-in assembly configuration, but for now you can use it via this file. It is compatible with Shasta 0.10.0.

See the comments at the beginning of the file for applicability and additional information. If you try this configuration, it would help if you can comment here on your results, as assembly configurations are always subject to improvement

I am also working on a corresponding configuration for phased assembly, and after that also on configurations for R10, slow mode reads. I will probably create a new Shasta release when these configurations are available and turned into built-in configurations that can be used without the need to access the corresponding files.

@yuz682 @rob234king

colindaven commented 1 year ago

Dear @paoloshasta . Thanks for this, we are very interested as well. Just a quick question - What do you mean by R10 fast mode ?

AFAIK we have our first q20 data, but not whether it is "fast" or "slow" mode. Does this refer to the speed of the DNA running through the nanopore, which will be configurable in the future (or perhaps even now, at least for your team).

paoloshasta commented 1 year ago

My knowledge of the sequencing process is limited, but my understanding is that you can run DNA through the nanopore at two speeds: a faster speed that gives slightly lower quality and higher yield (and runs faster, of course), and a slow mode that gives the best quality, at a price of some reduction in yield and sequencing speed. The reads I used to create this assembly configuration were created using the "fast" mode.

I found some mention of that here, but I will ask one of my colleagues more knowledgeable on the sequencing technology to post a more precise answer.

mitenjain commented 1 year ago

@paoloshasta is correct in that "fast mode" represents R10 sequencing at ~400 bases per second. We are also assessing performance for "slow mode" where the sequencing speed is ~260 bases per second (and yields slightly higher accuracy but a lower throughput).

By default, the R10 or Q20 data are in the "fast" (i.e. 400 bps) mode. This is likely what you have been using @colindaven!

Please let us know if you have any questions.

paoloshasta commented 1 year ago

I also added an assembly configuration for phased diploid assembly under the same conditions. It is shasta/conf/Nanopore-Phased-R10-Fast-Nov2022.conf, direct download here.

Please see the comments at the beginning of the file for applicability and expected results.

colindaven commented 1 year ago

Thanks for the details.

We have the interesting situation that we have Q10 and Q20 data for the same plant genotype.

We expect genome size 700-750MB. Datasets should be about 100X coverage.

Q10

Plant Apr 2021 config - 12 MB N50 (very decent assembly)
Nanopore May 2022 config - 24 MB N50 (really great!)
both asms have around 600 MB assembled

Q20

config Nanopore-R10-Fast-Nov2022.conf - (with 15kbp minimum readlength) 0.2 MB N50
only 472 MB assembled
config Nanopore-R10-Fast-Nov2022.conf - (with 10kbp minimum readlength) 0.19 MB N50
only 470 MB assembled
Most of the older configs don't work for Q20 data, or produce equally bad or worse results

We can supply more diagnostic files from the run if needed, here are some I hope are helpful.

From the Q20 min Read length 15kbp run

DisjointSetsHistogram.csv Binned-ReadLengthHistogram.csv LowHashBucketHistogram.csv.gz

From the Q20 min Read length 10kbp run AssemblySummary.zip

rob234king commented 1 year ago

Just catching up, I've not tried new configs for Shasta yet but would be good to know what version of guppy was basecalling done? The chimera rate is much lower now I think with latest guppy config changes which might impact on the assembly with shasta.

I found with my assembly that long reads and less coverage was better and too much coverage led to breaks in the assembly with shasta but limited experience and not the latest shasta configs.

Have you tried both duplex and 1D reads in the assembly?

paoloshasta commented 1 year ago

From the assembly summary I see that there are only about 2 million alignments for 6 million reads. For a good assembly we typically need 6 to 10 alignments per read, so it is not surprising that the resulting assembly was fragmented and incomplete.

There were almost 600 million alignment candidates found by the MinHash algorithm, so the problem is not there. Rather, the alignment criteria used by that assembly configuration are too strict for the reads you are using. This must mean that the reads you are using have lower accuracy than the ones I used to create these assembly configurations. Here are some reasons this could have happened:

Differences in the sequencing process. This is possible because R10 reads are not well settled.
Differences in the base caller. Did you request "super" accuracy when running the base caller?
I have occasionally noticed that read accuracy on plants is not as good as for human genomes. It is possible that some phased of the sequence or basecalling process are optimized for human genomes.

Whatever the reason is, the cure will be to relax alignment criteria. We can do this in two ways:

You do it and I can give input along the way. If you want to do it this way, the first step would be to override --Align.minAlignedFraction with a lower value than the value 0.85 used in that assembly configuration. This parameter is the minimum fraction of aligned markers for an alignment to be accepted as valid.
I would be happy to do it if you can give me access to the reads. If you want to do it this way, you can contact me privately at the e-mail address indicated in my GitHub profile. Of course, I would use the data for no other purpose than helping you optimize the assembly configuration. With your approval, I could then turn the resulting assembly configuration into a new built-in assembly configuration in Shasta.

A separate issue is that the R10 assembly configurations I created are optimized for much lower coverage than what you have. This may generate additional issues later, but the first thing to do is to get a sufficient number of alignments.

paoloshasta commented 1 year ago

@rob234king is correct about chimera rates (an estimate is reported by Shasta in stdout.log), but the main problem here is the lack of a sufficient number of alignments. The reads I worked with had chimera rates around or below 2%.

@rob234king the reasons that assemblies worsen at higher coverage is that existing configurations are optimized for coverage between 30x and 80x. It may be possible to create configurations optimized for higher coverage. If you have a lot of coverage, I suggest using --Reads.minReadLength or --Reads.desiredCoverage to reduce coverage by using only the longer reads.

colindaven commented 1 year ago

Thanks for the very detailed suggestions @paoloshasta

They helped a lot. Now at 14MB N50 (with Flye we got 24MB, so still some room for improvement, but far better than our last shasta N50 of ca 200kbp).

Chimera rate appears low: stdout.log:Chimera rate is 0.00901411
--Reads.desiredCoverage appears buggy in our hands. Setting to 40x or 50x caused minReadLength to be auto-set to 0. Then segfault.
Apologies, we cannot make the reads available
Guppy basecaller was 6.06 super acc: r104_prom_sup_g606
Setting --Align.minAlignedFraction to 0.70 and --Reads.minReadLength to 35000 led to this assembly. We're not done and are trying others, eg 0.75 and 10000 for the above two parameters. However, it's worth noting coverage is excessively high for this dataset (about 200X), and that is why I set the read length so high, namely to reduce the amount of data going in.
These are not duplex reads, all simplex.

Assembly stats are pretty good. Even the genome size is closer to what is expected and closer to the Flye result (we got 600MB with the Q10 assembly mentioned in the previous comments, here we have 642 MB). There's far too many tiny contigs but these would be removed anyway.

Assembly.fasta
A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.3224  0.1792  0.1789  0.3195  0.0000  0.0000  0.0000  0.3581  0.0662

Main genome scaffold total:             4501
Main genome contig total:               4501
Main genome scaffold sequence total:    642.704 MB
Main genome contig sequence total:      642.704 MB      0.000% gap
Main genome scaffold N/L50:             14/14.206 MB
Main genome contig N/L50:               14/14.206 MB
Main genome scaffold N/L90:             43/4.073 MB
Main genome contig N/L90:               43/4.073 MB
Max scaffold length:                    58.803 MB
Max contig length:                      58.803 MB
Number of scaffolds > 50 KB:            98
% main genome in scaffolds > 50 KB:     98.70%

Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig  
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                  4,501           4,501     642,703,675     642,703,675   100.00%
     50                  4,203           4,203     642,696,684     642,696,684   100.00%
    100                  4,089           4,089     642,688,244     642,688,244   100.00%
    250                  3,726           3,726     642,624,488     642,624,488   100.00%
    500                  3,244           3,244     642,442,737     642,442,737   100.00%
   1 KB                  2,399           2,399     641,816,683     641,816,683   100.00%
 2.5 KB                  1,055           1,055     639,647,535     639,647,535   100.00%
   5 KB                    425             425     637,487,816     637,487,816   100.00%
  10 KB                    178             178     635,798,231     635,798,231   100.00%
  25 KB                    115             115     634,951,100     634,951,100   100.00%
  50 KB                     98              98     634,323,653     634,323,653   100.00%
 100 KB                     89              89     633,712,460     633,712,460   100.00%
 250 KB                     72              72     631,314,238     631,314,238   100.00%
 500 KB                     66              66     629,159,072     629,159,072   100.00%
   1 MB                     60              60     624,566,030     624,566,030   100.00%
 2.5 MB                     54              54     615,126,899     615,126,899   100.00%
   5 MB                     36              36     549,604,115     549,604,115   100.00%
  10 MB                     25              25     463,933,759     463,933,759   100.00%
  25 MB                      5               5     190,351,282     190,351,282   100.00%
  50 MB                      1               1      58,803,300      58,803,300   100.00%

Any further suggestions would be gratefully received.

AssemblySummary.zip

paoloshasta commented 1 year ago

Nice, getting there! Here are some comments.

Chimera rate appears low: stdout.log:Chimera rate is 0.00901411

Good. Shasta underestimates chimera rates a bit, but this value is in the right ballpark and similar to what I had when optimizing the R10 assembly configurations.

--Reads.desiredCoverage appears buggy in our hands. Setting to 40x or 50x caused minReadLength to be auto-set to 0. Then segfault.

You have to enter the coverage in bases, not in "x" form, because Shasta does not know (and does not want to know) the genome size. For example for your 700 Mb genome --Reads.desiredCoverage 50Gb would automatically set the read length cutoff to give you 50 Gb of reads or about 70x for the genome you are working with. See here for more details on this option. But a segment fault is not acceptable and I will try and reproduce the problem, then modify the code to hopefully end with an informative message.

Apologies, we cannot make the reads available

No problem, and it looks like you are able to make progress anyway.

Guppy basecaller was 6.06 super acc: r104_prom_sup_g606

Good, I think this was similar to what was used to create the reads I was working with.

Setting --Align.minAlignedFraction to 0.70 and --Reads.minReadLength to 35000 led to this assembly.

How much coverage did you get with --Reads.minReadLength 35000?

We're not done and are trying others, eg 0.75 and 10000 for the above two parameters.

That is the way to do it. It is likely that optimal coverage is around 60x to 80x.

However, it's worth noting coverage is excessively high for this dataset (about 200X), and that is why I set the read length so high, namely to reduce the amount of data going in.

That is exactly what I do when I have a lot of coverage. The --Reads.desiredCoverage option is also useful for that.

These are not duplex reads, all simplex.

Yes, that is what I have been working with. Once I get reasonably stable duplex reads I will create assembly configurations for those too.

There's far too many tiny contigs but these would be removed anyway.

The small segments are useful in some cases, if used in conjunction with the assembly graph connectivity as contained in GFA output. If you are not interested in that information, you can just remove them. I don't want to remove them by default in Shasta because they sometimes contain useful information for some applications.

So overall things are moving in the right direction, and I would expect that you should be able to bring the assembly N50 in the 20 to 30 Mb range. But the big question is why your reads are much less accurate than the ones I worked with, forcing relaxed alignment criteria (and of course relaxing them too much can lead to other problems). Perhaps @rob234king or somebody else from ONT may be able to comment on this? Do you have a way to independently assess the accuracy of your reads? For the dataset I used to optimize these configurations, the peak of the read identity distribution was around 98.9%.

Also worth mentioning that you can look at the details of alignments by using the Shasta http server - see here for some documentation. That documentation is a bit sparse, so I would be happy to help with that if you decide to experiment with it. In addition to alignment details, the Shasta http server allows you to dig into the assembly in a lot of detail, and I use that functionality extensively when working with assembly configurations.

paoloshasta commented 1 year ago

I just created Shasta release 0.11.0, which includes 4 news configurations for R10 reads. See here for information on current Shasta assembly configurations. The R10 assembly configurations are built-in in Shasta 0.11.0, so they can be used without the need for the corresponding configuration files.

paoloshasta commented 1 year ago

@colindaven you are right that --Reads.desiredCoverage does not work with the R10 assembly configurations. I tracked it down to the fact that these assembly configurations don't use RLE, and a bug is preventing this option from working when RLE is turned off. Thank you for reporting this.

I opened a new issue #4 on that. In my case it caused Shasta to terminate with an assertion, not a segment fault, but it is certainly broken. It works correctly when RLE is turned on. So for now, when using any of the R10 assembly configurations, you will have to adjust coverage manually via --Reads.minReadLength. You can use the information in Binned-ReadLengthHistogram.csv as guidance when doing that.

paoloshasta commented 1 year ago

I checked with the people who did the sequencing for the reads I used to optimize the Nanopore-R10-Fast-Nov2022 assembly configuration. For base calling, they used Guppy configuration dna_r10.4.1_e8.2_sup. This is different from the one you used r104_prom_sup_g606. They also used Guppy 6.2.11, which is newer than the one you used, version 6.06 (or 6.0.6?).

@colindaven I wonder if this explains why your reads seem to have lower accuracy than the "fast mode" reads I have been working with. But it could also be hardware (pore) or chemistry related (e. g. 10.4 versus 10.4.1?).

colindaven commented 1 year ago

We've been checking read accuracy and trying out all sorts of settings too. The best assembly is 14mb so far listed above, despite trying lots of further settings. When we have some more insights we'll provide them, or new assemblies with forthcoming q20 datasets in a new issue.

We'll check BUSCO scores for these assemblies and get back to you in the next week or so. Later on we'll try to assess contigs between different assemblers using synteny.

In general, the assemblies for this 750 MB genome are pretty decent and usable we feel, and very fast. However, the main advantage for Shasta for us is probably going to be with bigger genomes in the 2-5 gigabase range, where other assemblers such as Flye are non-tractable in terms of runtime.

paoloshasta commented 1 year ago

In the past, we have generally found BUSCO scores uninformative if applied to unpolished assemblies directly out of Shasta, because of the frequent indel errors. Rather, we have been running BUSCO after doing some form of polishing. However, this might be less of a concern today, with these much more accurate R10 reads.

Since you are interested in larger genomes and more generally in assembly performance, here is some possibly useful information.

For best performance you should run Shasta with --memoryMode filesystem --memoryBacking 2M, if you are not already doing it. This causes all large data structures to be stored on Linux large pages, which are faster to allocate and access. This generally results in significant improvements in assembly performance, although the benefit varies widely depending on many factors. These options require root access via sudo. This is not a problem if you are running your assemblies on cloud servers, or if you have full control of the machines you use for assembly. However it can be problematic in shared environments. If you use these options, you should run Shasta command cleanupBinaryData when you are done using the data in memory (typically, when the assembly finished, unless you plan to use the Shasta http server). More information on this here.
Memory requirements for assembly are typically around 6 to 8 bytes per input base. So for example if you wanted to assemble a 5 Gb genome at 50x you would need approximately 5x50x8 = 2000 GB or about 2 TB of memory. We have in fact been able to assemble a 5 Gb genome on a 2 TB machine, with good contiguity (N50 above 30 Mb).
If you don't have enough memory for your assembly and your system has fast SSD storage with sufficient free space, you can run Shasta with --memoryMode filesystem --memoryBacking disk (no root access required). This will slow down the assembly, however. See here for more information.
Another option to reduce memory requirements and improve performance is to reduce the marker density, usually set at 0.1 (10%). This is the fraction of k-mers that are used as markers, and is controlled via command line option --Kmers.probability. This could be especially useful for very large genomes. My experience with this is limited, but I have indications that reducing it will reduce memory requirements and assembly time almost by the same factor, with a small effect on assembly quality and contiguity. However if the marker density is reduced it may also be necessary to recalibrate other assembly parameters, especially alignment criteria.

paoloshasta commented 1 year ago

The bug in --Reads.desiredCoverage is fixed in Shasta 0.11.1.

Aline-Git commented 1 year ago

Hello,

Thanks for this very useful tool !

We would like to use Shasta to assemble ONT sequences of HEK293 cell line. We will have ~60Gbp of sequences from 10.4 chemistry (N50 ~20) and ~2Gbp of UL sequences from 9.4 sequences. I was thinking to use the configuration file 'Nanopore-Phased-R10-Slow-Nov2022.conf'. I have two questions:

Should we tune the configuration file to take into account the UL 9.04 sequences ? If yes, some advises would be precious.
Since the HEK cells are inbetween diploid and triploid cells, will there be a problem with the chromosomes that are present in three copies (other than being chimeric) ?

Thanks for your help !

Aline

rob234king commented 1 year ago

Another question on different data set: FYI: I'm only testing on a very small subset of reads at telomeres.

I have a telomere containing subset of reads and appears the default config Nanopore-R10-Slow-Nov2022 with 0.11.1 works well and I don't see errors in the assembly but ~101bp of the 3 prime end is not assembled and only part of the telomere is assembled. So two part question. Is there a parameter to get it to assemble to the length of the reads as coverage is good 101 bp to 3' end of sequence pileup? And to get more repeating sequence to assemble on the 5' end is there a parameter. As I know telomere continues at 5' I could just right something to extend it to that length but good if option to assemble more of it.

paoloshasta commented 1 year ago

To make it easier to follow the discussion I created two new issues corresponding to the last two posts:

7 for @Aline-Git's post.
8 for @rob234king's post.

Let's continue the discussion there.

paoloshasta / shasta

Assembly configurations for ONT R10 reads #1

7 for @Aline-Git's post.

8 for @rob234king's post.