paoloshasta / shasta

De novo assembly from Oxford Nanopore reads.
https://paoloshasta.github.io/shasta/
Other
74 stars 11 forks source link

Assembling large plant genome with R10 reads #11

Closed sivico26 closed 1 year ago

sivico26 commented 1 year ago

Dear @paoloshasta,

I am using shasta to assembly a challenging plant genome (~4Gb, ~80% repetitive content, ~1.74% heterozygosity, etc.). If I do a cutoff at 10k, I have around 62x of R10 Nanopore data. Since this kind of data is not yet optimized for plant genomes in Shasta, I have read the issue tracker (e.g #1 ) to guide myself.

At first, I used the recently created configs Nanopore-R10-Fast-Nov2022.conf and Nanopore-R10-Slow-Nov2022.conf, but they both gave poor results:

n       n:500   L50     min     N75     N50     N25     E-size  max     sum     name
30424   30417   3849    505     136568  296906  660008  581522  8200438 4.798e9 Flye
99129   92959   12171   500     53829   102810  190074  149927  1382546 4.551e9 FastR10
9813    9720    3268    509     14097   17452   23886   20441   175226  160.3e6 SlowR10
10164   9838    971     501     567918  1207213 2200885 1636381 10.43e6 4.216e9 Nanopore-Plants-Apr2021

The first assembly is one done with Flye (although only with around 40x), which is not great either but is useful to compare. As you can see, both the Fast and Slow configs struggled with this genome (especially the Slow that threw away 95% of the genome for some reason; I can give you the log of that run if you are interested).

However, I got curious and also tried the configuration for Plants of 2021 (last row), and this one yield much better results (at least in the Mb scale). My reasoning was that, even though R10 has shown a great improvement over R9, the gap between the improvements for Nanopore shrinks when we are dealing with Plants, since their genomes are more complex, with more weird modifications, and Nanopore normally benchmarks mostly with Human and bacteria. Thus, I reasoned that my data would be closer to the old configuration for Plants R9 than to the newer FastR10. This seems to be the case, or maybe is just a lucky strike.

Anyway, since I used an old configuration and R10 is remarkably better than R9 (even for plant genomes), I think there is still room for improvement, so I wanted to ask your help to further configure Shasta to better fit this data. This is the command line I used for the last assembly:

shasta --threads $threads --input $reads --config Nanopore-Plants-Apr2021 --Assembly.consensusCaller Modal

I added the --Assembly.consensusCaller Modal because it is what is being used in more recent configurations (i.e. FastR10), is that reasonable to you?

These are the Assembly summary and the log: AssemblySummary.html.txt stdout.log

So, how would you further tweak the configuration to improve that assembly? As the current configuration is expecting noisier R9 reads, one way to go is to be more strict with the quality of the reads/alignments since the error rates for R10 are better. Which parameters specifically would you tweak?

Thank you in advance. Sivico

P.S. I am amazed at how fast Shasta is compared to other assemblers. I am seeing hundreds of hours needed for other assemblers compared to just a few for Shasta. How did you achieve that? Anywho, big kudos for that.

paoloshasta commented 1 year ago

I agree it is likely that this assembly can be improved, probably significantly, with some optimization of assembly parameters. You need an R10 version of Nanopore-Plants-Apr2021, and I will be happy to help you develop a version of that works well for the genome you are interested in. If things work out well and you are willing to share, we could turn the end result of this process into a new Shasta built-in configuration.

First I have a couple of questions for you.

Under the assumption that the reads are "fast" mode reads basecalled with "super" accuracy, I think it is best to optimize the assembly starting from the Nanopore-R10-Fast-Nov2022 assembly rather then the Nanopore-Plants-Apr2021 assembly, even though the latter is currently giving a much better result. The reason is that, because R10 reads are so much more accurate, the R10 configurations don't use RLE (Run-Length Encoding, aka homopolymer compression) to represent the reads. And you don't want to be stuck with an assembly configuration with RLE for R10 reads, because from my experimentation this is clearly not the right thing to do.

So, under the above assumptions for the reads, I suggest that you post the following files from the Nanopore-R10-Fast-Nov2022 assembly:

With this I will have a better idea of what is going on in that assembly and I will be able to make suggestions for improvements.

The R10 configurations use --Assembly.consensusCaller Modal because a consensus called for repeat count is not needed when not using the RLE representation. Using it with the R9 configuration, which uses RLE, is not a good idea as it can reduce the quality of assembled sequence. However, this has no effect on the large scale metrics of the assembly you are currently looking at.

Thank you for the nice comments on Shasta's speed. As explained in the documentation section on computational methods, it is mostly due to using a compact representation of the reads (marker representation) during most assembly phases.

sivico26 commented 1 year ago

Dear @paoloshasta,

Thanks for the fast reply and your support. Regarding your questions:

  • Are these "fast" or "slow" mode R10 reads? if I remember correctly, "fast" means 400 bases per second and "slow" means 220 bases per second. I think most peope are using "fast" mode, and it is possible that "slow" mode is being phased out by ONT, but I really don't know about that.

Yes, I can confirm to you that the "Slow" mode of sequencing has been phased out by Nanopore.

Did you use Guppy to basecall these reads? If so, did you use a configuration with "super" accuracy? The Shasta R10 assembly configurations assume that this is the case. If you did not use "super" accuracy, I strongly suggest rerunning the base caller because that is the way to get the full accuracy of R10.

Yes, these are R10 reads basecalled on Guppy with Super-accuracy model.

If I understood correctly, the assembly made with the Plant configuration (using RLE) might ended up with a consensus sequence probably not adequate/reliable, is that right?

Anyway, the following are the requested files for the FastR10 run:

Let me know if you need more information. Cheers

paoloshasta commented 1 year ago

If I understood correctly, the assembly made with the Plant configuration (using RLE) might ended up with a consensus sequence probably not adequate/reliable, is that right?

That consensus sequence should have the typical accuracy of an R9 assembly or possibly better, but it would be hard to use that assembly as a starting point to improve the N50.

From the information you posted for the Nanopore-R10-Fast-Nov2022 assembly, I see that 58% of the reads are isolated in the read graph (see under "Read Graph" in AssemblySummary.html). The reason is that the read graph used only 18 million alignments for 11 million reads. In a healthy assembly, the typical ratio of alignments per reads is around 5 or so and the fraction of isolated reads is around 20-30%. As a result of this, the assembly is highly fragmented, as I was able to confirm by loading the gfa file in Bandage). This is a common result for assemblies with insufficient numbers of alignments.

To improve this situation you will need to loosen alignment criteria. This means that your reads are probably less accurate than the reads (from a human genome) I used to optimize the R10 assembly configuration. This is not too surprising because we already know from R9 that plant reads tend to be less accurate than human reads - from which the need for assembly configurations optimized for plant genomes.

In addition, your reads may be a bit shorter than the ones I used to optimize the R10 assembly configuration.

The key parameters you will have to experiment with are:

The R10 configuration uses --Align.minAlignedMarkerCount 1000 --Align.minAlignedFraction 0.85. You will have to experiment decreasing both of them. I suggest starting with --Align.minAlignedMarkerCount 600 and leaving --Align.minAlignedFraction unchanged. If you decide to also change --Align.minAlignedFraction, do it slowly (0.80 could be a reasonable starting point). Keep in mind the values you specify on the command line override the values specified by the R10 assembly configuration.

As you loosen alignment criteria you will see that the number of "good" alignments reported in AssemblySummary.html will increase and the fraction of isolated reads will decrease. As a result, the N50 of the assembly should increase. However, if you loosen alignment criteria too much, the N50 will start increasing again as the assembly turns from "fragmented" to "messy" (you can see that by loading the gfa file in Bandage).

From looking at LowHashBucketHistogram.csv I also suspect it may be also better to use --MinHash.minBucketSize 20 --MinHash.maxBucketSize 60. To see this, do a scatter plot of FeatureCount versus BucketSize from the data in LowHashBucketHistogram.csv (you will need to manually adjust the scales for both axes). However this will increase the number of alignment candidates and therefore assembly time. For that reason it might also be good to decrease --MinHash.minHashIterationCount from 100 to 50 or 20.

After you do some experimentation along these lines, let me know how it goes, and depending on the results I can perhaps suggest additional changes, if needed.

sivico26 commented 1 year ago

Excellent, I will start to experiment along these lines and let you know how it goes.

sivico26 commented 1 year ago

Dear @paoloshasta ,

I have been doing some assemblies and experimentation following your advice and this is what I have got so far (In summary):

n       n:500   L50     min     N75     N50     N25     E-size  max     sum     name
10164   9838    971     501     567918  1207213 2200885 1636381 10.43e6 4.216e9 Third_plants
11114   11075   1441    513     469621  934056  1674626 1339778 8539323 4.846e9 Fourth_May2022
77730   74132   7355    500     77320   171785  360161  277870  3196056 5.016e9 Fifth
20066   18619   1045    500     522640  1115534 2104084 1583193 12.03e6 4.274e9 Sixth
48482   40968   1704    500     329463  695838  1255204 887799  5547829 4.111e9 Seventh
54466   45789   2022    500     274672  579421  1034506 747930  4890758 4.083e9 Eighth
46001   38311   1550    500     365452  759775  1343173 975822  5458168 4.067e9 Ninth_frac06
44227   37145   1417    500     396607  821962  1519136 1067834 5458768 4.101e9 Ninth_frac07
41756   35245   1438    500     402163  829994  1483878 1080607 6101193 4.177e9 Tenth_aln500
38992   32734   1222    500     465272  961713  1738328 1277455 6828785 4.163e9 Tenth_aln700
40174   34004   1315    500     441652  908114  1614005 1191267 6796987 4.167e9 Tenth_linear
33947   29400   1281    500     439787  916383  1623773 1185868 7335015 4.082e9 Tenth_tight

As you can see, I have made some progress with the FastR10 configuration, but still have not managed to beat the original Plants2021 configuration.

The following is a summary table of the parametrizations I have used. Unless specified otherwise in the Params column, all the assemblies used (except the Third and Fourth, which used the default for their configs).

I chose to modify the Bucket size range to be more inclusive after looking at LowHashBucketHistogram.csv (I later discovered that this might have been detrimental, see below).

The table:

<!DOCTYPE html>

Name Config E-size (kb) Non isolated reads (%) Params Files
Third Plants2021 1636 97% --Assembly.consensusCaller Modal Third_plants.zip
Fourth May2022 1339 80%   Fourth_May2022.zip
Fifth FastR10 277 52% --Align.minAlignedMarkerCount 600 Fifth.zip
Sixth FastR10 1583 78% --Align.minAlignedMarkerCount 600 --Align.minAlignedFraction 0.8 Sixth.zip
Seventh FastR10 887 92% --Align.minAlignedMarkerCount 400 --Align.minAlignedFraction 0.7 Seventh.zip
Eight FastR10 747 97% --Align.minAlignedMarkerCount 400 --Align.minAlignedFraction 0.5 Eighth.zip
Ninth-frac06 FastR10 975 95% --Align.minAlignedMarkerCount 600 --Align.minAlignedFraction 0.6 Ninth_frac06.zip
Ninth-frac07 FastR10 1067 91.86% --Align.minAlignedMarkerCount 600 --Align.minAlignedFraction 0.7 Ninth_frac07.zip
Tenth Linear FastR10 1119 88.2% --Align.minAlignedMarkerCount 600 --Align.minAlignedFraction 0.75 tenth_linear.zip
Tenth Tight FastR10 1185 91.6% --Align.minAlignedMarkerCount 600 --Align.minAlignedFraction 0.7 --MinHash.minBucketSize 20 --MinHash.maxBucketSize 60 tenth_tight.zip
Tenth aln700 FastR10 1080 87.5% --Align.minAlignedMarkerCount 700 --Align.minAlignedFraction 0.75 tenth_aln700.zip
Tenth aln500 FastR10 1277 88.7% --Align.minAlignedMarkerCount 500 --Align.minAlignedFraction 0.75 tenth_aln500.zip

From the above, I have observed the following:

Now, I also think that I am missing another part of the picture. Given that improving the Non-isolated fraction of reads/bases, does not necessarily translates into a better assembly (e.g. Sixth vs Tenth_linear). This was the main metric I was optimizing at the very start, but now I wanted to ask you which other metric could be a target to optimize. Or given these preliminary results, what direction would you take next?

I got the feeling that I have to optimize the quality of the alignments that end-up in the graph, but this seems hard to measure. The Number of good alignments in the graph feels like a proxy to it but I found it to be a bit erratic and harder to appreciate its relationship with quality of assembly (as aforementioned). The question becomes then, what metric does?

I was also wondering about other parameters that could influence those alignments, particularly maxSkip, maxDrift, and maxTrim, since those also differ significantly between Plants2021, May2022 and FastR10 configurations, being FastR10 more strict if understand correctly. Do you think it is worth to play with them? Can they have particularly large effects on the assembly? Or their role is minor/ not worth it?

I would greatly appreciate you comments at this point since I feel I am stagnating now. I was hoping we can boost the assembly quality another order of magnitude. Do you think this is realistic?

Kind regards

paoloshasta commented 1 year ago

I would expect that you can improve assembly contiguity significantly, unless this genome is hopeless from the point of view of repeat content.

I think the situation you are describing is what I summarized in this paragraph in my last post:

As you loosen alignment criteria you will see that the number of "good" alignments reported in AssemblySummary.html will increase and the fraction of isolated reads will decrease. As a result, the N50 of the assembly should increase. However, if you loosen alignment criteria too much, the N50 will start increasing again as the assembly turns from "fragmented" to "messy" (you can see that by loading the gfa file in Bandage).

And you rephrased that in your bullet above:

There is something about the quality of the alignments on the graph that clearly helps the assembly going from lower values to higher values of the Number of good alignments kept in the read graph metric. However, at some point having more alignments on the graph is detrimental for the assembly (maybe it ends up over-complicating the graph?).

Unfortunately I don't know a good metric to measure whether the assembly graph is "fragmented" to "messy". Many of the metrics that people use for this purpose are excessively affected by very short contigs, which are inessential. Instead, I just load the assembly in Bandage and look at it. If the alignment criteria are too strict, you will see that most contigs are short and isolated. If the alignment criteria are too loose, you get a bit messy blob. Usually there is a reasonable point in between that also optimizes the N50 for the assembly.

So I suggest that you next load in Bandage some of the assembly results you got. For this purpose, I suggest using Assembly-BothStrands-NoSequence.gfa. This file is small because it does not contain any sequence, and so it generally loads reasonably quickly in Bandage. Keep in mind that the file contains both strands (so the assembly is in a sense duplicated), and that all lengths are expressed in markers. If you like you can post a couple of those files here and I can take a look too.

If this does not clarify the situation, Shasta also contains functionality to look at the details of the assembly, including the alignments, etc. This functionality is described briefly here, and I can help you get started using if you decide to do that (the documentation is not extensive, unfortunately, and probably insufficient). I use this functionality extensively both for debugging Shasta and to investigate assembly results to optimize assembly options. Unfortunately to use this functionality you have to save binary data from the assembly, which are a large dataset, particularly for a large genome like in your case.

sivico26 commented 1 year ago

Thanks again for the quick reply @paoloshasta .

I will have a look at the different graphs to try to appreciate the transition between fragmented and messy you describe. In case you want to have a look too, all the files summarizing each assembly (the five ones you requested originally) are attached as a compressed file in the column files of the table (including the Assembly-BothStrands-NoSequence.gfa).

I will come back to you when I get a better insight.

sivico26 commented 1 year ago

All right, after having a look, most of them are clearly messy, and would benefit from making the alignments more strict. There is only one that is clearly disconnected.

However, there is one that is particularly puzzling to me: Screenshot_20230223_093739

This one looks messy (although not as bad as the others) but disconnected at the same time. This is the Sixth assembly on the table above. I want to build over this one but not entirely sure which way to take here. Thoughts?

paoloshasta commented 1 year ago

Ah sorry I had missed the column with the files in your post from yesterday, and I just saw today's post.

So, Sixth is the best of the assemblies that use the FastR10 configuration, and it is mostly messy - in the sense that most of the sequence is in the messy connected component. So, as you already concluded, you need to tighten alignment criteria.

On the other hand, the Fifth assembly is highly fragmented. The only difference with Sixth is that it uses --Align.minAlignedFraction 0.85 (inherited from the FastR10 configuration). As a result, Fifth has only 31 million good alignments, versus 100 million for Sixth. And half of the reads end up being isolated in the read graph. So this tells us that --Align.minAlignedFraction 0.85 is too tight because it results in eliminating most of the alignments.

This also tells us that your reads are less accurate than the human reads I used to optimize the FastR10 configuration. This is not too surprising as I have often seen in the past that plant nanopore reads tend to be less accurate than their human counterparts.

Given this, you should stay with --Align.minAlignedFraction 0.8 like in Sixth, but tighten alignment criteria using --Align.minAlignedMarkerCount. The FastR10 configuration uses --Align.minAlignedMarkerCount 1000, and so I would try with values somewhere in the interval [600,1000] (or even more, given the high repeat content of the genome you are working with). This should have the effect of reducing the size of the large connected component in the assembly, which will be broken into fragmented portions. As this is done, the assembly N50 should hit an optimal point. Your assemblies run in a reasonable number of hours, so it should not be too bad to do a few more iterations along these lines.

sivico26 commented 1 year ago

Dear @paoloshasta,

After making the grid of --Align.minAlignedMarkerCount with the values you suggested, this is the summary of the results:

<!DOCTYPE html>

Name Config E-size (kb) Non isolated reads (%) Params Graph comment files
Twelfth aln700 FastR10 1556 78.6% --Align.minAlignedMarkerCount 700 --Align.minAlignedFraction 0.8 Flurry ball as well twelveth_aln700.zip
Twelfth aln800 FastR10 1600 77.5% --Align.minAlignedMarkerCount 800 --Align.minAlignedFraction 0.8 Flurry ball as well twelveth_aln800.zip
Twelfth aln900 FastR10 1711 74.1% --Align.minAlignedMarkerCount 900 --Align.minAlignedFraction 0.8 A bit clearer than the previous, a bit disconnected twelveth_aln900.zip
Twelfth aln950 FastR10 1689 71.3% --Align.minAlignedMarkerCount 950 --Align.minAlignedFraction 0.8 No flurry ball, but connection can clearly improve twelveth_aln950.zip
Twelfth aln1000 FastR10 1596 68.5% --Align.minAlignedMarkerCount 1000 --Align.minAlignedFraction 0.8 Now the disconnection is stronger twelveth_aln1000.zip
Twelfth aln900frac078 FastR10 1666 78.6% --Align.minAlignedMarkerCount 900 --Align.minAlignedFraction 0.8 Flurry ball came back twelveth_aln900frac078.zip

The optimal you mention is around 900 (Given that at 1000, the graph is already disconnected). However, I hope to achieve better results. Do you have any idea how to break this tradeoff with this configuration? Is another parameter worth trying?

We knew this genome was very repetitive, but the struggles so far seem to indicate that there is a big proportion of very large repeats (>25kb), would you agree? Is there a parameter that would help especially in resolving repeats?

I could not resist and I did try to tweak from another configuration, May2022 to be precise. Seems like the optimal point is better than with FastR10. But there is not much room for tweaking (at least in the alignment settings) given the strategy of the configuration (which is to be very alignment inclusive to filter later).

These are the results I have had:

<!DOCTYPE html> Name Config E-size (kb) Non isolated reads (%) Params Graph comment files
Thirteenth aln700 May2022 1966 82% --Align.minAlignedMarkerCount 700 Not so flurry ball, more disconnected thirteenth_May_aln700.zip
Thirteenth aln800 May2022 1745 74% --Align.minAlignedMarkerCount 800 Completely disconnected thirteenth_May_aln800.zip
Thirteenth aln900 May2022 1300 64.7% --Align.minAlignedMarkerCount 900 Completely disconnected thirteenth_May_aln900.zip
Thirteenth aln600wide May2022 1861 91.1% --Align.minAlignedMarkerCount 600 --MinHash.minHashIterationCount 100 --MinHash.minBucketSize 15 --MinHash.maxBucketSize 70 Very flurry ball thirteenth_May_aln600wide.zip
Thirteenth aln600tight May2022 1882 91.1% --Align.minAlignedMarkerCount 600 --MinHash.minHashIterationCount 100 --MinHash.minBucketSize 20 --MinHash.maxBucketSize 60 Very flurry ball thirteenth_May_aln600tigth.zip

(Remember that I usually add --MinHash.minBucketSize 15 and --MinHash.maxBucketSize 70 unless otherwise specified).

As you can see, there are several that match or surpass the best FastR10 assembly. Besides the aforementioned "strategy", the main difference seems to be the RLE encoding, which seems to help. Can you remind me why RLE mode was not considered the way to go? Might it be worth reconsideration?

As always, your thoughts are kindly appreciated. Cheers

paoloshasta commented 1 year ago

Yes, the Twelfth-aln sequence is about flat in terms of contiguity, which must be limited by the repeat content of the genome you are working on. I agree with your assessment that there must be a high content of long repeats. I am working on new assembly methods for Shasta that should significantly improve assemblies in hard/repetitive regions, but it will be some time before that work becomes usable.

Incidentally, I noticed that in your Twelfth-aln series the assembly N50 as reported by Shasta decreases steadily from 1.116 Mb at --Align.minAlignedMarkerCount 600 to 0.922 Mb at --Align.minAlignedMarkerCount 1000, and does not have an optimal point in that range. This disagrees with the E-size metric you use, which does show an optimal point. However it is true that both measures show little change in contiguity.

My conclusion that it is better not to use RLE for R10 was obtained for human genomes, and it is entirely possible that it does not hold in your case. Optimizing starting from the R9 May2022 configuration then makes perfect sense. However that configuration uses --ReadGraph.creationMethod 2, which does automatic selection of alignment criteria. So the Thirteenth series of assemblies did not do what you intended, as you can see from the section entitled Alignment criteria actually used for creation of the read graph in AssemblySummary.html.

If you want to optimize the May2022 configuration for your situation, you need to keep the following in mind:

paoloshasta commented 1 year ago

I am closing this due to lack of discussion. Please reopen it or create a new issue if additional discussion topics arise.