I agree it is likely that this assembly can be improved, probably significantly, with some optimization of assembly parameters. You need an R10 version of `Nanopore-Plants-Apr2021`, and I will be happy to help you develop a version that works well for the genome you are interested in. If things work out well and you are willing to share, we could turn the end result of this process into a new Shasta built-in configuration.
First I have a couple of questions for you:

- Are these "fast" or "slow" mode R10 reads? If I remember correctly, "fast" means 400 bases per second and "slow" means 220 bases per second. I think most people are using "fast" mode, and it is possible that "slow" mode is being phased out by ONT, but I really don't know about that.
- Did you use Guppy to basecall these reads? If so, did you use a configuration with "super" accuracy? The Shasta R10 assembly configurations assume that this is the case. If you did not use "super" accuracy, I strongly suggest rerunning the base caller, because that is the way to get the full accuracy of R10.
Under the assumption that the reads are "fast" mode reads basecalled with "super" accuracy, I think it is best to optimize the assembly starting from the `Nanopore-R10-Fast-Nov2022` assembly rather than the `Nanopore-Plants-Apr2021` assembly, even though the latter is currently giving a much better result. The reason is that, because R10 reads are so much more accurate, the R10 configurations don't use RLE (Run-Length Encoding, aka homopolymer compression) to represent the reads. And you don't want to be stuck with an assembly configuration that uses RLE for R10 reads, because from my experimentation this is clearly not the right thing to do.
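For intuition, this is what homopolymer compression does conceptually (an illustration of the idea only, not Shasta's internal implementation): each run of identical bases collapses to a single base, with the repeat counts tracked separately.

```sh
# Homopolymer-compress a sequence: GGGAATTTT -> GAT (repeat counts 3, 2, 4).
echo "GGGAATTTT" | sed -E 's/(.)\1+/\1/g'
```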
So, under the above assumptions for the reads, I suggest that you post the following files from the `Nanopore-R10-Fast-Nov2022` assembly:

- `AssemblySummary.html`
- `stdout.log`
- `Assembly-BothStrands-NoSequence.gfa`
- `LowHashBucketHistogram.csv`
- `DisjointSetsHistogram.csv`
With this I will have a better idea of what is going on in that assembly and I will be able to make suggestions for improvements.
The R10 configurations use `--Assembly.consensusCaller Modal` because a consensus caller for repeat counts is not needed when not using the RLE representation. Using it with the R9 configuration, which uses RLE, is not a good idea, as it can reduce the quality of the assembled sequence. However, this has no effect on the large-scale metrics of the assembly you are currently looking at.
Thank you for the nice comments on Shasta's speed. As explained in the documentation section on computational methods, it is mostly due to using a compact representation of the reads (marker representation) during most assembly phases.
Dear @paoloshasta,
Thanks for the fast reply and your support. Regarding your questions:
> Are these "fast" or "slow" mode R10 reads? If I remember correctly, "fast" means 400 bases per second and "slow" means 220 bases per second. I think most people are using "fast" mode, and it is possible that "slow" mode is being phased out by ONT, but I really don't know about that.
Yes, I can confirm to you that the "Slow" mode of sequencing has been phased out by Nanopore.
> Did you use Guppy to basecall these reads? If so, did you use a configuration with "super" accuracy? The Shasta R10 assembly configurations assume that this is the case. If you did not use "super" accuracy, I strongly suggest rerunning the base caller, because that is the way to get the full accuracy of R10.
Yes, these are R10 reads basecalled with Guppy using the super-accuracy model.

If I understood correctly, the assembly made with the Plants configuration (using RLE) might have ended up with a consensus sequence that is probably not adequate/reliable, is that right?
Anyway, the following are the requested files for the FastR10 run:
Let me know if you need more information. Cheers
> If I understood correctly, the assembly made with the Plants configuration (using RLE) might have ended up with a consensus sequence that is probably not adequate/reliable, is that right?
That consensus sequence should have the typical accuracy of an R9 assembly or possibly better, but it would be hard to use that assembly as a starting point to improve the N50.
From the information you posted for the `Nanopore-R10-Fast-Nov2022` assembly, I see that 58% of the reads are isolated in the read graph (see under "Read Graph" in `AssemblySummary.html`). The reason is that the read graph used only 18 million alignments for 11 million reads. In a healthy assembly, the typical ratio of alignments per read is around 5 or so, and the fraction of isolated reads is around 20-30%. As a result, the assembly is highly fragmented, as I was able to confirm by loading the gfa file in Bandage. This is a common result for assemblies with insufficient numbers of alignments.
To improve this situation you will need to loosen the alignment criteria. This means that your reads are probably less accurate than the reads (from a human genome) I used to optimize the R10 assembly configuration. This is not too surprising, because we already know from R9 that plant reads tend to be less accurate than human reads - hence the need for assembly configurations optimized for plant genomes. In addition, your reads may be a bit shorter than the ones I used to optimize the R10 assembly configuration.
The key parameters you will have to experiment with are:

- `--Align.minAlignedMarkerCount`, which controls the minimum length (in markers) for an alignment between two reads to be used in the assembly.
- `--Align.minAlignedFraction`, which controls the minimum acceptable fraction of aligned markers in an alignment. That is, it controls the alignment accuracy.

The R10 configuration uses `--Align.minAlignedMarkerCount 1000 --Align.minAlignedFraction 0.85`. You will have to experiment with decreasing both of them. I suggest starting with `--Align.minAlignedMarkerCount 600` and leaving `--Align.minAlignedFraction` unchanged. If you decide to also change `--Align.minAlignedFraction`, do it slowly (0.80 could be a reasonable starting point). Keep in mind that the values you specify on the command line override the values specified by the R10 assembly configuration, as in the example below.
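For concreteness, a hedged sketch of such a command line (the input file name and output directory are placeholders; the other options come from the suggestions above):

```sh
shasta \
    --input reads.fasta \
    --config Nanopore-R10-Fast-Nov2022 \
    --Align.minAlignedMarkerCount 600 \
    --assemblyDirectory ShastaRun-aln600
```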
As you loosen the alignment criteria you will see that the number of "good" alignments reported in `AssemblySummary.html` will increase and the fraction of isolated reads will decrease. As a result, the N50 of the assembly should increase. However, if you loosen the alignment criteria too much, the N50 will start decreasing as the assembly turns from "fragmented" to "messy" (you can see that by loading the gfa file in Bandage).
From looking at `LowHashBucketHistogram.csv`, I also suspect it may be better to use `--MinHash.minBucketSize 20 --MinHash.maxBucketSize 60`. To see this, do a scatter plot of `FeatureCount` versus `BucketSize` from the data in `LowHashBucketHistogram.csv` (you will need to manually adjust the scales for both axes); a sketch of one way to do this follows below. However, this will increase the number of alignment candidates and therefore the assembly time. For that reason it might also be good to decrease `--MinHash.minHashIterationCount` from 100 to 50 or 20.
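A minimal plotting sketch, assuming `BucketSize` and `FeatureCount` are the first two columns of `LowHashBucketHistogram.csv` (adjust the column indices and axis ranges to match your file):

```sh
gnuplot <<'EOF'
set datafile separator ","
set terminal png size 800,600
set output "LowHashBucketHistogram.png"
set xlabel "BucketSize"
set ylabel "FeatureCount"
set xrange [0:100]    # manual scale adjustment, per the suggestion above
set logscale y
plot "LowHashBucketHistogram.csv" using 1:2 with points notitle
EOF
```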
After you do some experimentation along these lines, let me know how it goes, and depending on the results I can perhaps suggest additional changes, if needed.
Excellent, I will start to experiment along these lines and let you know how it goes.
Dear @paoloshasta ,
I have been doing some assemblies and experimentation following your advice, and this is what I have got so far (in summary):
| n | n:500 | L50 | min | N75 | N50 | N25 | E-size | max | sum | name |
|---|---|---|---|---|---|---|---|---|---|---|
| 10164 | 9838 | 971 | 501 | 567918 | 1207213 | 2200885 | 1636381 | 10.43e6 | 4.216e9 | Third_plants |
| 11114 | 11075 | 1441 | 513 | 469621 | 934056 | 1674626 | 1339778 | 8539323 | 4.846e9 | Fourth_May2022 |
| 77730 | 74132 | 7355 | 500 | 77320 | 171785 | 360161 | 277870 | 3196056 | 5.016e9 | Fifth |
| 20066 | 18619 | 1045 | 500 | 522640 | 1115534 | 2104084 | 1583193 | 12.03e6 | 4.274e9 | Sixth |
| 48482 | 40968 | 1704 | 500 | 329463 | 695838 | 1255204 | 887799 | 5547829 | 4.111e9 | Seventh |
| 54466 | 45789 | 2022 | 500 | 274672 | 579421 | 1034506 | 747930 | 4890758 | 4.083e9 | Eighth |
| 46001 | 38311 | 1550 | 500 | 365452 | 759775 | 1343173 | 975822 | 5458168 | 4.067e9 | Ninth_frac06 |
| 44227 | 37145 | 1417 | 500 | 396607 | 821962 | 1519136 | 1067834 | 5458768 | 4.101e9 | Ninth_frac07 |
| 41756 | 35245 | 1438 | 500 | 402163 | 829994 | 1483878 | 1080607 | 6101193 | 4.177e9 | Tenth_aln500 |
| 38992 | 32734 | 1222 | 500 | 465272 | 961713 | 1738328 | 1277455 | 6828785 | 4.163e9 | Tenth_aln700 |
| 40174 | 34004 | 1315 | 500 | 441652 | 908114 | 1614005 | 1191267 | 6796987 | 4.167e9 | Tenth_linear |
| 33947 | 29400 | 1281 | 500 | 439787 | 916383 | 1623773 | 1185868 | 7335015 | 4.082e9 | Tenth_tight |
As you can see, I have made some progress with the `FastR10` configuration, but I still have not managed to beat the original `Plants2021` configuration.

The following is a summary table of the parametrizations I have used. Unless specified otherwise in the Params column, all the assemblies used the following (except the Third and Fourth, which used the defaults for their configs):
- `--MinHash.minBucketSize 15`
- `--MinHash.maxBucketSize 70`
- `--MinHash.minHashIterationCount 50`
I chose to modify the bucket size range to be more inclusive after looking at `LowHashBucketHistogram.csv` (I later discovered that this might have been detrimental; see below).
The table:
| Name | Config | E-size (kb) | Non-isolated reads (%) | Params | Files |
|---|---|---|---|---|---|
| Third | Plants2021 | 1636 | 97% | `--Assembly.consensusCaller Modal` | Third_plants.zip |
| Fourth | May2022 | 1339 | 80% | | Fourth_May2022.zip |
| Fifth | FastR10 | 277 | 52% | `--Align.minAlignedMarkerCount 600` | Fifth.zip |
| Sixth | FastR10 | 1583 | 78% | `--Align.minAlignedMarkerCount 600 --Align.minAlignedFraction 0.8` | Sixth.zip |
| Seventh | FastR10 | 887 | 92% | `--Align.minAlignedMarkerCount 400 --Align.minAlignedFraction 0.7` | Seventh.zip |
| Eighth | FastR10 | 747 | 97% | `--Align.minAlignedMarkerCount 400 --Align.minAlignedFraction 0.5` | Eighth.zip |
| Ninth-frac06 | FastR10 | 975 | 95% | `--Align.minAlignedMarkerCount 600 --Align.minAlignedFraction 0.6` | Ninth_frac06.zip |
| Ninth-frac07 | FastR10 | 1067 | 91.86% | `--Align.minAlignedMarkerCount 600 --Align.minAlignedFraction 0.7` | Ninth_frac07.zip |
| Tenth Linear | FastR10 | 1119 | 88.2% | `--Align.minAlignedMarkerCount 600 --Align.minAlignedFraction 0.75` | tenth_linear.zip |
| Tenth Tight | FastR10 | 1185 | 91.6% | `--Align.minAlignedMarkerCount 600 --Align.minAlignedFraction 0.7 --MinHash.minBucketSize 20 --MinHash.maxBucketSize 60` | tenth_tight.zip |
| Tenth aln700 | FastR10 | 1080 | 87.5% | `--Align.minAlignedMarkerCount 700 --Align.minAlignedFraction 0.75` | tenth_aln700.zip |
| Tenth aln500 | FastR10 | 1277 | 88.7% | `--Align.minAlignedMarkerCount 500 --Align.minAlignedFraction 0.75` | tenth_aln500.zip |
From the above, I have observed the following:

- The alignment criteria (particularly `--Align.minAlignedFraction`) have a strong effect on the assembly.
- There is something about the quality of the alignments on the graph that clearly helps the assembly going from lower values to higher values of the `Number of good alignments kept in the read graph` metric. However, at some point having more alignments on the graph is detrimental for the assembly (maybe it ends up over-complicating the graph?).
- I seem to have reached a plateau with the `Align` parameters (they are giving very similar results).

Now, I also think that I am missing another part of the picture, given that improving the non-isolated fraction of reads/bases does not necessarily translate into a better assembly (e.g. Sixth vs Tenth_linear). This was the main metric I was optimizing at the very start, but now I want to ask which other metric could be a target to optimize. Or, given these preliminary results, what direction would you take next?
I got the feeling that I have to optimize the quality of the alignments that end up in the graph, but this seems hard to measure. The `Number of good alignments kept in the read graph` metric feels like a proxy for it, but I found it to be a bit erratic, and it is hard to appreciate its relationship with assembly quality (as mentioned above). The question then becomes: what metric does measure this?
I was also wondering about other parameters that could influence those alignments, particularly `maxSkip`, `maxDrift`, and `maxTrim`, since those also differ significantly between the `Plants2021`, `May2022`, and `FastR10` configurations, with `FastR10` being stricter, if I understand correctly. Do you think it is worth playing with them? Can they have particularly large effects on the assembly? Or is their role minor and not worth it?
I would greatly appreciate your comments at this point, since I feel I am stagnating now. I was hoping we could boost the assembly quality another order of magnitude. Do you think this is realistic?
Kind regards
I would expect that you can improve assembly contiguity significantly, unless this genome is hopeless from the point of view of repeat content.
I think the situation you are describing is what I summarized in this paragraph in my last post:
> As you loosen the alignment criteria you will see that the number of "good" alignments reported in `AssemblySummary.html` will increase and the fraction of isolated reads will decrease. As a result, the N50 of the assembly should increase. However, if you loosen the alignment criteria too much, the N50 will start decreasing as the assembly turns from "fragmented" to "messy" (you can see that by loading the gfa file in Bandage).
And you rephrased that in your bullet above:
> There is something about the quality of the alignments on the graph that clearly helps the assembly going from lower values to higher values of the `Number of good alignments kept in the read graph` metric. However, at some point having more alignments on the graph is detrimental for the assembly (maybe it ends up over-complicating the graph?).
Unfortunately, I don't know of a good metric to measure whether the assembly graph is "fragmented" or "messy". Many of the metrics that people use for this purpose are excessively affected by very short contigs, which are inessential. Instead, I just load the assembly in Bandage and look at it. If the alignment criteria are too strict, you will see that most contigs are short and isolated. If the alignment criteria are too loose, you get a big messy blob. Usually there is a reasonable point in between that also optimizes the N50 of the assembly.
So I suggest that you next load some of the assembly results you got in Bandage. For this purpose, I suggest using `Assembly-BothStrands-NoSequence.gfa`. This file is small because it does not contain any sequence, and so it generally loads reasonably quickly in Bandage. Keep in mind that the file contains both strands (so the assembly is in a sense duplicated), and that all lengths are expressed in markers. If you like, you can post a couple of those files here and I can take a look too.
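Incidentally, if you prefer the command line, Bandage can load the graph directly (a small convenience, assuming a Bandage build with CLI support is on your PATH):

```sh
# Open the assembly graph in the Bandage GUI directly from the shell.
Bandage load Assembly-BothStrands-NoSequence.gfa --draw
```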
If this does not clarify the situation, Shasta also contains functionality to look at the details of the assembly, including the alignments, etc. This functionality is described briefly here, and I can help you get started using it if you decide to do that (the documentation is not extensive, unfortunately, and probably insufficient). I use this functionality extensively, both for debugging Shasta and for investigating assembly results to optimize assembly options. Unfortunately, to use this functionality you have to save binary data from the assembly, which is a large dataset, particularly for a large genome like in your case.
Thanks again for the quick reply @paoloshasta .
I will have a look at the different graphs to try to appreciate the transition between fragmented and messy that you describe. In case you want to have a look too, all the files summarizing each assembly (the five you requested originally) are attached as a compressed file in the Files column of the table (including the `Assembly-BothStrands-NoSequence.gfa`).
I will come back to you when I get a better insight.
All right, after having a look, most of them are clearly messy and would benefit from making the alignments stricter. There is only one that is clearly disconnected.
However, there is one that is particularly puzzling to me: it looks messy (although not as bad as the others) but disconnected at the same time. This is the `Sixth` assembly in the table above. I want to build on this one, but I am not entirely sure which way to take here. Thoughts?
Ah, sorry, I had missed the column with the files in your post from yesterday, and I just saw today's post.
So, `Sixth` is the best of the assemblies that use the `FastR10` configuration, and it is mostly messy - in the sense that most of the sequence is in the messy connected component. So, as you already concluded, you need to tighten the alignment criteria.

On the other hand, the `Fifth` assembly is highly fragmented. The only difference with `Sixth` is that it uses `--Align.minAlignedFraction 0.85` (inherited from the `FastR10` configuration). As a result, `Fifth` has only 31 million good alignments, versus 100 million for `Sixth`, and half of the reads end up being isolated in the read graph. So this tells us that `--Align.minAlignedFraction 0.85` is too tight, because it results in eliminating most of the alignments.
This also tells us that your reads are less accurate than the human reads I used to optimize the `FastR10` configuration. This is not too surprising, as I have often seen in the past that plant nanopore reads tend to be less accurate than their human counterparts.
Given this, you should stay with `--Align.minAlignedFraction 0.8` as in `Sixth`, but tighten the alignment criteria using `--Align.minAlignedMarkerCount`. The `FastR10` configuration uses `--Align.minAlignedMarkerCount 1000`, and so I would try values somewhere in the interval [600, 1000] (or even more, given the high repeat content of the genome you are working with). This should have the effect of reducing the size of the large connected component in the assembly, which will be broken into fragmented portions. As this is done, the assembly N50 should hit an optimal point. Your assemblies run in a reasonable number of hours, so it should not be too bad to do a few more iterations along these lines, for example with a sweep like the one sketched below.
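A hedged sketch of such a sweep (the input file and output directory naming are placeholders; the fixed fraction and the sweep values are the ones discussed above):

```sh
# Try several values of --Align.minAlignedMarkerCount at a fixed
# --Align.minAlignedFraction of 0.8, one assembly directory per value.
for count in 600 700 800 900 1000; do
    shasta \
        --input reads.fasta \
        --config Nanopore-R10-Fast-Nov2022 \
        --Align.minAlignedFraction 0.8 \
        --Align.minAlignedMarkerCount "$count" \
        --assemblyDirectory "ShastaRun-aln$count"
done
```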
Dear @paoloshasta,
After making the grid of `--Align.minAlignedMarkerCount` runs with the values you suggested, this is the summary of the results:
| Name | Config | E-size (kb) | Non-isolated reads (%) | Params | Graph comment | Files |
|---|---|---|---|---|---|---|
| Twelfth aln700 | FastR10 | 1556 | 78.6% | `--Align.minAlignedMarkerCount 700 --Align.minAlignedFraction 0.8` | Flurry ball as well | twelveth_aln700.zip |
| Twelfth aln800 | FastR10 | 1600 | 77.5% | `--Align.minAlignedMarkerCount 800 --Align.minAlignedFraction 0.8` | Flurry ball as well | twelveth_aln800.zip |
| Twelfth aln900 | FastR10 | 1711 | 74.1% | `--Align.minAlignedMarkerCount 900 --Align.minAlignedFraction 0.8` | A bit clearer than the previous, a bit disconnected | twelveth_aln900.zip |
| Twelfth aln950 | FastR10 | 1689 | 71.3% | `--Align.minAlignedMarkerCount 950 --Align.minAlignedFraction 0.8` | No flurry ball, but connection can clearly improve | twelveth_aln950.zip |
| Twelfth aln1000 | FastR10 | 1596 | 68.5% | `--Align.minAlignedMarkerCount 1000 --Align.minAlignedFraction 0.8` | Now the disconnection is stronger | twelveth_aln1000.zip |
| Twelfth aln900frac078 | FastR10 | 1666 | 78.6% | `--Align.minAlignedMarkerCount 900 --Align.minAlignedFraction 0.78` | Flurry ball came back | twelveth_aln900frac078.zip |
The optimum you mention is around 900 (given that at 1000 the graph is already disconnected). However, I hope to achieve better results. Do you have any idea how to break this tradeoff with this configuration? Is another parameter worth trying?

We knew this genome was very repetitive, but the struggles so far seem to indicate that there is a big proportion of very large repeats (>25 kb); would you agree? Is there a parameter that would help especially with resolving repeats?
I could not resist and also tried tweaking another configuration, `May2022` to be precise. It seems the optimal point is better than with `FastR10`, but there is not much room for tweaking (at least in the alignment settings) given the strategy of that configuration (which is to be very inclusive with alignments and filter later).
These are the results I have had:
| Name | Config | E-size (kb) | Non-isolated reads (%) | Params | Graph comment | Files |
|---|---|---|---|---|---|---|
| Thirteenth aln700 | May2022 | 1966 | 82% | `--Align.minAlignedMarkerCount 700` | Not so flurry ball, more disconnected | thirteenth_May_aln700.zip |
| Thirteenth aln800 | May2022 | 1745 | 74% | `--Align.minAlignedMarkerCount 800` | Completely disconnected | thirteenth_May_aln800.zip |
| Thirteenth aln900 | May2022 | 1300 | 64.7% | `--Align.minAlignedMarkerCount 900` | Completely disconnected | thirteenth_May_aln900.zip |
| Thirteenth aln600wide | May2022 | 1861 | 91.1% | `--Align.minAlignedMarkerCount 600 --MinHash.minHashIterationCount 100 --MinHash.minBucketSize 15 --MinHash.maxBucketSize 70` | Very flurry ball | thirteenth_May_aln600wide.zip |
| Thirteenth aln600tight | May2022 | 1882 | 91.1% | `--Align.minAlignedMarkerCount 600 --MinHash.minHashIterationCount 100 --MinHash.minBucketSize 20 --MinHash.maxBucketSize 60` | Very flurry ball | thirteenth_May_aln600tigth.zip |
(Remember that I usually add `--MinHash.minBucketSize 15` and `--MinHash.maxBucketSize 70` unless otherwise specified.)
As you can see, several of these match or surpass the best `FastR10` assembly. Besides the aforementioned "strategy", the main difference seems to be RLE, which seems to help. Can you remind me why RLE mode was not considered the way to go? Might it be worth reconsidering?
As always, your thoughts are kindly appreciated. Cheers
Yes, the `Twelfth-aln` series is about flat in terms of contiguity, which must be limited by the repeat content of the genome you are working on. I agree with your assessment that there must be a high content of long repeats. I am working on new assembly methods for Shasta that should significantly improve assemblies in hard/repetitive regions, but it will be some time before that work becomes usable.

Incidentally, I noticed that in your `Twelfth-aln` series the assembly N50 as reported by Shasta decreases steadily, from 1.116 Mb at `--Align.minAlignedMarkerCount 600` to 0.922 Mb at `--Align.minAlignedMarkerCount 1000`, and does not have an optimal point in that range. This disagrees with the `E-size` metric you use, which does show an optimal point. However, it is true that both measures show little change in contiguity.
My conclusion that it is better not to use RLE for R10 was obtained for human genomes, and it is entirely possible that it does not hold in your case. Optimizing starting from the R9 `May2022` configuration then makes perfect sense. However, that configuration uses `--ReadGraph.creationMethod 2`, which does automatic selection of alignment criteria. So the `Thirteenth` series of assemblies did not do what you intended, as you can see from the section entitled "Alignment criteria actually used for creation of the read graph" in `AssemblySummary.html`.
If you want to optimize the `May2022` configuration for your situation, you need to keep the following in mind (a sketch of such a run appears after this list):

- Use `--ReadGraph.creationMethod 0` to make sure the assembly actually uses the alignment criteria you specify.
- Use a `--Align.minAlignedFraction` greater than 0.8 (this will require some optimization). Using RLE will improve the quality of alignments in marker space, so the optimal value of `--Align.minAlignedFraction` will be higher.
- When choosing `--Align.minAlignedMarkerCount`, keep in mind that, when not using RLE, there is a marker every 10 bases on average (controlled by `--Kmers.probability 0.1`). When using RLE, there is a marker every 10 RLE bases, on average. However, 10 RLE bases correspond on average to about 14 real bases in typical genome sequence, due to the presence of repeated bases. So, for example, if you use `--Align.minAlignedMarkerCount 1000`, that corresponds to 10 kb when not using RLE (R10 configuration) and 14 kb when using RLE (R9 configurations). So it is likely that the optimal value of `--Align.minAlignedMarkerCount` will be lower when using RLE.

I am closing this due to lack of discussion. Please reopen it or create a new issue if additional discussion topics arise.
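For concreteness, a hedged sketch of a run along the lines of the list above (the input file and output directory are placeholders; the two `Align` values are illustrative starting points to be tuned, not recommendations):

```sh
shasta \
    --input reads.fasta \
    --config Nanopore-May2022 \
    --ReadGraph.creationMethod 0 \
    --Align.minAlignedFraction 0.85 \
    --Align.minAlignedMarkerCount 700 \
    --assemblyDirectory ShastaRun-May2022-tuned
```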
Dear @paoloshasta,
I am using `shasta` to assemble a challenging plant genome (~4 Gb, ~80% repetitive content, ~1.74% heterozygosity, etc.). With a cutoff at 10 kb, I have around 62x coverage of R10 Nanopore data. Since Shasta is not yet optimized for this kind of data on plant genomes, I have read through the issue tracker (e.g. #1) to guide myself.

At first, I used the recently created configs `Nanopore-R10-Fast-Nov2022.conf` and `Nanopore-R10-Slow-Nov2022.conf`, but they both gave poor results. The first assembly is one done with Flye (although only with around 40x), which is not great either but is useful to compare. As you can see, both the `Fast` and `Slow` configs struggled with this genome (especially the `Slow` one, which threw away 95% of the genome for some reason; I can give you the log of that run if you are interested).

However, I got curious and also tried the 2021 configuration for plants (last row), and this one yielded much better results (at least at the Mb scale). My reasoning was that, even though R10 is a great improvement over R9, the gap shrinks when dealing with plants, since their genomes are more complex, with more unusual modifications, and Nanopore benchmarks mostly against human and bacterial genomes. Thus, I reasoned that my data would be closer to the old R9 plants configuration than to the newer `FastR10` one. This seems to be the case, or maybe it is just a lucky strike.

Anyway, since I used an old configuration and R10 is remarkably better than R9 (even for plant genomes), I think there is still room for improvement, so I wanted to ask for your help in further configuring Shasta to better fit this data. This is the command line I used for the last assembly:
I added `--Assembly.consensusCaller Modal` because it is what is being used in the more recent configurations (i.e. `FastR10`); is that reasonable to you?

These are the assembly summary and the log: AssemblySummary.html.txt stdout.log
So, how would you further tweak the configuration to improve that assembly? Since the current configuration expects noisier R9 reads, one way to go is to be stricter with the quality of the reads/alignments, since the error rates for R10 are better. Which parameters specifically would you tweak?
Thank you in advance. Sivico
P.S. I am amazed at how fast `Shasta` is compared to other assemblers. I am seeing hundreds of hours needed for other assemblers, compared to just a few for `Shasta`. How did you achieve that? Anywho, big kudos for that.