paoloshasta / shasta

De novo assembly from Oxford Nanopore reads.
https://paoloshasta.github.io/shasta/

Why is a worse assembly obtained when more read files are assembled? #12

Closed: ElviraTo closed this issue 1 year ago

ElviraTo commented 1 year ago

Hi, I really appreciate your assembler and I have used it to assemble a eukaryotic genome. When I assembled the sequences contained in a single file I obtained a fairly good and contiguous assembly (about 5000 contigs with an N50 of about 13 Mb). However, when I assembled four files containing reads from four sequencing runs of the same individual, I obtained a worse result: about 90000 contigs with an N50 of about 58000 bases. Could you help me solve this problem and improve my results? To specify multiple input files, I entered them separated by spaces after "--input". Are there other parameters I can set? Thanks.

paoloshasta commented 1 year ago

There are several possible reasons why this might be happening. To allow me to investigate, please post AssemblySummary.html here for the successful assembly with the 13 Mb N50 and for the unsuccessful assembly that used all four input files. Also please post the expected genome size, if available. Finally, it would be useful to have some information about the reads: whether they are ONT R9 or R10, and the base caller version and "model" used for base calling (that is, whether "super" accuracy or not).

Depending on the diagnosis, it may be possible to tweak some assembly parameters to improve the higher coverage assembly.

The way you specify the four input files is correct.
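
For reference, here is a minimal sketch of that kind of invocation. The file names and the --config value are hypothetical; substitute whatever assembly configuration is appropriate for your reads:

    # Hypothetical file names; multiple input files are passed
    # space-separated after --input, exactly as described above.
    shasta \
        --input run1.fastq run2.fastq run3.fastq run4.fastq \
        --config Nanopore-May2022 \
        --assemblyDirectory ShastaRun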

ElviraTo commented 1 year ago

This is the result for the successful assembly: Shasta_assembly_summary_run1.pdf

And this is the result for the assembly of the four runs: Shasta_assembly_summary_4runs.pdf

The expected genome size is about 2.7-2.8 Gb. The reads are ONT R9; the base caller used was Guppy 6.3 with the "super accurate" model. Thank you so much!

paoloshasta commented 1 year ago

The assembly configurations that come with Shasta are optimized for coverage around 40x to 80x. Your second assembly used 386 Gb of reads, corresponding to around 140x coverage for that genome size (386 / 2.75 ≈ 140).

If you look at the assembly summaries, you see that the read graph for the second assembly contains only 22 million alignments for 34 million reads, and as a result 73% of the bases in the read graph are isolated. That means the assembly is, ironically, working at very low effective coverage (roughly 0.27 × 140x ≈ 38x, since isolated reads cannot contribute to the assembly). In contrast, the read graph for the first assembly contains 32 million alignments for 8 million reads, and the fraction of isolated bases is only a healthy 14%. This is caused by the automated adaptive selection of alignment criteria (--ReadGraph.creationMethod 2), which only works well under standard conditions at lower coverage.

It is probably possible to tweak assembly parameters to assemble successfully at higher coverage, and I can help with that if you want to pursue that option. But in your case, since the reads are fairly short (15 Kb N50), I suggest first trying an easier option that is likely to work well: just increase the read length cutoff to bring the coverage used in the assembly down to around 80x, corresponding to around 210 Gb for your genome size. This will significantly increase the read N50 seen by the assembly. This is what I do when I have extra coverage, and in my experience it usually improves the assembly N50 and its completeness significantly.

Mechanically, there are two equivalent ways to do this:

  1. Use --Reads.desiredCoverage to specify the desired amount of coverage. The assembler will automatically increase the read length cutoff to a value that achieves the desired coverage. In your case, --Reads.desiredCoverage 210Gb should do the job.
  2. Alternatively, you can use --Reads.minReadLength to directly specify a read length cutoff (in bases) that will bring coverage down to the desired value. You can use Binned-ReadLengthHistogram.csv from the second assembly to help select the cutoff. I can't suggest a value because I don't know the read length distribution of your reads, but it could be something like --Reads.minReadLength 20000. (See the sketch after this list for both options.)
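
As a sketch, either invocation could look like the following (file names are hypothetical, and only one of the two options should be used in a given run):

    # Option 1: let Shasta raise the read length cutoff automatically
    # until about 210 Gb of reads remain in the assembly.
    shasta --input run1.fastq run2.fastq run3.fastq run4.fastq \
        --Reads.desiredCoverage 210Gb

    # Option 2: set the read length cutoff directly. The 20000 here is
    # only a placeholder; pick a value from Binned-ReadLengthHistogram.csv.
    shasta --input run1.fastq run2.fastq run3.fastq run4.fastq \
        --Reads.minReadLength 20000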

I suggest that you try this first and post AssemblySummary.html here. Depending on the results, we can decide if it makes sense to try the other option of optimizing assembly parameters for higher coverage.

ElviraTo commented 1 year ago

Thank you so much for your helpful suggestions! I specified the desired amount of coverage using the "--Reads.desiredCoverage" parameter and set it to reach a coverage of about 60x, to remain in the range you indicated.
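
For a genome of about 2.75 Gb, 60x corresponds to roughly 165 Gb, so the invocation looked something like this (the exact value here is an approximation, and the file names are placeholders):

    # Roughly 60x of a ~2.75 Gb genome: 60 * 2.75 ≈ 165 Gb.
    shasta --input run1.fastq run2.fastq run3.fastq run4.fastq \
        --Reads.desiredCoverage 165Gb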

This is the new result for the combined assembly of the four runs: Shasta assembler summary.pdf

It is much better than the previous one and solves many of the problems you identified. Despite this, though, I still don't see a real improvement with respect to the single assemblies obtained from the four individual runs. I am running more experiments to try to improve the situation. In your opinion, is it possible to obtain a better result than the single assemblies?

paoloshasta commented 1 year ago

It is possible that the algorithm that chooses alignment criteria adaptively (--ReadGraph.creationMethod 2) is not doing a good job picking alignment criteria. The steps to take are different depending on whether the assembly graph is "messy" or "fragmented". Did you have a chance to look at the assembly graph in Bandage? If you did, can you post a screenshot? Alternatively, please post Assembly-BothStrands-NoSequence.gfa here and I can take a look.
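
If it helps, Bandage can also render the graph from the command line without opening the GUI (assuming Bandage is installed and on the PATH):

    # Render the assembly graph to an image suitable for posting here.
    Bandage image Assembly-BothStrands-NoSequence.gfa assembly-graph.png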

Also, please confirm which assembly configuration (command line option --config) you are using. For the future I plan to modify AssemblySummary.html to include this information.

paoloshasta commented 1 year ago

I am closing this due to lack of discussion - feel free to create a new issue as needed.