feature request - norm or best parameter ranges in stdout.log

colindaven commented 1 year ago

Hi,

I think it's sometimes a bit difficult for inexperienced users to tune Shasta.

It would be very nice to have the following commented on in the stdout / .log if possible.

Current msg
Average number of alignment candidates per oriented read is 116.784.

Suggested msg
Average number of alignment candidates per oriented read is 116.784 (Optimal range 100-1000 or whatever)
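
As a rough illustration of the kind of annotation meant here, a minimal Python sketch follows. The helper name `annotate_metric` is hypothetical and not part of Shasta, and the 100-1000 range is only the placeholder from the suggestion above, not a validated recommendation.

```python
def annotate_metric(message, value, low, high):
    """Append a reference range to a log message and say whether the value is inside it.

    `low` and `high` are placeholder bounds supplied by the caller;
    they are not values recommended by Shasta.
    """
    status = "within" if low <= value <= high else "outside"
    return f"{message} {value:g} ({status} suggested range {low:g}-{high:g})"


print(annotate_metric(
    "Average number of alignment candidates per oriented read is",
    116.784, 100, 1000))
```

The hard part is choosing ranges that hold across genomes, data types and configurations; the formatting itself is trivial.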

Similarly, the use of --Reads.minReadLength x to reduce coverage could be calculated and reported in the log, and a warning given if coverage would go below the optimal range for that particular config.
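
The coverage calculation could look something like the sketch below. It assumes that reads at least as long as the cutoff are kept, that the user supplies an expected genome size, and that a per-config coverage range is known; the function names and the example range are hypothetical, not taken from Shasta.

```python
def coverage_after_length_filter(read_lengths, min_read_length, genome_size):
    """Estimated coverage from reads that survive the length cutoff.

    Assumes reads with length >= min_read_length are kept and that
    genome_size (in bases) is provided by the user.
    """
    kept_bases = sum(n for n in read_lengths if n >= min_read_length)
    return kept_bases / genome_size


def coverage_warning(coverage, low, high):
    """Return a log line; the range (low, high) is a placeholder, not a Shasta value."""
    if coverage < low:
        return f"WARNING: coverage {coverage:.1f}x is below the suggested range {low}-{high}x"
    if coverage > high:
        return f"WARNING: coverage {coverage:.1f}x is above the suggested range {low}-{high}x"
    return f"Coverage {coverage:.1f}x is within the suggested range {low}-{high}x"


# Toy numbers: four reads, a 10 kb cutoff and a 1 kb "genome".
cov = coverage_after_length_filter([5000, 12000, 20000, 8000], 10000, 1000)
print(coverage_warning(cov, 40, 80))
```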

Also, it might be beneficial to suggest the config or parameter that does automatic parameter selection: the use of --ReadGraph.creationMethod 2 could be flagged in the log, with an indication that --ReadGraph.creationMethod 0 might be what the user wants.

Thanks, I think these usability improvements might make the tool more manageable for those occasions when an out-of-the-box config does not yield good results.

Also, one of the key problems for us, with considerable variability, has been read quality. A recommendation for users to test 10k reads by minimap2 alignment and cramino stats might be useful, if they are getting substandard assemblies for non-human organisms.
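
For the subsampling step, a minimal sketch using reservoir sampling is shown below. It assumes plain four-line FASTQ records (no wrapped sequence lines); the function name `sample_fastq` is hypothetical.

```python
import random


def sample_fastq(lines, n, seed=0):
    """Reservoir-sample up to n records from an iterable of FASTQ lines.

    Assumes exactly four lines per record. Returns a list of records,
    each a list of four lines.
    """
    rng = random.Random(seed)
    reservoir = []
    record = []
    count = 0  # number of complete records seen so far
    for line in lines:
        record.append(line)
        if len(record) == 4:
            count += 1
            if len(reservoir) < n:
                reservoir.append(record)
            else:
                # Replace a random slot with probability n / count.
                j = rng.randrange(count)
                if j < n:
                    reservoir[j] = record
            record = []
    return reservoir


# Toy input: 10 fake records, sample 3 of them.
fake = []
for i in range(10):
    fake += [f"@read{i}\n", "ACGT\n", "+\n", "IIII\n"]
for rec in sample_fastq(fake, 3):
    print(rec[0].strip())
```

For real data, the same function can be fed an open file handle with `n=10000`, and the sampled records written to a new FASTQ file that is then aligned with minimap2 and summarised with cramino.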

I'm afraid I still have no public data I can offer with respect to plant genomes sequenced with the 10.4.1 ONT pore. The only one I will hopefully soon be able to offer is also just Arabidopsis, which has a small genome of roughly 100 Mb and is very repeat-poor, so is not a good model for most plants. That said, it is still the model plant.

Thanks again, Colin

paoloshasta commented 1 year ago

This is certainly an issue, but the problem is that I generally don't have good criteria that I feel confident about. A few years ago @bagashe created a script that you could run after an assembly completes to do some assessment of the kind you suggest. The script used the information in AssemblySummary.json (a machine-readable version of AssemblySummary.html) and some additional information provided by the user (like expected genome size) and attempted to make suggestions to improve assembly quality. However, I found it impossible to continue maintaining that script because of the variety of situations in which Shasta is used: different genome characteristics, data types, coverage, and assembly configurations.

The reality is that today, as you point out, to get the best possible assembly you often need some knowledge of the computational methods used in Shasta. It is true that in many cases the built-in assembly configurations result in reasonable "out of the box" assemblies, but when that is not the case, iteration is necessary, and that requires some intuition on what to change. I am always available to help with this process, and I have guided several users who filed issues through it, but I realize that in an ideal world we would like something better.

If you or others want to provide code or documentation that could help with this (not necessarily in a Pull Request), I would be happy to add it to the Shasta code base and integrate it as appropriate. For example, an experienced Shasta user such as you could provide a write-up that attempts to summarize their know-how. I would be happy to add something like this to the Shasta documentation, with proper credit of course.

I am also developing new methods to extract assembled sequence from the Shasta marker graph. These methods use the paradigm of "follow the reads", and the main motivation is to do a better job at resolving segmental duplications and other hard regions in human genomes. But my hope is that they will also be more robust and generally less sensitive to assembly conditions, partially alleviating the issue you bring up.

colindaven commented 1 year ago

I see. Thanks for your excellent and informative answer.

I would attempt to provide some sort of report on Shasta from a user perspective, but feel my understanding of the algorithms and settings is so limited that this might not be so useful, or might require heavy editing from your side. The docs are indeed very useful and should be re-read frequently, which I do.

I'm also aware that Nanopore data quality has dramatically improved in plants in the last 2-3 years, so I see this idea as a difficult moving target, yet probably still worth attempting. I'll try to put something simple together, for what it's worth.

It is very good to hear that you are attempting to make assembly continuity more robust to the parameter values chosen. Would you recommend using the current Shasta commit to take advantage of these new methods, or sticking to the latest release (which I have been doing)?

Thanks

paoloshasta commented 1 year ago

Given that you routinely assemble very large genomes using Shasta, you are probably one of the most expert Shasta users, and therefore a write-up summarizing your know-how would be very useful. I encourage you to write one in any form that you are comfortable with. I will add it to the docs directory and link to it from the main documentation page, giving you proper credit of course. Like the rest of the documentation, it does not need to be polished to be useful.

When you are ready to contribute something, feel free to contact me at the e-mail address in my GitHub user page so we can discuss details.

I, and I am sure other Shasta users, would be grateful for your contribution, and I encourage others who may be reading this to contribute their know-how as well.

paoloshasta commented 1 year ago

And you should continue to use the latest Shasta release. Work on new assembly methods is in progress but not ready for prime time, and it is not activated unless you use special configuration parameters. Therefore there would be no benefit to just switching to the latest code on GitHub.

I will make sure to create a new release as soon as there are user-visible benefits, but that will still be some time.

paoloshasta commented 11 months ago

I am closing this due to lack of discussion. Please open a new issue if additional topics arise.

paoloshasta / shasta

feature request - norm or best parameter ranges in stdout.log #13