Open ezherman opened 4 months ago
Hi @ezherman I have nothing but intuition to offer, I'm afraid! Hopefully some others will step up with some better experience!
> I am in the process of determining whether sourmash is the right tool for my metagenomic analyses. I am mostly using ONT data at the moment. Similar to previous posts (e.g. #2236), I have found that sourmash cannot classify most of the k-mers.
Right! I would expect that!
My question is, do you need most k-mers to be classified, or would you be satisfied with having enough k-mers be classifiable to get accurate results?
There are two, mmm, schools of thought on taxonomy. One is taxonomic binning, where you are trying to assign each sequence to a specific taxonomic group. The other is taxonomic profiling, where you are trying to identify taxonomic groups present in your entire data set.
sourmash has (so far) mostly been used for taxonomic profiling, where it is very accurate for reasons I think we can explain (combinations of k-mers, basically).
I would not suggest using sourmash directly for binning (yet), for a few reasons - the main one being that we are only just now getting to the point where it performs well with the lower scaled value needed. (See the branchwater plugin for our rather successful efforts to improve the performance of `gather`.)
However, if you want to get binning-like results, there's a third option here, and it's one you might consider: sourmash actually does genomic profiling of the composition of a metagenome with `sourmash gather`, and then "promotes" that to a taxonomy with `sourmash tax`. You could generate a genomic profile of your entire metagenome against your Very Large Database, then map your ONT reads directly to the resulting genomes. Depending on what you want, this might get you there - e.g. if you wanted to find the relevant genomes (and hence taxonomies) for your data set, you will find them this way.
We have a workflow that implements this for Illumina, called genome-grist. Happy to chat more about that here or there.
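For readers following along, the gather-then-tax route described above looks roughly like this on the command line (the database and taxonomy filenames here are placeholders for whatever you have locally):

```shell
# profile the genomic composition of the metagenome
sourmash gather metagenome.sig my-database.zip -o gather-results.csv

# "promote" the genomic profile to a taxonomic profile
sourmash tax metagenome -g gather-results.csv -t my-taxonomy.csv
```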
> I was wondering whether there is a "best practice" for using sourmash with ONT data. To aid the discussion, I thought I'd list all the potential processes and parameters that I think may help. Before trying to optimise them all, I thought I'd check whether others have had any success!
(thank you for this summary!)
> - The effect of the average read Q score. Given comments that suggest the low classification rate is due to sequencing errors (e.g. Improving results for nanopore #2236 (comment)), I have been considering filtering for, say, an average read Q score of 20. However, this can massively reduce the number of reads.
Agreed, this is a bad idea!
> - Effect of k-mer size. The main sourmash guide suggests using k=31 or k=51, with the latter recommended for strain-level resolution (here). I reckon this recommendation came out of Illumina data. In contrast, one suggestion was to choose k as small as possible given the average genome size for an optimal sensitivity/specificity trade-off (adjust default thresholding for prefetch and gather? #2360 (comment)). For a 6Mb genome, this would come to k=11 or k=13.
I was going to say k=15 ;). I agree. The only challenge here is you'd need to build your own database at the k-mer size(s) of interest. That has become much faster and easier with `manysketch` from the branchwater plugin, but it's still an extra step.
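The "as small as possible given the average genome size" heuristic above can be made concrete: pick the smallest k for which 4^k exceeds the genome size, so that a random k-mer is unlikely to occur in the genome just by chance. A quick sketch of that arithmetic (my own illustration, not sourmash code):

```python
def min_informative_k(genome_size: int) -> int:
    # smallest k such that 4**k > genome_size, i.e. a random k-mer
    # is unlikely to appear in a genome of that size by chance
    k = 1
    while 4 ** k <= genome_size:
        k += 1
    return k

# for a 6 Mb genome this lands in the k=11-13 range discussed above
print(min_informative_k(6_000_000))  # -> 12
```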
> - Effect of k-mer abundance. `sourmash sig filter` allows for removal of k-mers of a certain abundance (e.g. 1), which are more likely to have come from sequencing errors.
Yep. It's likely to reduce your computational costs and maybe your sensitivity, but should not have any effect on specificity.
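A sketch of the abundance-filtering step being discussed (filenames are placeholders; check `sourmash sig filter --help` for the exact flags in your version):

```shell
# keep only hashes seen at least twice, dropping likely error k-mers
sourmash sig filter --min-abundance 2 sample.sig -o sample.minabund2.sig
```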
> - Effect of `--threshold-bp`: computational requirements go up with a lower threshold; however, I understand this to also increase sensitivity (adjust default thresholding for prefetch and gather? #2360 (comment)).
Yep. I would say that the computational requirements are less of an issue now, with the branchwater plugin.
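For concreteness, lowering the reporting threshold from the default might look like this (filenames are placeholders):

```shell
# the default --threshold-bp is 50000; lowering it increases sensitivity
# at some computational cost
sourmash gather sample.sig my-database.zip --threshold-bp 5000 -o results.csv
```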
> If anyone has had success with these adjustments on ONT data, or is aware of any other modifications, I would really appreciate your input.
If you can identify a simple benchmark data set for us to try out (Maybe the one from Portik et al would work - @bluegenes ?) we could potentially do some of this for you...
Just a quick follow-up - on re-reading the above, I wanted to make it clear: unknown/non-matching k-mers are not a problem for sourmash. They are just ignored, except in some summary statistics.
You might just try running at k=21 (for which databases are available) and see what happens.
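If it helps, sketching ONT reads at k=21 for use against the prepared databases might look roughly like this (the read filename is a placeholder):

```shell
# sketch reads at k=21 with abundance tracking
sourmash sketch dna -p k=21,scaled=1000,abund reads.fastq.gz -o reads-k21.sig
```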
Thank you @ctb for your detailed response! Below some thoughts/comments.
I appreciate your overview of profiling vs mapping vs profiling + alignment. This is a great overview of the options, thank you! At first, I am looking for a taxonomic profile that accurately represents the species in my sample, to the extent in which they are represented in my database. Secondly, I do hope to map reads back to the genomic profile reported by sourmash. Genome-grist looks great, and I would hope to implement something similar for ONT data. But for now, my focus is on improving the genomic profile. I would be happy to continue the conversation regarding alignment on the genome-grist page.
Thank you for highlighting `manysketch`, and the branchwater plugin more generally. I will play around with k=21 for now, but I will check out `manysketch` if I do try to create databases at shorter k-mer lengths.
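Should the custom-database route become necessary, a `manysketch` invocation might look roughly like this (the CSV lists name, genome_filename, and protein_filename per record; all filenames here are placeholders, so check the branchwater plugin docs for the current interface):

```shell
# sketch many genomes in parallel at a non-standard k-mer size
sourmash scripts manysketch genomes.csv -p dna,k=15,scaled=1000 -o db-k15.zip
```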
Having played a bit with `fastgather`, it looked like the output CSV is more limited in its set of columns, is that right? I find the output of `gather` very useful, but I guess `fastgather` in its current form could serve the purpose of obtaining a minimum metagenome cover for downstream mapping (genome-grist style). It’s great to see how much faster it is than `gather`.
As for single-abundance k-mers: so far, I have found that removing single-abundance k-mers massively reduces the size of my signature file (e.g. ~11mb to ~650kb). But I haven’t yet tested this on many samples, nor in any systematic way.
> If you can identify a simple benchmark data set for us to try out (Maybe the one from Portik et al would work - @bluegenes ?) we could potentially do some of this for you...
It would be great to work together on this! Most of my samples are from undefined communities, so they may not be ideal for benchmarking. I’d be very interested to hear if @bluegenes has any suggestions.
> Thank you @ctb for your detailed response! Below some thoughts/comments.
>
> I appreciate your overview of profiling vs mapping vs profiling + alignment. This is a great overview of the options, thank you! At first, I am looking for a taxonomic profile that accurately represents the species in my sample, to the extent in which they are represented in my database. Secondly, I do hope to map reads back to the genomic profile reported by sourmash. Genome-grist looks great, and I would hope to implement something similar for ONT data. But for now, my focus is on improving the genomic profile. I would be happy to continue the conversation regarding alignment on the genome-grist page.
Sure, here or there. genome-grist is kind of suffering from neglect at the moment and needs to be updated...
> Thank you for highlighting `manysketch`, and the branchwater plugin more generally. I will play around with k=21 for now, but I will check out `manysketch` if I do try to create databases at shorter k-mer lengths.
👍 it's quite a game changer in terms of speed!
> Having played a bit with `fastgather`, it looked like the output CSV is more limited in its set of columns, is that right? I find the output of `gather` very useful, but I guess `fastgather` in its current form could serve the purpose of obtaining a minimum metagenome cover for downstream mapping (genome-grist style). It’s great to see how much faster it is than `gather`.
So, three thoughts here - and first, I feel like we need to apologize for the construction mess while we're "in transition" -

1. you can use the `fastgather` results as a picklist to generate full gather results pretty easily;
2. there's `calc-full-gather` over here that actually does it directly, too;
3. `fastmultigather` with a rocksdb index actually produces a complete set of gather results, because @bluegenes needed it and put the time into implementing it 😅. This is still buyer-beware code in terms of setting it up and running (we've had some reports that fastmultigather is acting up - ref, ref, ref - but the results are legit once you get them!)

> As for single-abundance k-mers: so far, I have found that removing single-abundance k-mers massively reduces the size of my signature file (e.g. ~11mb to ~650kb). But I haven’t yet tested this on many samples, nor in any systematic way.
I personally don't like removing low-abundance k-mers, since it reduces sensitivity ;).
>> If you can identify a simple benchmark data set for us to try out (Maybe the one from Portik et al would work - @bluegenes ?) we could potentially do some of this for you...
>
> It would be great to work together on this! Most of my samples are from undefined communities, so they may not be ideal for benchmarking. I’d be very interested to hear if @bluegenes has any suggestions.
:)
> you can use the `fastgather` results as a picklist to generate full gather results pretty easily
Ah, great! I will give this a try.
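A sketch of that picklist trick, assuming the fastgather CSV includes a `match_md5` column (column names have shifted across plugin versions, so check your output header first; filenames are placeholders):

```shell
# restrict a full gather run to the matches fastgather already found
sourmash gather sample.sig my-database.zip \
    --picklist fastgather-results.csv:match_md5:md5 -o full-gather.csv
```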
> I personally don't like removing low-abundance k-mers, since it reduces sensitivity ;).
Makes sense. It would be a shame to miss out on low abundance organisms, so it probably won't be my approach either.
I've written a short `fastmultigather` quickstart here: https://github.com/sourmash-bio/sourmash/issues/3095.
note: as of sourmash_plugin_branchwater v0.9.5 (link), the results from `fastgather` and `fastmultigather` are now identical to those from `sourmash gather`. So a lot of the caveats above are now moot! 🎉
Thank you @ctb! I hope to make time soon to implement the latest version of the plugin.