note regarding region-specific classifiers

gregcaporaso commented 3 months ago

@nbokulich, @mikerobeson, @BenKaehler - how important do you all think the region-specific classifiers are? My impression is that they offer fairly small classification performance gains relative to the full-length classifiers, and definitely smaller performance gains relative to the weighted classifiers. I'm wondering if we should be de-emphasizing the region-specific classifiers, and putting more emphasis on the weighted classifiers.

nbokulich commented 3 months ago

I agree with you, @gregcaporaso. The performance gains of region-specific classifiers are minimal (a few % accuracy). Users can always generate their own, and the memory footprint for training/classifying with these is lower as well, so it should be manageable for many users who do want to increase their accuracy a little bit. Besides, V4 is the standard but a bit old, so why do we offer only V4-specific classifiers and not, e.g., V3-V5 or V4-V5, which are increasingly used? On the other hand, many users will probably be disappointed if the V4 classifiers go away.

In an ideal world, it would be nice to have a catalog of classifiers for different commonly used regions if the workload to maintain these is not high (but I am assuming it is, otherwise you would not be asking).

If I had to choose a limited number of classifiers to host, I would rather focus on different markers/databases. UNITE and GTDB would be key resources to add, and probably more important than hosting V4-specific classifiers.

And yes, I would emphasize weighted classifiers over region-specific classifiers, as the performance gain is simply larger.

mikerobeson commented 3 months ago

I partially agree with this. In my view, the main inherent advantage of making the region-specific classifiers is to decrease the file-size and memory footprint. The latter is more of an issue for some users, at least those I meet in my various class sessions and local workshops. They simply are unable to use the full-length SILVA classifier, for example. However, they are able to download the amplicon-specific classifier and move ahead. It is sometimes hit or miss when they try to train their own amplicon classifier: some users have laptops with only 8-16 GB of RAM and do not have access to other resources.

QIIME 2 is increasingly being used for non-microbial analyses, e.g. eDNA surveys leveraging other marker gene data from GenBank. The problem is that many of the target marker sequences are contained within larger sequence records, and often the best approach is to make a region-specific classifier. Otherwise, the reference database might contain data from outside the target marker gene region, i.e. from neighboring gene sequences. My point is that we should be careful not to de-emphasize the utility of region-specific classifiers too much, as it depends on the nature of the data / database from which a user is extracting sequences. At the very least, we can be more specific about the technical reasons why making one would be beneficial (e.g. it uses fewer resources).
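To illustrate the flanking-sequence problem, here is a minimal pure-Python sketch of in-silico region extraction: it keeps only the sequence between a forward primer and the reverse complement of a reverse primer, and drops everything outside. This is not how RESCRIPt or q2-feature-classifier implement it; the function names, the toy primers, and the toy record are all hypothetical, and real tools also tolerate mismatches, which this sketch does not.

```python
import re

# IUPAC degenerate-base codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence, including IUPAC codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a (possibly degenerate) primer into a regex fragment."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_region(record, fwd_primer, rev_primer):
    """Return the sequence between the primers (primers excluded).

    Returns None when either primer site is missing, so records that
    do not span the target region can be dropped from the reference.
    Exact matching only: no mismatches or indels are tolerated.
    """
    pattern = re.compile(
        primer_regex(fwd_primer)
        + "(.+?)"
        + primer_regex(reverse_complement(rev_primer))
    )
    match = pattern.search(record.upper())
    return match.group(1) if match else None


# Toy record: flanking "gene" sequence on both sides of the target region.
record = "TTTT" + "ACGTAC" + "GGGGCCCC" + "GGAATC" + "AAAA"
print(extract_region(record, "ACGYAC", "GATTCC"))  # GGGGCCCC
```

Records that return None have no recognizable target region and would be left out of the region-specific reference, which is also part of why the resulting classifier is smaller.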

I agree that we can point users to the RESCRIPt tutorials, etc., if they would like to construct their own region-specific classifiers. That said, is there anything to really maintain once our refdb Nextflow pipeline goes public? Right now it will generate a full-length and several region-specific classifiers for GTDB, SILVA, and RDP with one command. I plan to add SILVA LSU and UNITE to that pipeline too.

I agree with providing GTDB and UNITE. In fact, I recently submitted a PR to RESCRIPt to fetch the latest version of GTDB (v220). But we still need to keep SILVA for the eukaryotic 18S.

I should say that I have noticed that GTDB tends to classify chloroplast sequences incorrectly as legitimate bacterial sequences, though I've not looked to see if this is still the case with v220. I've been experimenting with using SILVA to find mitochondrial, chloroplast, and outgroup eukaryotic sequences, removing them, and then classifying again with GTDB. Remember, even with GTDB, it is still possible to spuriously classify a read as Bacteria / Archaea when that sequence is really a eukaryote. It is rarer for a eukaryote to be classified all the way down to a bacterial phylum, but it happens. Hence my approach: SILVA first, GTDB second. Obviously, GTDB will improve in time, and we can always provide another tutorial on adding outgroups to GTDB.
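A rough sketch of the "SILVA first, GTDB second" filtering step, assuming the SILVA classifications are already available as a plain mapping of feature ID to taxonomy string. The function name, the list of off-target terms, and the example labels are illustrative assumptions, not part of any QIIME 2 or RESCRIPt API; adjust the terms to the labels used in your SILVA release.

```python
# Case-insensitive substrings that mark a SILVA label as off-target.
# Illustrative only: check them against your own SILVA taxonomy.
OFF_TARGET = ("mitochondria", "chloroplast", "eukaryota")


def split_by_silva(silva_taxonomy):
    """Partition feature IDs into (keep, removed) using SILVA labels.

    silva_taxonomy maps feature ID -> SILVA taxonomy string.
    Only the IDs in `keep` would go on to the second, GTDB-based
    classification pass.
    """
    keep, removed = [], []
    for feature_id, taxon in silva_taxonomy.items():
        label = taxon.lower()
        if any(term in label for term in OFF_TARGET):
            removed.append(feature_id)
        else:
            keep.append(feature_id)
    return keep, removed


# Hypothetical first-pass SILVA results.
silva_taxonomy = {
    "asv1": "d__Bacteria; p__Firmicutes; c__Bacilli",
    "asv2": "d__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast",
    "asv3": "d__Bacteria; p__Proteobacteria; o__Rickettsiales; f__Mitochondria",
    "asv4": "d__Eukaryota; p__Phragmoplastophyta",
}

keep, removed = split_by_silva(silva_taxonomy)
print(keep)     # ['asv1']
print(removed)  # ['asv2', 'asv3', 'asv4']
```

In a real workflow the filtering would be done on QIIME 2 artifacts rather than a dict; the point here is only the order of operations, with SILVA screening out organelle and eukaryotic reads before GTDB ever sees them.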

Okay, I am so sorry for this long-winded response. To clarify, I agree with @gregcaporaso and @nbokulich; I just wanted to bring up some caveats to consider. I also support the weighted classifiers for SSU. Regardless of what is decided, I'll continue to work on the amplicon region refdb Nextflow pipeline, as I think it'd still be useful.

nbokulich commented 3 months ago

the main inherent advantage of making the region-specific classifiers is to decrease the file-size and memory footprint

ah really good point!

gregcaporaso commented 3 months ago

Yes, all really good points! If anyone wants to issue a PR to this repo to update the note with some of these caveats, that's more than welcome!

For context, this repo and the corresponding website (https://resources.qiime2.org) are replacing our links to classifiers in the QIIME 2 docs in the 2024.5 build. There will be a link on that page redirecting users (here's that PR). We wanted to decouple these links from the docs to make them easier to edit or add to without a full rebuild of the docs.

qiime2 / resources

note regarding region-specific classifiers #7