nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License

Add classifier artifacts to Docker image? #113

Closed · erikrikarddaniel closed this 4 years ago

erikrikarddaniel commented 4 years ago

I've noticed that the SILVA classifier is built on each run of the pipeline, which seems like a waste of resources to me. Wouldn't it be better to add the artifacts to the Docker image? I realize this would both make the image larger and complicate the build procedure. Moreover, we will need to add support for several other databases.

d4straub commented 4 years ago

The current classification method performs best when the classifier is trained for the primer pair that was used for amplicon generation, so each primer pair needs a different classifier (though a previously trained classifier can be supplied to skip classifier training). According to the QIIME2 forum, a classifier trained on full-length 16S sequences performs only slightly worse, but I haven't tested that myself. So I can't recommend using a single standard SILVA classifier.
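For reference, a minimal sketch of what per-primer-pair training looks like with the QIIME2 CLI, assuming the SILVA reference sequences and taxonomy have already been imported as .qza artifacts; the file names and the V3V4 primer sequences are placeholders:

```bash
# Trim the reference sequences to the amplified region (example V3V4 primers)
qiime feature-classifier extract-reads \
  --i-sequences silva_ref_seqs.qza \
  --p-f-primer CCTACGGGNGGCWGCAG \
  --p-r-primer GACTACHVGGGTATCTAATCC \
  --o-reads silva_v3v4_seqs.qza

# Train a naive Bayes classifier on the extracted region
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads silva_v3v4_seqs.qza \
  --i-reference-taxonomy silva_ref_taxonomy.qza \
  --o-classifier silva_v3v4_classifier.qza
```

The resulting classifier artifact is tied to that primer pair, which is why a single classifier baked into the image would not cover everyone.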

But maybe it would be advantageous to use another classification method, SEPP. I haven't investigated SEPP in depth; that method probably does not require training for each primer pair.

edit: clarification

erikrikarddaniel commented 4 years ago

You're right, but there are more things to consider. First of all, a lot of people in Sweden use the same primer pair (V3V4), so a specially trained classifier for that would be useful. Second, a lot of people follow the advice you cite: full-length 16S works fine. Third, there are other databases: GTDB, PR2, UNITE, etc.

In our project we have promised to supply several different databases for 16S and for eukaryotic long reads, i.e. parts of the SSU, ITS and LSU. One could of course set up a web server where the data is collected and then fetched by the workflow, but creating the artifact takes forever, and it seems unnecessary to do this over and over again. At the same time, I understand that you don't want very large Docker images that take forever to build. As I see it, the best option would perhaps be to store some ready-made classifier artifacts on a server, which the user could choose between, plus give the option to supply the trained data as an artifact. In the latter case we would of course point to instructions on how to create an artifact, or build a separate workflow for that.

d4straub commented 4 years ago

The variety of databases, primers, and versions makes integrating a trained classifier into the container less desirable in my opinion. If you repeatedly use the same database, train the classifier once, save it somewhere you have access to, and specify it with --classifier for all subsequent analyses. If you would like to use full-length databases, pre-trained classifiers are available for download. Training a classifier with SILVA v132 takes around 2 hours, which I wouldn't call forever; a few environmental samples (i.e. diverse taxa) easily need 12 hours in DADA2. So in my experience, training the classifier is by no means the limiting step.
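A sketch of that reuse, assuming a classifier trained as in the snippet above; the parameter names follow the pipeline documentation, so check the docs of your release, and all paths and primers are placeholders:

```bash
# One-time training already done; reuse the saved classifier artifact in every run
nextflow run nf-core/ampliseq \
  -profile docker \
  --reads "data" \
  --FW_primer CCTACGGGNGGCWGCAG \
  --RV_primer GACTACHVGGGTATCTAATCC \
  --metadata "Metadata.tsv" \
  --classifier "silva_v3v4_classifier.qza"
```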

However, I totally agree that having more choice of reference databases would be good. Some time ago I opened an issue about it but never had the immediate need to implement it, see #23. Here is a collection of QIIME2 databases (that still need to be trained).
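Whatever database is chosen, the raw reference files would first need to be imported as QIIME2 artifacts before the training step sketched above, roughly like this (file names are placeholders and the taxonomy input format depends on the database):

```bash
# Import the reference sequences
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path ref_sequences.fasta \
  --output-path ref_seqs.qza

# Import the matching taxonomy (tab-separated, no header line)
qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path ref_taxonomy.tsv \
  --output-path ref_taxonomy.qza
```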

d4straub commented 4 years ago

I think we concluded that we will not add classifiers to the Docker image, but that adding more choice of reference databases is desirable; that is already tracked in #23.