picrust / picrust2

Code, unit tests, and tutorials for running PICRUSt2
GNU General Public License v3.0

Qiime2 & correction method & null reference #130

Closed JiqiuWu closed 3 years ago

JiqiuWu commented 4 years ago

Hi,

Thank you in advance!

I have a couple of questions about PICRUSt2 (hopefully not stupid ones). Could you please give me some pointers if you have time?

(1) When performing PICRUSt2 in QIIME 2, how can I get the ASV contribution to each predicted function and the NSTI value of each pathway? Which parameter produces that output?

(2) You mentioned that "As in PICRUSt1, ASVs are corrected by their 16S rRNA gene copy number and then multiplied by their functional predictions to produce a predicted metagenome". Which correction method did you use, and why? Did you use negative binomial regression or zero-inflated negative binomial regression?

(3) You used a "null reference", but I don't understand how you made it or what purpose it serves.

Many thanks, Jiqiu

gavinmdouglas commented 4 years ago

Hi @JiqiuWu,

1) Unfortunately you can't get that output with the QIIME 2 plugin, but you can with the standalone version. You can see the tutorial for examples.

2) This correction isn't based on a distribution. Instead, 16S copy number is predicted like any other trait, and the relative abundance of each ASV is then divided by its predicted copy number.

3) The "null" category in the PICRUSt2 paper is the concordance with shotgun metagenomics data you would get if predictions were simply taken to be the mean gene abundance across all reference genomes. This category highlights that you get seemingly high concordance just because certain genes are always more common (e.g. housekeeping genes). We felt it was important to compare against this to get a sense of the true baseline, i.e. how well you would expect random predictions to perform.
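Points 2) and 3) above can be illustrated with a minimal Python sketch. This is not the PICRUSt2 implementation: the function names, the dictionary layout, and the toy ASV, genome, and gene family values are all assumptions made up for illustration. It only shows the arithmetic described above, which is to divide each ASV's abundance by its predicted 16S copy number and multiply by its predicted gene content, and to build the "null" prediction as the mean gene abundance across reference genomes.

```python
def predict_metagenome(asv_abundance, copy_number, gene_content):
    """Predict gene family abundances for one sample.

    asv_abundance: {asv_id: relative abundance in the sample}
    copy_number:   {asv_id: predicted 16S copy number}
    gene_content:  {asv_id: {gene_family: predicted copies per genome}}
    """
    metagenome = {}
    for asv, abundance in asv_abundance.items():
        # Copy number correction: divide abundance by predicted 16S copies.
        corrected = abundance / copy_number[asv]
        # Multiply the corrected abundance by each predicted gene count.
        for gene, copies in gene_content[asv].items():
            metagenome[gene] = metagenome.get(gene, 0.0) + corrected * copies
    return metagenome


def null_prediction(reference_genomes):
    """Mean abundance of each gene family across all reference genomes.

    reference_genomes: {genome_id: {gene_family: copies}}
    Gene families missing from a genome count as zero copies.
    """
    genes = {g for genome in reference_genomes.values() for g in genome}
    n = len(reference_genomes)
    return {
        g: sum(genome.get(g, 0) for genome in reference_genomes.values()) / n
        for g in sorted(genes)
    }


# Toy inputs (hypothetical values, not real data).
asv_abundance = {"ASV1": 0.5, "ASV2": 0.5}
copy_number = {"ASV1": 2, "ASV2": 4}
gene_content = {
    "ASV1": {"K00001": 1, "K00002": 2},
    "ASV2": {"K00001": 3},
}
reference_genomes = {
    "G1": {"K00001": 1, "K00002": 2},
    "G2": {"K00001": 3},
}

predicted = predict_metagenome(asv_abundance, copy_number, gene_content)
null = null_prediction(reference_genomes)
print(predicted)  # {'K00001': 0.625, 'K00002': 0.5}
print(null)       # {'K00001': 2.0, 'K00002': 1.0}
```

The null profile is the same for every sample, so any concordance it reaches with metagenomics data reflects only which genes are generally common, not anything learned from the 16S data.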

Hopefully that helps!

JiqiuWu commented 4 years ago

Thanks for your fast reply! @gavinmdouglas Your answer definitely helps me a lot!

But I have several further questions: (1) In the PICRUSt2 paper, when you "compare the results of differential abundance tests on 16S-predicted metagenomes to MGS data", which method did you use to identify differential abundance: LEfSe, edgeR, or something else? And why did you use it? In addition, you only mentioned KO results. Have you compared other functional profiles, like MetaCyc or KEGG? How about their F1 scores?

(2) "All prediction tools displayed relatively low precision" when comparing the results of differential abundance. What factors do you think lead to this? Is the prediction itself not reliable, is the statistical method not reliable, or is it something else?

(3) Do you think we can form hypotheses based on the results of PICRUSt2 and downstream differential analysis with LEfSe, and then design animal experiments to verify those hypotheses?

Many thanks, Jiqiu

gavinmdouglas commented 4 years ago

No worries!

1) You should take a look at the supplementary materials for details on these analyses, but that analysis is based on Wilcoxon tests after normalizing the data with MUSiCC. This method isn't necessarily better than others, but we essentially hypothesize that it may be more reliable. You can see the other statistical methods tested and the F1 scores in the supplementary materials (the results for MetaCyc pathways are reported there as well; they are described in the main text but not plotted). Note that I wouldn't recommend LEfSe or edgeR for microbiome analyses in general. In my experience these tools result in many more significant hits than other methods when applied to ASV data, which makes me think they are false positives.

2) Unfortunately I'm not sure what the relative contributions of these technical factors are, but both of the ones you mentioned are likely important contributors.

3) I would expect LEfSe to perform even worse than the statistical methods we tried in the manuscript, so I think it would be very difficult to reliably identify, based on this datatype alone, any individual significant difference that you would like to confirm with experiments. When digging into any individual gene family or pathway, it would be important to have some other reason for thinking it might differ in an experiment.
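For readers unfamiliar with the precision and F1 scores discussed in this thread, here is a minimal Python sketch of how such scores can be computed when significant hits from predicted metagenomes are compared against significant hits from shotgun metagenomics (MGS) data. Treating the MGS hits as the reference set is an assumption of this sketch, and the function name and gene family IDs are hypothetical; see the manuscript's supplementary materials for how the actual comparison was done.

```python
def precision_recall_f1(predicted_hits, reference_hits):
    """Score significant features from predictions against a reference set.

    predicted_hits: features called significant on predicted metagenomes
    reference_hits: features called significant on MGS data (assumed reference)
    """
    predicted_hits = set(predicted_hits)
    reference_hits = set(reference_hits)
    true_positives = len(predicted_hits & reference_hits)

    # Precision: fraction of predicted hits that the reference confirms.
    precision = true_positives / len(predicted_hits) if predicted_hits else 0.0
    # Recall: fraction of reference hits that the predictions recover.
    recall = true_positives / len(reference_hits) if reference_hits else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example with made-up gene family IDs.
predicted_significant = {"K00001", "K00002", "K00003", "K00004"}
mgs_significant = {"K00001", "K00005"}

precision, recall, f1 = precision_recall_f1(predicted_significant, mgs_significant)
print(precision, recall, round(f1, 3))  # 0.25 0.5 0.333
```

In this toy case only one of four predicted hits is confirmed, so precision is low even though half of the reference hits are recovered, which is the pattern a method with many false positives would show.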