qiime2 / cancer-microbiome-intervention-tutorial

JupyterBook for QIIME 2 FAES January 2022 workshop

Modify taxonomy filtering command in phylogeny tutorial #36

Closed mikerobeson closed 2 years ago

mikerobeson commented 2 years ago

Given the filtering step as outlined here, I'd recommend using the following command, or a variant of it, which I pulled from this post:

qiime taxa filter-table \
    --i-table table.qza \
    --i-taxonomy taxonomy.qza \
    --p-mode 'contains'  \
    --p-include 'p__' \
    --p-exclude 'p__;,Eukaryota,Chloroplast,Mitochondria,Unassigned,Unclassified' \
    --o-filtered-table ./table-no-ecmu.qza

Note that I set --p-exclude 'p__;,...'. This more explicitly removes taxa that have only the p__ rank prefix, i.e. no accompanying taxonomic label. That is, --p-include 'p__' will keep k__Bacteria; p__Proteobacteria; as well as any annotation with an empty phylum rank, such as k__Bacteria; p__;, which technically has no phylum classification.

Yes, the --p-include 'p__' in the command above might be redundant given the exclude terms. I only included it for completeness and to be explicit when teaching the difference between p__ and p__;. :-)
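
To make the p__ versus p__; distinction concrete, here is a rough shell sketch that imitates substring matching with grep. It does not call QIIME 2: the three lineage strings are made-up examples in Greengenes style, and grep -F is only an approximation of how a "contains" search treats each term.

```shell
# Hypothetical Greengenes-style annotations (illustration only, not real data).
lineages='k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria
k__Bacteria; p__; c__; o__; f__; g__; s__
k__Archaea; p__Euryarchaeota; c__Methanobacteria'

# Include step: keep anything containing "p__" (all three lines match,
# including the one with an empty phylum rank).
included=$(printf '%s\n' "$lineages" | grep -F 'p__')

# Exclude step: drop anything containing "p__;" (the empty phylum rank).
kept=$(printf '%s\n' "$included" | grep -vF 'p__;')

printf '%s\n' "$kept"
```

Only the Proteobacteria and Euryarchaeota lines should survive both steps, which is the behavior the include/exclude pair above is meant to teach.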

Or simply mention that it is recommended to remove plastid / organellar, and perhaps even host, sequences, especially considering that mitochondria are a "family" within the class Alphaproteobacteria (phylum Proteobacteria), and chloroplasts are a "class" within the phylum Cyanobacteria. So, if users do not look at the family or class level, they may inadvertently retain these sequences.
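
As a small illustration of why organelle reads are easy to miss, the sketch below compares a phylum-level view with a search over the full lineage string. The lineages are hypothetical Greengenes-style examples, and the case-insensitive match (grep -i) is an assumption made so that a lowercase f__mitochondria label is still found. It is a grep imitation, not a QIIME 2 call.

```shell
# Hypothetical Greengenes-style lineages (illustration only).
lineages='k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__mitochondria
k__Bacteria; p__Cyanobacteria; c__Chloroplast; o__Streptophyta
k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales'

# Truncate each lineage to kingdom and phylum, as a phylum-level summary would.
phylum_view=$(printf '%s\n' "$lineages" | cut -d';' -f1-2)

# Organelle terms are invisible at phylum level
# ("|| true" keeps a zero-match grep from counting as a failure).
hits_at_phylum=$(printf '%s\n' "$phylum_view" | grep -ciE 'chloroplast|mitochondria' || true)

# Searching the full lineage string finds both organelle annotations.
hits_full=$(printf '%s\n' "$lineages" | grep -ciE 'chloroplast|mitochondria' || true)

echo "phylum-level hits: $hits_at_phylum, full-lineage hits: $hits_full"
```

The two organelle lineages look like ordinary Proteobacteria and Cyanobacteria at phylum level, which is why filtering on the full annotation is the safer habit.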

NOTE: This is presented out of order in reference to the workshop schedule. That is, the material for taxonomic classification occurs after the phylogeny bit. So, perhaps this should be mentioned as something to consider later on to avoid user confusion? That is something like "If you already have taxonomy information you can also perform additional filtering like so..."

gregcaporaso commented 2 years ago

Thanks @mikerobeson - I agree with these suggestions and I'll make this change.

NOTE: This is presented out of order in reference to the workshop schedule.

We're planning to adjust the workshop schedule so that taxonomy is assigned before the phylogenetic reconstruction step so we can apply this step. Sorry for the confusion!

gregcaporaso commented 2 years ago

@mikerobeson, I'm thinking of adapting the --p-exclude value to --p-exclude 'p__;,Chloroplast,Mitochondria'. Does that work for you?

Since we're using Greengenes here, the other terms you have in there shouldn't hit anything (please correct me if I'm wrong about that). I think keeping 'Eukaryota' might be confusing since GG only annotates bacteria and archaea. Also, would your filter toss sequences that had valid (say) family assignments but were labeled 'Unassigned' at the genus level?

mikerobeson commented 2 years ago

Good points @gregcaporaso! The reason why I added both Unassigned,Unclassified is that one of the two works with classified SILVA output, and the other with classified Greengenes output... err.. or something like that? I know that when I pull up a barplot from some of my datasets, Level 1 of the barplot shows the following list: k__Bacteria, k__Archaea, Unclassified

There might be potential for either of the terms, Unassigned or Unclassified, to be within a lower level taxonomy in the form of g__unclassified_thingy. I do not think I've observed that and I could be totally wrong. But I have seen g__uncultured_thingy, etc... Either way I think what you propose works well.

Perhaps the filtering could be safely done in two steps: the one you propose, followed by a similar command run on that output with --p-mode 'exact' --p-exclude 'Unassigned,Unclassified'. Perhaps add this as a Note / Tip?
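
A minimal sketch of that two-step idea, wrapped in a shell function so the two calls stay together. The function name, the intermediate file names, and the output directory argument are all hypothetical, and this has not been run against the workshop data. The flags themselves are the ones already used in this thread.

```shell
# Sketch only: filter_two_step and the table-step*.qza names are placeholders.
filter_two_step() {
    table=$1
    taxonomy=$2
    outdir=$3

    # Step 1: substring ("contains") matching on phylum presence and organelles.
    qiime taxa filter-table \
        --i-table "$table" \
        --i-taxonomy "$taxonomy" \
        --p-mode contains \
        --p-include 'p__' \
        --p-exclude 'p__;,Chloroplast,Mitochondria' \
        --o-filtered-table "$outdir/table-step1.qza" || return 1

    # Step 2: exact matching, so only features whose entire annotation is
    # "Unassigned" or "Unclassified" are dropped.
    qiime taxa filter-table \
        --i-table "$outdir/table-step1.qza" \
        --i-taxonomy "$taxonomy" \
        --p-mode exact \
        --p-exclude 'Unassigned,Unclassified' \
        --o-filtered-table "$outdir/table-step2.qza"
}

# Example call (hypothetical paths):
# filter_two_step table.qza taxonomy.qza .
```

Using exact mode in the second step is what keeps a feature with a valid higher-rank assignment from being removed just because one of those words appears somewhere in its lineage. Whether the second step removes anything at all depends on what the first step has already filtered.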

gregcaporaso commented 2 years ago

Thanks @mikerobeson. I just made some edits in #38. I checked this data set and didn't notice the "Unclassified" issue that you mentioned, but you're right - I've definitely seen that before. I just checked and it actually pops up in the Moving Pictures tutorial. Since it's not an issue with this data set, I opted instead to use it as an opportunity to plug the forum. I added a note admonition that talks about filtering on these terms, and linked the forum post you reference in this issue to let readers know they can find useful suggestions by reading the forum.

gregcaporaso commented 2 years ago

Oh, and I just realized that --p-include 'p__', which is part of this command, would filter out the features that are exclusively labeled "Unassigned" (as in the Moving Pictures tutorial dataset) - so maybe that's why it's not coming up here.
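
A quick way to see this effect, again imitating the match with grep rather than running QIIME 2. The annotations are hypothetical, and grep -x stands in for a whole-annotation ("exact") comparison.

```shell
# Hypothetical annotations (illustration only).
annotations='k__Bacteria; p__Firmicutes; c__Clostridia
Unassigned
k__Bacteria; p__Bacteroidetes; c__Bacteroidia'

# Imitate --p-include 'p__': keep only lines containing "p__".
after_include=$(printf '%s\n' "$annotations" | grep -F 'p__')

# Count whole-line "Unassigned" labels left for any later exclude step
# ("|| true" keeps a zero-match grep from counting as a failure).
remaining=$(printf '%s\n' "$after_include" | grep -cxF 'Unassigned' || true)

echo "Unassigned labels remaining after include: $remaining"
```

Because the bare "Unassigned" label contains no p__ substring, the include term alone removes it, leaving nothing for an explicit exclude to catch.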

mikerobeson commented 2 years ago

Oh right! Now that you mention it, that must be the other reason why I've recommended the combination --p-include 'p__' --p-exclude 'p__;'. I totally misremembered my own reasoning, or just confused myself at some point. Ack! So... the Unassigned,Unclassified can go. Hah!

gregcaporaso commented 2 years ago

Addressed in #38.