Closed mikerobeson closed 2 years ago
Thanks @mikerobeson - I agree with these suggestions and I'll make this change.
NOTE: This is presented out of order in reference to the workshop schedule.
We're planning to adjust the workshop schedule so that taxonomy is assigned before the phylogenetic reconstruction step so we can apply this step. Sorry for the confusion!
@mikerobeson, I'm thinking of adapting the --p-exclude value to --p-exclude 'p__;,Chloroplast,Mitochondria'. Does that work for you?
Since we're using Greengenes here, the other terms you have in there shouldn't hit anything (please correct me if I'm wrong about that). I think keeping 'Eukaryota' might be confusing since GG only annotates bacteria and archaea. Also, would your filter toss sequences that had valid (say) family assignments but were labeled 'Unassigned' at the genus level?
Good points @gregcaporaso! The reason why I added both Unassigned,Unclassified is that one of the two works with classified SILVA output, and the other with classified Greengenes output... err.. or something like that? I know that when I pull up a barplot from some of my datasets, I'd have in Level 1 of my barplot the following list: k__Bacteria, k__Archaea, Unclassified
There might be potential for either of the terms, Unassigned or Unclassified, to be within a lower level taxonomy in the form of g__unclassified_thingy. I do not think I've observed that and I could be totally wrong. But I have seen g__uncultured_thingy, etc... Either way I think what you propose works well.
Perhaps the filtering could be safely done in two steps. The one you propose, followed by a similar command run on that output with --p-mode 'exact' --p-exclude 'Unassigned,Unclassified'. Perhaps add as a Note / Tip?
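To make the reason for the second, exact-mode pass concrete, here is a rough pure-Python sketch of the two matching modes. It is not QIIME 2's implementation: the helper names and the toy taxonomy strings are made up for illustration, and it assumes that the default "contains" mode behaves like a case-insensitive substring match while "exact" compares the whole taxonomy string.

```python
def matches(taxon, term, mode):
    """Hypothetical helper: does `term` hit `taxon` under the given mode?"""
    if mode == "exact":
        # Assumption: 'exact' compares the complete taxonomy string.
        return taxon == term
    # Assumption: 'contains' is a case-insensitive substring match.
    return term.lower() in taxon.lower()


def exclude_terms(taxonomy, terms, mode):
    """Drop every feature whose taxonomy is hit by any comma-separated term."""
    term_list = terms.split(",")
    return {
        feature_id: taxon
        for feature_id, taxon in taxonomy.items()
        if not any(matches(taxon, term, mode) for term in term_list)
    }


# Toy annotations (invented for this example).
taxonomy = {
    "f1": "k__Bacteria; p__Firmicutes; c__Clostridia; o__; f__; g__unclassified_thingy",
    "f2": "Unassigned",
    "f3": "Unclassified",
    "f4": "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria",
}

kept_contains = exclude_terms(taxonomy, "Unassigned,Unclassified", "contains")
kept_exact = exclude_terms(taxonomy, "Unassigned,Unclassified", "exact")

print(sorted(kept_contains))  # f1 is lost because of its genus label
print(sorted(kept_exact))     # f1 survives; only the bare labels are removed
```

Under these assumptions, a substring match on Unclassified would also discard a feature with a valid class-level assignment, which is why restricting that pass to exact matches looks like the safer choice.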
Thanks @mikerobeson. I just made some edits in #38. I checked this data set and didn't notice the "Unclassified" issue that you mentioned, but you're right - I've definitely seen that before. I just checked and it actually pops up in the Moving Pictures tutorial. Since it's not an issue with this data set, I opted instead to use it as an opportunity to plug the forum. I added a note admonition that talks about filtering on these terms, and linked the forum post you reference in this issue to let readers know they can find useful suggestions by reading the forum.
Oh, and I just realized that --p-include 'p__', which is part of this command, would filter the features that are exclusively labeled "Unassigned" (as in the Moving Pictures tutorial dataset) - so maybe that's why it's not coming up here.
Oh right! Now that you mention it, that must be the other reason why I've recommended the combination --p-include 'p__' --p-exclude 'p__;'. I totally misremembered my own reasoning, or just confused myself at some point. Ack! So... the Unassigned,Unclassified can go. Hah!
Addressed in #38.
Given the filtering-step as outlined here, I'd recommend using the following command, or a variant of it, which I pulled from this post:
Note that I set --p-exclude 'p__;,... . This is more explicit at removing taxa that have only the p__ rank, i.e. no accompanying taxonomic label. That is, --p-include 'p__' will keep k__Bacteria; p__Proteobacteria; as well as any data that has an empty phylum rank such as k__Bacteria; p__;, which technically has no phylum classification.

Yes, the --p-include 'p__' in the command above might be redundant and not needed with the given exclude command. I only place it there for the sake of completeness and explicitness for teaching the difference between p__ and p__;. :-)

Or simply mention that it is recommended that plastid / organellar, and perhaps even host, sequences be removed, especially considering that mitochondria are a "family" within the class Alphaproteobacteria (phylum Proteobacteria), and chloroplasts are a "class" within the phylum Cyanobacteria. So, if the user does not look at the family or class level they may inadvertently retain these sequences.
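For anyone who wants to see the difference between p__ and p__; without running the full pipeline, here is a small pure-Python sketch of the include/exclude logic. It is only an approximation under stated assumptions: the function name filter_taxa and the toy taxonomy strings are hypothetical, and matching is modeled as a case-insensitive substring test, which is how I understand the default mode to behave.

```python
def filter_taxa(taxonomy, include, exclude=""):
    """Keep features hit by any include term and by no exclude term.

    Hypothetical stand-in for the include/exclude behavior discussed
    above; terms are comma-separated and matched as case-insensitive
    substrings (an assumption, not QIIME 2's actual code).
    """
    include_terms = [t.lower() for t in include.split(",") if t]
    exclude_terms = [t.lower() for t in exclude.split(",") if t]
    kept = {}
    for feature_id, taxon in taxonomy.items():
        lowered = taxon.lower()
        included = any(t in lowered for t in include_terms)
        excluded = any(t in lowered for t in exclude_terms)
        if included and not excluded:
            kept[feature_id] = taxon
    return kept


# Toy Greengenes-style annotations (invented for this example).
taxonomy = {
    "classified": "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
    "empty_phylum": "k__Bacteria; p__; c__; o__; f__; g__; s__",
    "unassigned": "Unassigned",
    "kingdom_only": "k__Bacteria",
    "chloroplast": "k__Bacteria; p__Cyanobacteria; c__Chloroplast",
    "mitochondria": "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; "
                    "o__Rickettsiales; f__mitochondria",
}

include_only = filter_taxa(taxonomy, "p__")
full_filter = filter_taxa(taxonomy, "p__", "p__;,Chloroplast,Mitochondria")

print(sorted(include_only))  # still contains the empty-phylum and organelle rows
print(sorted(full_filter))   # only the feature with a real phylum label remains
```

In this sketch the include term alone removes the rows with no phylum rank at all, the p__; term removes the row whose phylum rank is present but empty, and the organelle terms catch sequences that would otherwise pass because they carry a valid phylum label.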
NOTE: This is presented out of order in reference to the workshop schedule. That is, the material for taxonomic classification occurs after the phylogeny bit. So, perhaps this should be mentioned as something to consider later on to avoid user confusion? That is something like "If you already have taxonomy information you can also perform additional filtering like so..."