Closed AgnesBaud closed 2 years ago
`_get_dataframe()` takes 3 arguments:
* `taxa_level`: Taxonomy level ➔ rows of the same taxon are summed up; if `taxa_level` is equal to the lowest taxonomic level, it only drops the higher taxonomic levels; the MultiIndex becomes a (single) Index
* `count_threshold`: Count threshold for a species to be considered present in a sample ➔ counts < `count_threshold` are set to 0
* `mean_taxa`: Mean threshold for a species to be kept for analysis ➔ rows with a mean across samples < `mean_taxa` are removed

What order for these steps? Should the `count_threshold` and `mean_taxa` "filtering" be done at the lowest taxonomic level (usually species) or at the taxonomic level of interest?
Also, in the cases of `plot_most_abundant()` and `plot_sample_composition_most_abundant_taxa()`, where we are dealing with relative abundance, should we compute the relative abundance before or after the "filtering" steps?
➔ With Kraken2, we don't have an "Unknown" row; only reads that have been classified are in the count dataframe, so relative abundance is already computed over classified reads only. If we remove species that have a low mean across samples because they could be false positives, their reads effectively become unclassified and shouldn't be included in the relative abundance dataframe. That is why I would compute the relative abundance dataframe on the "filtered" dataframe.
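To make the difference concrete, here is a minimal pandas sketch comparing relative abundance computed after filtering with relative abundance computed before it. The toy counts, the sample names and the threshold value are made up for illustration; this is not the project's code.

```python
import pandas as pd

# Toy count table: rows = species, columns = samples (made-up values).
counts = pd.DataFrame(
    {"s1": [90, 10, 0], "s2": [80, 18, 2], "s3": [70, 30, 0]},
    index=["sp_A", "sp_B", "sp_C"],
)

mean_taxa = 1.0  # illustrative threshold, not a recommended value

# Drop species whose mean count across samples is below the threshold.
filtered = counts[counts.mean(axis=1) >= mean_taxa]

# Relative abundance computed on the filtered table:
# each sample sums to 1 over the retained species only.
rel_after = filtered / filtered.sum(axis=0)

# For comparison: relative abundance computed first, rows dropped afterwards.
# Samples that contained a removed species no longer sum to 1.
rel_before = counts / counts.sum(axis=0)
rel_before_then_filtered = rel_before.loc[filtered.index]

print(rel_after.sum(axis=0))
print(rel_before_then_filtered.sum(axis=0))
```

In this toy table `sp_C` is removed, so sample `s2` sums to 0.98 when the relative abundance is computed before filtering, and to 1 when it is computed after.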
I'd say: count_threshold ➔ mean_taxa ➔ taxa_level ➔ relative abundance
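A rough sketch of that order with pandas, assuming a count table indexed by a MultiIndex of taxonomic levels with one column per sample. The function name `filter_then_aggregate` and the toy data are hypothetical; this is not the actual `_get_dataframe()` implementation.

```python
import pandas as pd

def filter_then_aggregate(counts, taxa_level, count_threshold=0, mean_taxa=0):
    """Hypothetical helper illustrating the proposed order of steps."""
    # 1. count_threshold: counts below the threshold are set to 0.
    df = counts.where(counts >= count_threshold, 0)
    # 2. mean_taxa: drop rows whose mean across samples is below the threshold.
    df = df[df.mean(axis=1) >= mean_taxa]
    # 3. taxa_level: sum rows sharing the same taxon at the requested level;
    #    the MultiIndex collapses to a single Index.
    df = df.groupby(level=taxa_level).sum()
    # 4. relative abundance per sample (a sample with no counts left gives NaN).
    return df / df.sum(axis=0)

# Toy data (made-up values), lowest level = species.
index = pd.MultiIndex.from_tuples(
    [
        ("Bacteroides", "B. fragilis"),
        ("Bacteroides", "B. ovatus"),
        ("Escherichia", "E. coli"),
    ],
    names=["genus", "species"],
)
counts = pd.DataFrame({"s1": [50, 3, 47], "s2": [40, 20, 40]}, index=index)

result = filter_then_aggregate(counts, "genus", count_threshold=5, mean_taxa=1)
print(result)
```

Here both thresholds act at the species level before aggregation: the 3 reads of B. ovatus in `s1` are zeroed first, so the genus totals in `s1` are 50 and 47 rather than 53 and 47.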
This is a plotting method, so filtering should be done outside of this method:
➔ removing `count_threshold`
➔ `mean_threshold` (for prevalence plot) / `average_relative_abundance_threshold` (for abundance plot) and `prevalence_threshold` are applied after the `taxa_level` step (and the relative abundance step, for the abundance plot). This isn't filtering on the dataframe but rather filtering on the top species to show.
For the abundance plot, we can either plot the top X (`taxa_number`) species or all species above a threshold (`average_relative_abundance_threshold`). We can also specify that only species present in at least X% of the samples (`prevalence_threshold`) should be considered when selecting the top species.

Similarly, for the prevalence plot, we can either plot the top X (`taxa_number`) species or all species above a threshold (`prevalence_threshold`). We can also specify that only species with a mean > X (`mean_threshold`) should be considered when selecting the top species.
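The selection logic for the abundance plot could look roughly like the sketch below. The function name `top_taxa_by_abundance` is hypothetical, the data are made up, and `prevalence_threshold` is assumed here to be a fraction between 0 and 1 rather than a percentage; the actual plotting methods may differ.

```python
import pandas as pd

def top_taxa_by_abundance(
    rel_abundance,
    taxa_number=None,
    average_relative_abundance_threshold=None,
    prevalence_threshold=None,
):
    """Hypothetical sketch: choose which taxa to show, without altering the data."""
    candidates = rel_abundance
    if prevalence_threshold is not None:
        # Fraction of samples in which each taxon is present (abundance > 0).
        prevalence = (candidates > 0).mean(axis=1)
        candidates = candidates[prevalence >= prevalence_threshold]
    # Rank the remaining taxa by mean relative abundance across samples.
    ranked = candidates.mean(axis=1).sort_values(ascending=False)
    if average_relative_abundance_threshold is not None:
        ranked = ranked[ranked > average_relative_abundance_threshold]
    if taxa_number is not None:
        ranked = ranked.head(taxa_number)
    return list(ranked.index)

# Toy relative abundance table (made-up values, each sample sums to 1).
rel = pd.DataFrame(
    {
        "s1": [0.6, 0.3, 0.1, 0.0],
        "s2": [0.5, 0.0, 0.1, 0.4],
        "s3": [0.7, 0.0, 0.1, 0.2],
        "s4": [0.4, 0.0, 0.2, 0.4],
    },
    index=["sp_A", "sp_B", "sp_C", "sp_D"],
)

top2 = top_taxa_by_abundance(rel, taxa_number=2)
above = top_taxa_by_abundance(rel, average_relative_abundance_threshold=0.1)
top2_everywhere = top_taxa_by_abundance(rel, taxa_number=2, prevalence_threshold=1.0)
print(top2, above, top2_everywhere)
```

The dataframe passed to the plot is left untouched; only the list of taxa to display changes. A prevalence-plot variant would be symmetric: rank by prevalence, with `mean_threshold` as the optional pre-condition.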
Description
* `higher_classification` argument instead of adding "-only" at the end of the `taxa_level` argument
* `_get_dataframe()` method
* (`plot_most_abundant()` and `plot_sample_composition_most_abundant_taxa()`)