Refactoring and bug fixing in PlotTaxonomyCounts

AgnesBaud commented 2 years ago

Description

higher_classification argument instead of appending "-only" to the taxa_level argument
externalizing the "filtering" steps that happen at the start of the 3 plot methods into a _get_dataframe() method
adapting the code to a non-relative abundance dataframe (➔ the relative abundance dataframe is now computed in the code, for plot_most_abundant() and plot_sample_composition_most_abundant_taxa())

AgnesBaud commented 2 years ago

_get_dataframe() takes 3 arguments:

* `taxa_level`: Taxonomy level ➔ rows of the same taxon are summed up; if taxa_level is equal to the lowest taxonomic level, it only drops the higher taxonomic levels; the MultiIndex becomes a (single) Index
* `count_threshold`: Threshold of counts for a species to be considered in a sample ➔ values with a count < count_threshold are set to 0
* `mean_taxa`: Mean threshold for a species to be kept for analysis ➔ rows with a mean among samples < mean_taxa are removed
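
The three steps listed above can be sketched with pandas as follows. This is only an illustration of the descriptions in this comment, not the code of the PR: the function names, the `genus`/`species` level names and the toy counts are all made up for the example.

```python
import pandas as pd


def group_by_taxa_level(counts: pd.DataFrame, taxa_level: str) -> pd.DataFrame:
    # Sum the rows that share the same taxon at `taxa_level`.
    # The result is indexed by a single Index instead of a MultiIndex.
    return counts.groupby(level=taxa_level).sum()


def apply_count_threshold(counts: pd.DataFrame, count_threshold: int) -> pd.DataFrame:
    # Values strictly below `count_threshold` are set to 0.
    return counts.where(counts >= count_threshold, 0)


def apply_mean_threshold(counts: pd.DataFrame, mean_taxa: float) -> pd.DataFrame:
    # Rows whose mean among samples is strictly below `mean_taxa` are removed.
    return counts[counts.mean(axis=1) >= mean_taxa]


# Hypothetical count table: rows are taxa (MultiIndex), columns are samples.
index = pd.MultiIndex.from_tuples(
    [
        ("Bacteroides", "Bacteroides fragilis"),
        ("Bacteroides", "Bacteroides ovatus"),
        ("Escherichia", "Escherichia coli"),
    ],
    names=["genus", "species"],
)
counts = pd.DataFrame({"sample_1": [10, 2, 50], "sample_2": [0, 3, 40]}, index=index)

genus_counts = group_by_taxa_level(counts, "genus")
thresholded = apply_count_threshold(counts, count_threshold=3)
kept = apply_mean_threshold(counts, mean_taxa=5)
```

Calling `group_by_taxa_level(counts, "species")` on this table only drops the `genus` level, since each species appears once.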

In what order should these steps run? Should the count_threshold and mean_taxa "filtering" be done at the lowest taxonomic level (usually species) or at the taxonomic level of interest?

Also, in the cases of plot_most_abundant() and plot_sample_composition_most_abundant_taxa(), where we are dealing with relative abundance, should we compute the relative abundance before or after the "filtering" steps? ➔ With Kraken2, we don't have an "Unknown" row; only reads that have been classified are in the count dataframe, so the relative abundance is already relative to classified reads only. If we remove species that have a low mean among samples because they could be false positives, their reads effectively become unclassified and shouldn't be included in the relative abundance dataframe. That is why I would compute the relative abundance dataframe on the "filtered" dataframe.
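
To make the difference concrete, here is a small sketch comparing relative abundance computed on the filtered and on the unfiltered dataframe. The `relative_abundance` helper, the taxon names and the threshold of 1 are assumptions made up for the illustration.

```python
import pandas as pd


def relative_abundance(counts: pd.DataFrame) -> pd.DataFrame:
    # Divide each sample (column) by its total and express it in percent.
    # Assumes no sample has a total of 0.
    return counts.div(counts.sum(axis=0), axis=1) * 100


# Hypothetical counts: taxon_c has a low mean among samples (0.5).
counts = pd.DataFrame(
    {"s1": [90, 9, 1], "s2": [80, 20, 0]},
    index=["taxon_a", "taxon_b", "taxon_c"],
)

# Remove low-mean taxa first, then compute the relative abundance.
filtered = counts[counts.mean(axis=1) >= 1]
rel_filtered = relative_abundance(filtered)

# For comparison: relative abundance on the unfiltered table.
rel_unfiltered = relative_abundance(counts)
```

With the filtered table, each sample still sums to 100% and the reads of the removed taxon no longer contribute to the denominator, so `taxon_a` in `s1` goes from 90% to 90/99 of the sample.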

$ head ~/Git_repositories/GitLab/inspire/sunbeam/20210624_kraken/sunbeam_output/qc/decontam/read_counts.tsv
A4138_0001_EDME192005896-1a_HYG57DSXX_L2_1.fastq.gz 3002970
A4138_0002_EDME192005897-1a_HYG57DSXX_L2_1.fastq.gz 651252
(...)

```
kraken2_counts.sum()[['A4138_0001', 'A4138_0002']]
```
A4138_0001 2739734.0
A4138_0002 627552.0
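
As a quick check of the numbers above, the gap between the two outputs can be computed directly. This assumes that read_counts.tsv gives the number of reads passed to Kraken2 for each sample and that the column sums of the count dataframe are the classified reads; the variable names are made up for the illustration.

```python
# Values copied from the two outputs above.
reads_after_qc = {"A4138_0001": 3002970, "A4138_0002": 651252}
classified_reads = {"A4138_0001": 2739734.0, "A4138_0002": 627552.0}

unclassified = {}
classified_fraction = {}
for sample, total in reads_after_qc.items():
    # Reads that do not appear in the count dataframe at all.
    unclassified[sample] = total - classified_reads[sample]
    classified_fraction[sample] = classified_reads[sample] / total
    print(sample, unclassified[sample], round(classified_fraction[sample], 4))
```

Under those assumptions, roughly 91% and 96% of the reads are classified, and the remainder is absent from the count dataframe rather than stored in an "Unknown" row.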

AgnesBaud commented 2 years ago

I'd say: count_threshold ➔ mean_taxa ➔ taxa_level ➔ relative abundance
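
A minimal sketch of why the order matters, using the order proposed above against the alternative of grouping first. Both function names, the level names and the toy data are hypothetical and only serve the comparison.

```python
import pandas as pd


def filter_then_group(counts, taxa_level, count_threshold, mean_taxa):
    # Proposed order: count_threshold ➔ mean_taxa ➔ taxa_level ➔ relative abundance
    kept = counts.where(counts >= count_threshold, 0)
    kept = kept[kept.mean(axis=1) >= mean_taxa]
    grouped = kept.groupby(level=taxa_level).sum()
    return grouped.div(grouped.sum(axis=0), axis=1) * 100


def group_then_filter(counts, taxa_level, count_threshold, mean_taxa):
    # Alternative order: taxa_level ➔ count_threshold ➔ mean_taxa ➔ relative abundance
    grouped = counts.groupby(level=taxa_level).sum()
    kept = grouped.where(grouped >= count_threshold, 0)
    kept = kept[kept.mean(axis=1) >= mean_taxa]
    return kept.div(kept.sum(axis=0), axis=1) * 100


# Hypothetical table: two low-count species belonging to the same genus G1.
index = pd.MultiIndex.from_tuples(
    [("G1", "sp_a"), ("G1", "sp_b"), ("G2", "sp_c")],
    names=["genus", "species"],
)
counts = pd.DataFrame({"s1": [4, 4, 20], "s2": [4, 4, 30]}, index=index)

species_first = filter_then_group(counts, "genus", count_threshold=1, mean_taxa=5)
genus_first = group_then_filter(counts, "genus", count_threshold=1, mean_taxa=5)
```

Here each G1 species has a mean of 4 and is dropped when filtering happens at the species level, whereas the summed genus has a mean of 8 and survives when grouping happens first.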

AgnesBaud commented 2 years ago

This is a plotting method, so filtering should be done outside of it ➔ count_threshold is removed. mean_threshold (for the prevalence plot), average_relative_abundance_threshold (for the abundance plot) and prevalence_threshold are applied after the taxa_level step (and, for the abundance plot, after the relative abundance step). They do not filter the dataframe; they only select the top species to show.

For the abundance plot, we can either plot the top X (taxa_number) species or all species above a threshold (average_relative_abundance_threshold). We can also specify that only species present in at least X% of the samples (prevalence_threshold) should be considered when building the top species.
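
The selection for the abundance plot could look like the sketch below. The function `select_top_abundant` is hypothetical; only the parameter names come from this comment, and giving the threshold precedence over taxa_number is an assumption.

```python
import pandas as pd


def select_top_abundant(rel_abundance, taxa_number=None,
                        average_relative_abundance_threshold=None,
                        prevalence_threshold=None):
    candidates = rel_abundance
    if prevalence_threshold is not None:
        # Percentage of samples in which each taxon is present (value > 0).
        prevalence = (candidates > 0).mean(axis=1) * 100
        candidates = candidates[prevalence >= prevalence_threshold]
    mean_abundance = candidates.mean(axis=1).sort_values(ascending=False)
    if average_relative_abundance_threshold is not None:
        # All taxa above the threshold.
        mean_abundance = mean_abundance[
            mean_abundance > average_relative_abundance_threshold
        ]
    elif taxa_number is not None:
        # Top X taxa.
        mean_abundance = mean_abundance.head(taxa_number)
    return list(mean_abundance.index)


# Hypothetical relative abundance table (each sample sums to 100).
rel = pd.DataFrame(
    {"s1": [60, 0, 30, 10], "s2": [50, 45, 0, 5],
     "s3": [70, 0, 20, 10], "s4": [40, 0, 50, 10]},
    index=["taxon_a", "taxon_b", "taxon_c", "taxon_d"],
)

top_two = select_top_abundant(rel, taxa_number=2)
top_two_ubiquitous = select_top_abundant(rel, taxa_number=2, prevalence_threshold=100)
above_ten = select_top_abundant(rel, average_relative_abundance_threshold=10)
```

The dataframe itself is left untouched; the function only returns the names of the taxa to display.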

Similarly, for the prevalence plot, we can either plot the top X (taxa_number) species or all species above a threshold (prevalence_threshold). We can also specify that only species with a mean > X (mean_threshold) should be considered when building the top species.
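
The prevalence plot selection mirrors the abundance one. Again, `select_top_prevalent` is a hypothetical name, and treating the prevalence threshold as inclusive (>=) is an assumption of this sketch.

```python
import pandas as pd


def select_top_prevalent(counts, taxa_number=None,
                         prevalence_threshold=None, mean_threshold=None):
    candidates = counts
    if mean_threshold is not None:
        # Only taxa with a mean among samples > mean_threshold are considered.
        candidates = candidates[candidates.mean(axis=1) > mean_threshold]
    # Percentage of samples in which each taxon is present (count > 0).
    prevalence = ((candidates > 0).mean(axis=1) * 100).sort_values(ascending=False)
    if prevalence_threshold is not None:
        prevalence = prevalence[prevalence >= prevalence_threshold]
    elif taxa_number is not None:
        prevalence = prevalence.head(taxa_number)
    return list(prevalence.index)


# Hypothetical count table: rows are taxa, columns are samples.
counts = pd.DataFrame(
    {"s1": [10, 1, 0, 5], "s2": [20, 1, 0, 0],
     "s3": [30, 1, 100, 5], "s4": [40, 0, 0, 0]},
    index=["taxon_a", "taxon_b", "taxon_c", "taxon_d"],
)

most_prevalent = select_top_prevalent(counts, taxa_number=2)
most_prevalent_min_mean = select_top_prevalent(counts, taxa_number=2, mean_threshold=1)
at_least_half = select_top_prevalent(counts, prevalence_threshold=50)
```

In this toy table, `taxon_b` is present in 3 of 4 samples but has a mean of 0.75, so it drops out of the top species once `mean_threshold=1` is set.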

motleystate / moonstone

Refactoring and bug fixing in PlotTaxonomyCounts #87
