motleystate / moonstone

Library to perform Metagenomics data analysis with Python
https://moonstone.readthedocs.io/en/latest/?badge=latest
MIT License

Refactoring and bug fixing in PlotTaxonomyCounts #87

Closed AgnesBaud closed 2 years ago

AgnesBaud commented 2 years ago

Description

AgnesBaud commented 2 years ago

_get_dataframe() takes 3 arguments:

In what order should these steps be applied? Should the count_threshold and mean_taxa "filtering" be done at the lowest taxonomic level (usually species) or at the taxonomic level of interest?

Also, in the cases of plot_most_abundant() and plot_sample_composition_most_abundant_taxa(), where we are dealing with relative abundance, should we compute the relative abundance before or after the "filtering" steps? ➔ With Kraken2, we don't have an "Unknown" row: only reads that have been classified are in the count dataframe, so relative abundance is already computed with respect to classified reads only. Here, if we remove species that have a low mean among samples because they could be false positives, their reads effectively become unclassified and shouldn't be included in the relative abundance dataframe. That is why I would compute the relative abundance on the "filtered" dataframe.
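To make the difference concrete, here is a minimal pandas sketch on a made-up count table. The taxa names, counts and the `mean_taxa` value are invented for illustration, and this is not moonstone's implementation. It computes relative abundance once on the raw table and once after dropping low-mean taxa.

```python
import pandas as pd

# Hypothetical count table: rows are species, columns are samples.
counts = pd.DataFrame(
    {"sample_1": [90, 8, 2], "sample_2": [70, 30, 0]},
    index=["species_A", "species_B", "species_C"],
)

mean_taxa = 5  # assumed threshold, chosen only for this example

# Relative abundance (in %) computed on the unfiltered table.
rel_before = counts / counts.sum() * 100

# Drop taxa whose mean count across samples is below the threshold,
# then compute relative abundance on what remains.
filtered = counts[counts.mean(axis=1) >= mean_taxa]
rel_after = filtered / filtered.sum() * 100
```

In `rel_after`, each sample column sums to 100 again, so the reads of the removed species no longer contribute to the denominator.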

$ head ~/Git_repositories/GitLab/inspire/sunbeam/20210624_kraken/sunbeam_output/qc/decontam/read_counts.tsv
A4138_0001_EDME192005896-1a_HYG57DSXX_L2_1.fastq.gz 3002970
A4138_0002_EDME192005897-1a_HYG57DSXX_L2_1.fastq.gz 651252
(...)

```
kraken2_counts.sum()[['A4138_0001', 'A4138_0002']]
```
A4138_0001    2739734.0
A4138_0002     627552.0
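
Putting the two outputs side by side, the sketch below divides the Kraken2 column sums by the read counts from read_counts.tsv, using only the numbers quoted above. The variable names are illustrative, and it assumes both tables count reads in the same way (for example, not reads in one and read pairs in the other).

```python
import pandas as pd

# Read counts copied from read_counts.tsv above.
reads_after_qc = pd.Series({"A4138_0001": 3002970, "A4138_0002": 651252})

# Column sums of the Kraken2 count table, copied from the output above.
kraken2_classified = pd.Series({"A4138_0001": 2739734.0, "A4138_0002": 627552.0})

# Fraction of the reads that are present in the Kraken2 count table.
fraction_classified = kraken2_classified / reads_after_qc
print(fraction_classified.round(3))
```

Under that assumption, roughly 91% and 96% of the reads are in the count table for the two samples, which is consistent with unclassified reads being absent from it.
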
AgnesBaud commented 2 years ago

_get_dataframe() takes 3 arguments:

* `taxa_level`: Taxonomy level ➔ rows belonging to the same taxon are summed up; if taxa_level is equal to the lowest taxonomic level, it only drops the higher taxonomic levels; the MultiIndex becomes a (single) Index

* `count_threshold`: Threshold of counts for the species to be considered in the sample ➔ entries with a count < count_threshold are set to 0

* `mean_taxa`: Mean threshold for a species to be kept for analysis ➔ rows with a mean among samples < mean_taxa are removed

In what order should these steps be applied? Should the count_threshold and mean_taxa "filtering" be done at the lowest taxonomic level (usually species) or at the taxonomic level of interest?

I'd say: count_threshold ➔ mean_taxa ➔ taxa_level ➔ relative abundance
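
The proposed order can be sketched as follows. This is a hypothetical helper written for this discussion (`get_dataframe_sketch` and the example data are invented), not the actual `_get_dataframe()` code; it assumes the count table has a MultiIndex whose levels are named after taxonomic ranks.

```python
import pandas as pd


def get_dataframe_sketch(counts, taxa_level, count_threshold=None, mean_taxa=None):
    """Hypothetical helper: count_threshold -> mean_taxa -> taxa_level -> relative abundance."""
    df = counts.copy()
    if count_threshold is not None:
        # Counts below the threshold are set to 0 (applied at the lowest level).
        df = df.where(df >= count_threshold, 0)
    if mean_taxa is not None:
        # Rows whose mean among samples is below mean_taxa are removed.
        df = df[df.mean(axis=1) >= mean_taxa]
    # Rows of the same taxon are summed; the MultiIndex becomes a single Index.
    df = df.groupby(level=taxa_level).sum()
    # Relative abundance (in %) within each sample.
    return df / df.sum() * 100


# Made-up example data with two taxonomic levels.
index = pd.MultiIndex.from_tuples(
    [
        ("Lactobacillus", "Lactobacillus crispatus"),
        ("Lactobacillus", "Lactobacillus iners"),
        ("Gardnerella", "Gardnerella vaginalis"),
    ],
    names=["genus", "species"],
)
counts = pd.DataFrame({"s1": [50, 30, 20], "s2": [1, 59, 40]}, index=index)

genus_rel = get_dataframe_sketch(counts, "genus", count_threshold=2, mean_taxa=10)
```

With this order, both thresholds act on species-level rows before they are summed at the genus level. Note that a sample whose counts are all 0 after filtering would produce NaN in the last step.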

AgnesBaud commented 2 years ago

This is a plotting method, so filtering should be done outside of this method ➔ removing count_threshold. mean_threshold (for the prevalence plot) / average_relative_abundance_threshold (for the abundance plot) and prevalence_threshold are applied after the taxa_level step (and after the relative abundance step for the abundance plot). They don't filter the dataframe; they select which top species to show.

For the abundance plot, we can either plot the top X (taxa_number) species or all species above a threshold (average_relative_abundance_threshold). We can also specify that only species present in at least X% of the samples (prevalence_threshold) should be considered when selecting the top species.
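
A minimal sketch of that selection for the abundance plot, assuming a relative abundance table with taxa as rows and samples as columns. The function name `top_abundant_taxa` and the data are hypothetical; only the parameter names come from the discussion above.

```python
import pandas as pd


def top_abundant_taxa(rel_abundance, taxa_number=None,
                      average_relative_abundance_threshold=None,
                      prevalence_threshold=None):
    """Hypothetical helper: choose which taxa to show, without altering the data."""
    candidates = rel_abundance
    if prevalence_threshold is not None:
        # Percentage of samples in which each taxon is present (value > 0).
        prevalence = (candidates > 0).mean(axis=1) * 100
        candidates = candidates[prevalence >= prevalence_threshold]
    # Rank the remaining taxa by mean relative abundance across samples.
    mean_abundance = candidates.mean(axis=1).sort_values(ascending=False)
    if average_relative_abundance_threshold is not None:
        mean_abundance = mean_abundance[mean_abundance > average_relative_abundance_threshold]
    if taxa_number is not None:
        mean_abundance = mean_abundance.head(taxa_number)
    return list(mean_abundance.index)


# Made-up relative abundances (each sample column sums to 100).
rel = pd.DataFrame(
    {"s1": [60, 30, 10, 0], "s2": [50, 40, 10, 0], "s3": [20, 10, 0, 70]},
    index=["A", "B", "C", "D"],
)

top3 = top_abundant_taxa(rel, taxa_number=3)
top3_prevalent = top_abundant_taxa(rel, taxa_number=3, prevalence_threshold=50)
```

Here taxon D is abundant in a single sample only, so it makes the top 3 by mean abundance but is replaced by C once a 50% prevalence requirement is added.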

Similarly, for the prevalence plot, we can either plot the top X (taxa_number) species or all species above a threshold (prevalence_threshold). We can also specify that only species with a mean > X (mean_threshold) should be considered when selecting the top species.
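
The prevalence-plot counterpart could look like the sketch below. Again, `top_prevalent_taxa` and the count table are invented for illustration, and treating "present" as a count > 0 and the prevalence threshold as inclusive are assumptions.

```python
import pandas as pd


def top_prevalent_taxa(counts, taxa_number=None, prevalence_threshold=None,
                       mean_threshold=None):
    """Hypothetical helper: rank taxa by prevalence (% of samples with a count > 0)."""
    candidates = counts
    if mean_threshold is not None:
        # Only taxa with a mean count among samples > mean_threshold are considered.
        candidates = candidates[candidates.mean(axis=1) > mean_threshold]
    prevalence = ((candidates > 0).mean(axis=1) * 100).sort_values(ascending=False)
    if prevalence_threshold is not None:
        prevalence = prevalence[prevalence >= prevalence_threshold]
    if taxa_number is not None:
        prevalence = prevalence.head(taxa_number)
    return prevalence


# Made-up counts: taxa as rows, four samples as columns.
counts = pd.DataFrame(
    [[10, 20, 30, 5], [0, 1, 1, 1], [100, 0, 0, 50], [0, 0, 0, 8]],
    index=["A", "B", "C", "D"],
    columns=["s1", "s2", "s3", "s4"],
)

top2 = top_prevalent_taxa(counts, taxa_number=2)
top2_min_mean = top_prevalent_taxa(counts, taxa_number=2, mean_threshold=1)
```

Taxon B is present in 3 of 4 samples but always with a count of 1, so requiring a mean > 1 removes it from the candidates and C takes its place.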