Removing redundancy between statistical_test/differential_analysis

motleystate / moonstone

Library to perform Metagenomics data analysis with Python

https://moonstone.readthedocs.io/en/latest/?badge=latest

MIT License


Removing redundancy between statistical_test/differential_analysis #91

Closed. AgnesBaud closed this issue 7 months ago

AgnesBaud commented 2 years ago

Description

Statistical tests are implemented in both statistical_test.py and differential_analysis.py; the duplication should be removed
SeriesBinning: bin into equal-size bins (in addition to equal-width)

AgnesBaud commented 2 years ago

SeriesBinning Class

compute_heterogenous_bins: compute "logish" bins (example: if maximum=123, bins=[-0.1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200])
compute_homogeneous_bins ➔ if a number of bins is given, pd.cut is enough for "equal-width" bins and pd.qcut is enough for "equal-size" bins; the only remaining use case is when you want the intervals to start at 0 (cleaner in plots)
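
To make the two points above concrete, here is a minimal sketch. The helper logish_bins is hypothetical (it is not moonstone's implementation); it only reproduces the edges from the maximum=123 example. The pd.cut and pd.qcut calls are the standard pandas functions, and the sample series is made up for illustration.

```python
import pandas as pd


def logish_bins(maximum, first_edge=-0.1):
    """Hypothetical helper: edges 1..9 per decade, stopping at the
    first edge that is >= maximum."""
    edges = [first_edge]
    step = 1
    while True:
        for k in range(1, 10):
            edge = k * step
            edges.append(edge)
            if edge >= maximum:
                return edges
        step *= 10


# Made-up read counts, only for illustration.
counts = pd.Series([0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 123])

# "logish" bins: narrow intervals for small values, wide ones for large values.
logish = pd.cut(counts, bins=logish_bins(counts.max()))

# Equal-width bins: every interval spans the same range of values.
equal_width = pd.cut(counts, bins=4)

# Equal-size bins: every interval holds (roughly) the same number of values.
equal_size = pd.qcut(counts, q=4)

print(equal_width.value_counts().sort_index())
print(equal_size.value_counts().sort_index())
```

On skewed data like this, the equal-width bins put most values in the first interval, while the equal-size bins spread them evenly.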

AgnesBaud commented 2 years ago

SeriesBinning Class

When should the nbins and cut_type (or bins_values and farleft) arguments be defined?

at instantiation?
manually declared by user?
when calling compute_binned_data or binned_data?

The filtering and normalization classes do it at instantiation (though it is also possible to change it manually later)

The difference is that the farleft argument is directly linked to the bins_values argument, since it modifies it: if farleft is True, 0.001 is subtracted from the first element of the list so that this value is included in the first bin. Changing farleft later would therefore not give the same result.
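
A small sketch of why the shift matters, assuming the binning ends up in pd.cut with its default right-closed intervals. The values and edges are made up, and the way the shift is applied here is an assumption about what farleft does, not code taken from SeriesBinning.

```python
import pandas as pd

values = pd.Series([0, 0, 3, 7, 10])
bins_values = [0, 5, 10]

# Default pd.cut intervals are (0, 5] and (5, 10]:
# values equal to the first edge fall outside and become NaN.
without_shift = pd.cut(values, bins=bins_values)

# Assumed farleft behaviour: move the first edge slightly to the left
# so that the lowest value lands in the first bin.
shifted_bins = [bins_values[0] - 0.001] + bins_values[1:]
with_shift = pd.cut(values, bins=shifted_bins)

print(without_shift.isna().sum())  # zeros are dropped without the shift
print(with_shift.isna().sum())     # no value is dropped with the shift
```

Because the shift is baked into bins_values once, toggling farleft afterwards would leave the stored edges out of sync with the flag. pd.cut also has an include_lowest=True option that has a similar effect on the first interval.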

AgnesBaud commented 2 years ago

_compute_best_bins_values method

(_compute_best_bins_values is used to build the table passed to chi2_contingency)

➔ The cut_type argument (in SeriesBinning) is customizable in this method
➔ With more than one series, should we look at all series at once?

No problem in looking at all series for equal-width bins

For equal-size bins, treating all series as one big series might result in some cells of the contingency table being < 5 for some series (rows)

For Chi2 contingency, is it important to have the same bins for every comparison? ➔ I don't think so
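
The equal-size concern can be reproduced with a toy example. The two series below are made up and deliberately do not overlap, and the variable names are hypothetical; only pd.qcut, pd.concat and pd.crosstab are real pandas calls.

```python
import pandas as pd

# Two made-up series with very different ranges.
low = pd.Series(range(0, 20))
high = pd.Series(range(30, 50))

# Pool both series and compute equal-size bins on the pooled values.
pooled = pd.concat([low, high], ignore_index=True)
group = pd.Series(["low"] * len(low) + ["high"] * len(high), name="group")
pooled_bins = pd.qcut(pooled, q=4)

# Rows are the series, columns are the pooled bins.
table = pd.crosstab(group, pooled_bins)
print(table)

# Binning each series on its own keeps every bin equally filled.
own_bins = pd.qcut(low, q=4)
print(own_bins.value_counts().sort_index())
```

The pooled bins are equal-size overall, yet each row has empty cells because each series only occupies half of the bins. Binning per series avoids that, at the cost of not sharing bins across comparisons.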

AgnesBaud commented 2 years ago

Chi2-related functions

_compute_best_bins_values function ➔ removed
chi2_contingency function ➔ takes series1 and series2 (etc.) as arguments, following the model of the ttest_independence and mann_whitney_u functions
all compute_contingency_table functions ➔ take categorical_series and numerical_series (etc.) OR dataframe, categorical_column, numerical_column (etc.) as arguments, which is what pd.crosstab uses to actually create the contingency table; this also allows the categorical_series to have more than 2 different values
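
As a rough illustration of the two calling conventions described above, here is a sketch built on pd.cut and pd.crosstab. The function names, the bins parameter and the sample data are assumptions made for this example; the actual moonstone signatures may differ.

```python
import pandas as pd


def compute_contingency_table(categorical_series, numerical_series, bins):
    """Hypothetical sketch: bin the numerical series, then cross-tabulate
    it against the categorical series."""
    binned = pd.cut(numerical_series, bins=bins)
    return pd.crosstab(categorical_series, binned)


def compute_contingency_table_from_df(dataframe, categorical_column,
                                      numerical_column, bins):
    """Hypothetical dataframe-based variant delegating to the series one."""
    return compute_contingency_table(
        dataframe[categorical_column], dataframe[numerical_column], bins
    )


# Made-up data with three categories, to show that more than 2 values work.
df = pd.DataFrame({
    "group": ["A", "A", "B", "B", "C", "C"],
    "reads": [1, 2, 3, 15, 18, 20],
})

table = compute_contingency_table(df["group"], df["reads"], bins=[0, 10, 20])
table_from_df = compute_contingency_table_from_df(
    df, "group", "reads", bins=[0, 10, 20]
)
print(table)
```

A table of this shape (one row per category, one column per bin) is the kind of input that scipy.stats.chi2_contingency accepts.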