sbslee / dokdo

A Python package for microbiome sequencing analysis with QIIME 2
https://dokdo.readthedocs.io
MIT License
43 stars 12 forks source link

correlation analysis #37

Closed khemlalnirmalkar closed 2 years ago

khemlalnirmalkar commented 2 years ago

Hi @sbslee, Does correlation (spearman/Pearson) analysis with FDR-correction fall under Dokdo's scope of work? If yes, is it possible to have a heatmap with correlation values for many factors/features or some scatter plots for individual features?

Thanks Khem

sbslee commented 2 years ago

Hi @khemlalnirmalkar,

Thanks for the suggestion! If it's a simple analysis that can be readily implemented, I'm glad to update Dokdo to support it.

That being said, to be honest, I'm still not 100% sure if I understand the type of analysis you're talking about ("correlation (spearman/Pearson) analysis with FDR-correction"). Could you please give me some concrete examples (e.g. figures from a published research article, cartoon drawings of yours)?

P.S. If you are a programmer yourself, you are more than welcome to open a PR :)

khemlalnirmalkar commented 2 years ago

Hi @sbslee, Thanks, This is what I use in R for correlation with FDR/p-correction https://microbiome.github.io/tutorials/Heatmap.html Scatter plot for correlations (see fig.2) https://journals.asm.org/doi/pdf/10.1128/Spectrum.00034-21 (Fig. 7 & 8) https://rupress.org/jem/article-pdf/218/1/e20200606/1405371/jem_20200606.pdf and attached a cartoon FDR-correction for p-values https://en.wikipedia.org/wiki/False_discovery_rate (a common method to correct false positive p-values)

Sorry, i am a microbiologist, not a programmer. I just to implement from available tutorials. Thought this could be a good thing to have in Dokdo, Please let me know if you need something Thanks

image

sbslee commented 2 years ago

@khemlalnirmalkar,

Sorry, i am a microbiologist, not a programmer.

No worries at all. Your suggestions are already greatly appreciated.

As I understand it, it appears you want to perform cross-correlation analysis between microbiome vs. some matrices for shared samples, similar to the tutorial given by Leo Lahti and Sudarshan Shetty et al. So I looked at the tutorial and devised two methods: cross_association_table and cross_association_heatmap (see their documentations here and here).

Below are examples:

>>> import pandas as pd
>>> import dokdo
>>> otu = pd.read_csv('/Users/sbslee/Desktop/dokdo/data/miscellaneous/otu.csv', index_col=0)
>>> lipids = pd.read_csv('/Users/sbslee/Desktop/dokdo/data/miscellaneous/lipids.csv', index_col=0)
>>> df = dokdo.cross_association_table(
...     otu, lipids, normalize='log10', nsig=1
... )
>>> df.head(10)
                             taxon      target      corr          pval      adjp
0      Ruminococcus gnavus et rel.  TG(54:5).2  0.716496  4.516954e-08  0.002284
1         Uncultured Bacteroidetes  TG(56:2).1 -0.698738  1.330755e-07  0.002345
2                    Moraxellaceae   PC(40:3e) -0.694186  1.733720e-07  0.002345
3      Ruminococcus gnavus et rel.    TG(50:4)  0.691191  2.058030e-07  0.002345
4  Lactobacillus plantarum et rel.    PC(40:3) -0.687798  2.493210e-07  0.002345
5      Ruminococcus gnavus et rel.  TG(54:6).1  0.683580  3.153275e-07  0.002345
6      Ruminococcus gnavus et rel.  TG(54:4).2  0.682030  3.434292e-07  0.002345
7      Ruminococcus gnavus et rel.    TG(52:5)  0.680622  3.709485e-07  0.002345
8                     Helicobacter    PC(40:3) -0.673201  5.530595e-07  0.003108
9                    Moraxellaceae  PC(38:4).1 -0.670050  6.530463e-07  0.003302
>>> import pandas as pd
>>> import dokdo
>>> otu = pd.read_csv('/Users/sbslee/Desktop/dokdo/data/miscellaneous/otu.csv', index_col=0)
>>> lipids = pd.read_csv('/Users/sbslee/Desktop/dokdo/data/miscellaneous/lipids.csv', index_col=0)
>>> dokdo.cross_association_heatmap(
...    otu, lipids, normalize='log10', nsig=1,
...    figsize=(15, 15), cmap='vlag', marksig=True,
...    annot_kws={'fontsize': 6, 'ha': 'center', 'va': 'center'}
... )

cross_association_heatmap_1

Compare above results with that from the tutorial:

Screen Shot 2022-03-08 at 12 50 18 PM

tutorial_heatmap

If you want to give this a try, please install the development version of Dokdo:

$ git clone https://github.com/sbslee/dokdo
$ cd dokdo
$ git checkout 1.13.0-dev
$ pip install -e .

Please let me know if you have any questions/suggestions/concerns. Thanks for the feature request!

khemlalnirmalkar commented 2 years ago

@sbslee, This is great, thank you so much for making this function available, My apologies for the late response, Is it possible to pick one bacteria and make a scatter plot with p-values on it? (i had added an image earlier) e.g. for the first bacteria from above results

          taxon      target      corr          pval      adjp
0      Ruminococcus gnavus et rel.  TG(54:5).2  0.716496  4.516954e-08  0.002284

Thanks

sbslee commented 2 years ago

I added a new method cross_association_regplot that I think gives you what you want.

import pandas as pd
import dokdo
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

otu = pd.read_csv('/Users/sbslee/Desktop/dokdo/data/miscellaneous/otu.csv', index_col=0)
lipids = pd.read_csv('/Users/sbslee/Desktop/dokdo/data/miscellaneous/lipids.csv', index_col=0)

df = dokdo.cross_association_table(otu, lipids, normalize='log10', nsig=1)
df.head(10)

To output:

                             taxon      target      corr          pval      adjp
0      Ruminococcus gnavus et rel.  TG(54:5).2  0.716496  4.516954e-08  0.002284
1         Uncultured Bacteroidetes  TG(56:2).1 -0.698738  1.330755e-07  0.002345
2                    Moraxellaceae   PC(40:3e) -0.694186  1.733720e-07  0.002345
3      Ruminococcus gnavus et rel.    TG(50:4)  0.691191  2.058030e-07  0.002345
4  Lactobacillus plantarum et rel.    PC(40:3) -0.687798  2.493210e-07  0.002345
5      Ruminococcus gnavus et rel.  TG(54:6).1  0.683580  3.153275e-07  0.002345
6      Ruminococcus gnavus et rel.  TG(54:4).2  0.682030  3.434292e-07  0.002345
7      Ruminococcus gnavus et rel.    TG(52:5)  0.680622  3.709485e-07  0.002345
8                     Helicobacter    PC(40:3) -0.673201  5.530595e-07  0.003108
9                    Moraxellaceae  PC(38:4).1 -0.670050  6.530463e-07  0.003302

Next, draw a scatter plot that shows association between Ruminococcus gnavus et rel. and TG(54:5).2:

ax = dokdo.cross_association_regplot(otu, lipids, 'Ruminococcus gnavus et rel.', 'TG(54:5).2')
ax.text(200, 1.7, 'P = 0.002284\nr = 0.72')

cross_association_regplot

Let me know what you think!

khemlalnirmalkar commented 2 years ago

Hi @sbslee, This is great, thank you so much for adding the method, Do i need to install the new version or v1.13.0 will work?

sbslee commented 2 years ago

The latest official version is 1.12.0, and the new methods are implemented in the 1.13.0-dev branch. Therefore, you would need to install the development version:

$ git clone https://github.com/sbslee/dokdo
$ cd dokdo
$ git checkout 1.13.0-dev
$ pip install .
khemlalnirmalkar commented 2 years ago

The latest official version is 1.12.0, and the new methods are implemented in the 1.13.0-dev branch. Therefore, you would need to install the development version:


$ git clone https://github.com/sbslee/dokdo

$ cd dokdo

$ git checkout 1.13.0-dev

$ pip install .

Okay, thanks @sbslee

sbslee commented 2 years ago

Hi @khemlalnirmalkar,

Have you had a chance to test the new methods (cross_association_table, cross_association_heatmap and cross_association_regplot)? I'd like to publish the official release for 1.13.0 soon and I just wanted to make sure the new methods behave as they should for end users. Please let me know :)

khemlalnirmalkar commented 2 years ago

@sbslee , Yes, i tried, and it looks great to me. Sorry, i didn't mention it in my previous comment. Thanks again for adding these new functions.

sbslee commented 2 years ago

Lovely, thanks for letting me know and for this PR!