sbslee / dokdo

A Python package for microbiome sequencing analysis with QIIME 2
https://dokdo.readthedocs.io
MIT License
42 stars 12 forks source link

how to make heatmap with text file #22

Closed khemlalnirmalkar closed 3 years ago

khemlalnirmalkar commented 3 years ago

Hi @sbslee , Is it possible to make a heatmap using a panda dataframe (not qza)? i am trying to replicate the same style like bar plot, splitted categories with and without dendrogram. Please can you help?

like this image image

Thanks, Khem

sbslee commented 3 years ago

@khemlalnirmalkar, thanks for the question.

First of all, I just updated the dokdo.heatmap method in the 1.11.0-dev branch to accept pandas.DataFrame (click here to see the documentation):

import dokdo
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
csv_file = '/Users/sbslee/Desktop/dokdo/data/moving-pictures-tutorial/table.csv'
metadata_file = '/Users/sbslee/Desktop/dokdo/data/moving-pictures-tutorial/sample-metadata.tsv'
df = pd.read_csv(csv_file, index_col=0)
dokdo.heatmap(df,
              metadata=metadata_file,
              normalize='clr',
              hue1='body-site')
plt.savefig('heatmap-1.png')

Example CSV file for creating input pandas.DataFrame: table.csv

heatmap-1

Next, if you want to group the samples but do not want to cluster them, you can set row_cluster=False:

dokdo.heatmap(df,
              metadata=metadata_file,
              normalize='clr',
              hue1='body-site',
              row_cluster=False)
plt.savefig('heatmap-2.png')

heatmap-2

I understand these are not exactly what you wanted, but I hope this is a good starting point. What do you think?

khemlalnirmalkar commented 3 years ago

@sbslee , Yes, this is great. Thanks for updating the code. I hope we can rotate the plot too, like keeping the taxa on the right top and samples on the bottom. Heatmap, ComplexHeatmap and some other R & py packages allow annotation and splitting of the heatmap. I like dokdo and It would be great if Dokdo can allow similar changes, Thanks again, Khem

sbslee commented 3 years ago

@khemlalnirmalkar,

If it's something that's not too difficult to implement, I'd be more than happy to make updates do Dokdo. So please keep letting me know :) That being said, I updated the dokdo.heatmap method to allow flipping of the x and y axes:

import dokdo
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
csv_file = '/Users/sbslee/Desktop/dokdo/data/moving-pictures-tutorial/table.csv'
metadata_file = '/Users/sbslee/Desktop/dokdo/data/moving-pictures-tutorial/sample-metadata.tsv'
df = pd.read_csv(csv_file, index_col=0)
dokdo.heatmap(df,
              metadata=metadata_file,
              flip=True,
              normalize='clr')
plt.savefig('heatmap-3.png')

heatmap-3

Next up is splitting a heatmap by the sample group. This one is trickier because each sample group will have a different clustering pattern because their microbiome composition will be different. In the example you provided, there are only 30 taxa labeled, which indicates to me that the clustering was performed earlier and you are just using the same order of those 30 taxa (I'm not sure how you got the exact order) for all four sample groups. Can you give me a link to the method's documentation so I can read about it? Or explain to me how you generated the example?

khemlalnirmalkar commented 3 years ago

Hi @sbslee My apologies totally distracted by some other work. The abundance file I sent you, was generated through Kraken2 https://github.com/DerrickWood/kraken2/wiki/Manual As I know, there was no clustering method was applied. I just wanted to make a heatmap with my own order. Thanks

sbslee commented 3 years ago

@khemlalnirmalkar,

Good news! Once I realized that some users like yourself don't need (or want) clustering of their samples and/or taxa, I decided to rename the heatmap method to clustermap (documentation) and devise a new method called heatmap (documentation). This is implemented in the 1.11.0-dev branch.

Below is a simple example:

import dokdo
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
qza_file = '/Users/sbslee/Desktop/dokdo/data/moving-pictures-tutorial/table-l3.qza'
dokdo.heatmap(qza_file,
              normalize='log10',
              flip=True,
              figsize=(10, 7))
plt.tight_layout()

heatmap-1

We can display a heatmap for each sample group:

fig, [ax1, ax2, ax3, ax4, ax5] = plt.subplots(1, 5,
                                              figsize=(20, 8),
                                              gridspec_kw={'width_ratios': [1, 1, 1, 1, 0.1]})

qza_file = '/Users/sbslee/Desktop/dokdo/data/moving-pictures-tutorial/table-l3.qza'

gut_samples = ['L1S8', 'L1S57', 'L1S76', 'L1S105', 'L1S140', 'L1S208', 'L1S257', 'L1S281'] 
left_palm_samples = ['L2S155', 'L2S175', 'L2S204', 'L2S222', 'L2S240', 'L2S309', 'L2S357', 'L2S382']
right_palm_samples = ['L3S242', 'L3S294', 'L3S313', 'L3S341', 'L3S360', 'L3S378', 'L4S63', 'L4S112', 'L4S137']
tongue_sampels = ['L5S104', 'L5S155', 'L5S174', 'L5S203', 'L5S222', 'L5S240', 'L6S20', 'L6S68', 'L6S93']

kwargs = dict(normalize='log10',
              flip=True,
              linewidths=0.5,
              xticklabels=True)

dokdo.heatmap(qza_file, ax=ax1, samples=gut_samples, cbar=False, **kwargs)
dokdo.heatmap(qza_file, ax=ax2, samples=left_palm_samples, yticklabels=False, cbar=False, **kwargs)
dokdo.heatmap(qza_file, ax=ax3, samples=right_palm_samples, yticklabels=False, cbar=False, **kwargs)
dokdo.heatmap(qza_file, ax=ax4, samples=tongue_sampels, yticklabels=False, cbar_ax=ax5, **kwargs)

ax1.set_title('Gut')
ax2.set_title('Left palm')
ax3.set_title('Right palm')
ax4.set_title('Toungue')

plt.tight_layout()

heatmap-2

Let me know what you think. I will probably make some additional updates to this method before releasing the official 1.11.0 version.

khemlalnirmalkar commented 3 years ago

@sbslee, This is great, thank you so much. Quick question, can we group the samples based on category (like bar plot has an option "group order") instead of using each sample ID? can we normalize the abundance by z-score too? Once again, I appreciate your constant help,

Khem

sbslee commented 3 years ago

@khemlalnirmalkar,

Those are great questions. I updated the heatmap method to 1) support the Z score normalization and 2) accept the metadata and where arguments:

import dokdo
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

qza_file = '/Users/sbslee/Desktop/dokdo/data/moving-pictures-tutorial/table-l3.qza'
metadata_file = '/Users/sbslee/Desktop/dokdo/data/moving-pictures-tutorial/sample-metadata.tsv'

fig, [ax1, ax2, ax3, ax4, ax5] = plt.subplots(1, 5,
                                              figsize=(20, 8),
                                              gridspec_kw={'width_ratios': [1, 1, 1, 1, 0.1]})

kwargs = dict(normalize='zscore',
              flip=True,
              linewidths=0.5,
              metadata=metadata_file,
              xticklabels=True)

qza_file = '/Users/sbslee/Desktop/dokdo/data/moving-pictures-tutorial/table-l3.qza'

dokdo.heatmap(qza_file, ax=ax1, where="[body-site] IN ('gut')", cbar=False, yticklabels=True, **kwargs)
dokdo.heatmap(qza_file, ax=ax2, where="[body-site] IN ('left palm')", yticklabels=False, cbar=False, **kwargs)
dokdo.heatmap(qza_file, ax=ax3, where="[body-site] IN ('right palm')", yticklabels=False, cbar=False, **kwargs)
dokdo.heatmap(qza_file, ax=ax4, where="[body-site] IN ('tongue')", yticklabels=False, cbar_ax=ax5, **kwargs)

ax1.set_title('Gut')
ax2.set_title('Left palm')
ax3.set_title('Right palm')
ax4.set_title('Toungue')

plt.tight_layout()
plt.savefig('heatmap-updated.png')

heatmap-updated

P.S. AFAIK, microbiome relative abundance data will give best results when normalized with log-based transformation such as log10 and CLR. This is because those data often contain lots of zeros and also extremely large values; therefore, the Z score transformation, which is linear and not log, may not be suitable for this kind of data. Please use this option carefully.

khemlalnirmalkar commented 3 years ago

@sbslee Thanks for the quick action. Agree, log10 or CLR would be better for microbial abundance. I was thinking to use Z score for metabolites data :). Thank you so much for everything, Khem

khemlalnirmalkar commented 3 years ago

@sbslee, I have a question on heatmap code When i am splitting the heatmap with categories, i am losing the sample order, see the attached image. first, start with MS117 but when split, it changes. for paired samples, its important to keep the order. how can i fix it? image

image

Dokdo needs "s__" for taxa, but is it possible to avoid it? i am thinking to use for metabolites too.

Thanks,

sbslee commented 3 years ago

@khemlalnirmalkar,

  1. As for the s__ prefix, you don't have to worry about that for the heatmap and clustermap methods. That concern is only for the taxa_abundance_bar_plot method. For example, if you look at the very first plot I posted in this thread, you will see that I provided a CSV file with taxa names that do not have any prefix.

  2. If it's truly important to keep the exact order of samples within each group level, you should definitely use the samples argument instead of the where argument. The latter uses QIIME 2 API which returns filtered sample IDs in a set object which erases the original order of samples. Assuming your input data is pandas.DataFrame and your samples (rows) are already perfectly ordered, you can simply do something like:

dokdo.heatmap(df, ax=ax1, ..., samples=df.index[0:5])
dokdo.heatmap(df, ax=ax2 ..., samples=df.index[0:2])
dokdo.heatmap(df, ax=ax3 ..., samples=df.index[0:18])
...

Hope this helps. Let me know if you need more help.

sbslee commented 3 years ago

@khemlalnirmalkar,

I thought about this more, and I think the following might help in your case. Basically, I added the sort_samples argument in the heatmap method:

import dokdo
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
qza_file = '/Users/sbslee/Desktop/dokdo/data/moving-pictures-tutorial/table-l3.qza'
dokdo.heatmap(qza_file,
              normalize='log10',
              flip=True,
              figsize=(10, 7))
plt.tight_layout()
plt.savefig('heatmap-1.png')

heatmap-1

metadata_file = '/Users/sbslee/Desktop/dokdo/data/moving-pictures-tutorial/sample-metadata.tsv'

fig, [ax1, ax2, ax3, ax4, ax5] = plt.subplots(1, 5,
                                              figsize=(20, 8),
                                              gridspec_kw={'width_ratios': [1, 1, 1, 1, 0.1]})

kwargs = dict(normalize='log10',
              flip=True,
              linewidths=0.5,
              sort_samples=True,
              metadata=metadata_file,
              xticklabels=True)

qza_file = '/Users/sbslee/Desktop/dokdo/data/moving-pictures-tutorial/table-l3.qza'

dokdo.heatmap(qza_file, ax=ax1, where="[body-site] IN ('gut')", cbar=False, yticklabels=True, **kwargs)
dokdo.heatmap(qza_file, ax=ax2, where="[body-site] IN ('left palm')", yticklabels=False, cbar=False, **kwargs)
dokdo.heatmap(qza_file, ax=ax3, where="[body-site] IN ('right palm')", yticklabels=False, cbar=False, **kwargs)
dokdo.heatmap(qza_file, ax=ax4, where="[body-site] IN ('tongue')", yticklabels=False, cbar_ax=ax5, **kwargs)

ax1.set_title('Gut')
ax2.set_title('Left palm')
ax3.set_title('Right palm')
ax4.set_title('Toungue')

plt.tight_layout()
plt.savefig('heatmap-2.png')

heatmap-2

Let me know what you think.

khemlalnirmalkar commented 3 years ago

@sbslee Awesome, Thank you so much for the update :) I am sorry for keep asking, now i need one more suggestion, If you see this image, for Prevotella, the abundance level is very low. if i want to avoid making a separate heatmap for Prevotella, Is it possible to stretch more the scale from 0-0.1? or increase the color range for scale 0-0.1.

image

Thanks again,

sbslee commented 3 years ago

@khemlalnirmalkar,

No worries, you are more than welcome to ask as many questions as you want!

Generally speaking, the dokdo.heatmap method uses the seaborn.heatmap method, and you can provide vmin and vmax arguments to both methods to play with scaling of the colormap. Those arguments weren't explicitly exposed to the dokdo.heatmap method before (you were still able to provide them as **kwargs), but I just updated so that they are now.

import dokdo
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
qza_file = '/Users/sbslee/Desktop/dokdo/data/moving-pictures-tutorial/table-l3.qza'

fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(15, 7))

dokdo.heatmap(qza_file,
              ax=ax1,
              normalize='log10',
              flip=True,
              pretty_taxa=True,
              figsize=(10, 7))

dokdo.heatmap(qza_file,
              ax=ax2,
              normalize='log10',
              flip=True,
              pretty_taxa=True,
              vmin=-2,
              figsize=(10, 7))

plt.tight_layout()
plt.savefig('heatmap-1.png')

heatmap-1

That being said, I'm getting the impression that you want to adjust the coloring scale just for selected species (e.g. Prevotella species), and if that's the case, then it's not possible with the dokdo.heatmap method. In the heatmap, once you apply the normalization (e.g. log10 or CLR) then everything becomes relative to one another.

You could selectively perform some sort of scaling in your input pandas.DataFrame and then use the normalize=None setting for the heatmap, but I don't recommend that either because I think it could be misleading to the readers.

I think the best bet is drawing another heatmap for Prevotella species, as you mentioned, but I personally find your plot above perfectly acceptable, so I wouldn't even bother drawing another plot. But that's just my take! Feel free to ask any follow-up questions.

khemlalnirmalkar commented 3 years ago

@sbslee earlier, I did try with vmin and vmix but was not helpful, I will give it a try with a separate plot and will see, how it looks. Otherwise just one plot. Thanks for your help,

sbslee commented 3 years ago

@khemlalnirmalkar,

Sounds good. One thing to note is that if you are going to plot a separate figure just for Prevotella species, you may have to create a pandas.DataFrame containing only those species. Otherwise, your normalization will still be affected by the presence of other extremely high or low species. Hope it works.

khemlalnirmalkar commented 3 years ago

@sbslee Agreed..yes, the plan is use panda.dataframe only for Prevotella sps. Thanks for the suggestion :)