sbslee / dokdo

A Python package for microbiome sequencing analysis with QIIME 2
https://dokdo.readthedocs.io
MIT License

How to plot many species on a bar graph #53

Closed yonghyun09 closed 1 year ago

yonghyun09 commented 1 year ago

Hello sbslee,

Thank you for providing the dokdo package. I have a question regarding taxa bar plots.

I recently performed my first 16S rRNA analysis and successfully completed the QIIME 2 pipeline, and I am now trying to visualize the results.

However, unlike the simple example in the tutorial, my samples contain a large number of taxa at the genus level, and I encountered an unexpected error when plotting the QZV file as a taxa bar plot.

I suspect the error occurs because there are too many taxa to fit in the legend, and I would like to ask how to control this. I am studying matplotlib and related tools, but I am still a beginner in Python, so please bear with my lack of experience.

In addition, most microbiome papers visualize only some of the genera and group the rest as 'Others'. In my case, the genus I want to observe is Salmonella, which has a relative abundance of 0.1%, and I would like to show every genus at or above that abundance (about 70 genera). However, that is too many to fit in a taxa bar plot or heatmap, so I would appreciate your advice on this.

I have attached screenshots of my samples as viewed in QIIME 2 View, along with the errors I encountered during analysis.

Thank you very much.

[Five attached screenshots: QIIME 2 View output and error messages]

sbslee commented 1 year ago

@yonghyun09,

Thanks for using Dokdo! Could you send me the QZV file at sbstevenlee@gmail.com so I can take a look at it? Also, which version of Dokdo are you using? The latest version is 1.16.0. Also, which version of QIIME 2?

yonghyun09 commented 1 year ago

@sbslee

Thank you for your reply. My version information is below. The qzv file has been sent by e-mail.

QIIME 2 version: 2022.2
Dokdo version: 1.15.0

Since the latest conda and QIIME 2 updates, VirtualBox support has been replaced by Docker, so I am staying on the previous version.

sbslee commented 1 year ago

@yonghyun09,

Thanks so much for sending the file. The issue here is that your taxa names don't have any prefix (e.g. k__ in k__Bacteria), which Dokdo uses to distinguish taxa columns from metadata columns (e.g. Bacteria vs. washing-or-not in your file). Which reference microbiome database did you use to classify ASVs? This could happen if you did not use the Silva database. For instance, using Silva would have produced k__Bacteria instead of your Bacteria.

That being said, the taxa_abundance_bar_plot function should work even if the user did not classify their ASVs with Silva. I will try to see if there are ways to distinguish between bacteria vs. metadata columns without relying on the prefix.
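To illustrate the problem, here is a minimal sketch of two ways one might tell taxa columns from metadata columns: a rank prefix such as k__, or a ';'-delimited lineage. This is an illustration only, not Dokdo's actual implementation; the function name split_columns and the example column names are hypothetical.

```python
import re

# Rank prefixes such as "k__" or "p__" (assumed label style; not every database uses it).
RANK_PREFIX = re.compile(r"^[a-z]__")


def split_columns(columns):
    """Split column names into (taxa, metadata) using two simple heuristics."""
    taxa, metadata = [], []
    for name in columns:
        # Heuristic 1: the name starts with a rank prefix.
        # Heuristic 2: the name looks like a ';'-delimited lineage.
        if RANK_PREFIX.match(name) or ";" in name:
            taxa.append(name)
        else:
            metadata.append(name)
    return taxa, metadata


# Hypothetical column names mixing both label styles with a metadata column.
columns = [
    "k__Bacteria;p__Firmicutes",
    "Bacteria;Proteobacteria",
    "Bacteria",
    "washing-or-not",
]
taxa, metadata = split_columns(columns)
print(taxa)
print(metadata)
```

Note that a bare kingdom-level name such as Bacteria has neither a prefix nor a delimiter, so both heuristics treat it as metadata. That is exactly the ambiguity described above, and a robust fix would need extra information beyond the column name.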

Thanks for your patience.

Steven

yonghyun09 commented 1 year ago

@sbslee

Hello, thank you for your reply. I initially tried to analyze my samples with the SILVA and Greengenes databases provided by QIIME 2. However, these databases could not classify Salmonella (for reference: https://forum.qiime2.org/t/silva-138-classifier-is-not-classifying-salmonella-at-genus-level/16765).

Therefore, I used the EzBioCloud 16S database, which can classify Salmonella. I hope the issue can be resolved. Thank you for your kind help.

sbslee commented 1 year ago

@yonghyun09,

Good news! I was able to fix the issue. With the latest development version of Dokdo (1.17.0-dev), you can produce the following:

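import dokdo
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter magic; omit this line when running as a plain Python script.
%matplotlib inline
sns.set()

# Wide panel (ax1) for the bars, narrow panel (ax2) for the legend.
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 7), gridspec_kw={'width_ratios': [9, 1]})

qzv_file = 'taxa-bar-plots.qzv'

# Draw the genus-level (level=6) bar plot without a legend.
dokdo.taxa_abundance_bar_plot(
    qzv_file,
    ax=ax1,
    level=6,
    count=20,
    cmap_name='tab20',
    legend=False
)

# Draw the same plot again on ax2, only to obtain shortened legend labels.
dokdo.taxa_abundance_bar_plot(
    qzv_file,
    ax=ax2,
    level=6,
    count=20,
    cmap_name='tab20',
    legend_short=True
)

# Keep the legend entries, then wipe the duplicate bars from ax2.
handles, labels = ax2.get_legend_handles_labels()

ax2.clear()
ax2.legend(handles, labels)
ax2.axis('off')

plt.tight_layout()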

[Figure 1: genus-level taxa abundance bar plot with a separate legend panel]
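The separate legend panel does not depend on Dokdo. A minimal matplotlib-only sketch of the same idea, using made-up abundances and hypothetical genus names, might look like this. Unlike the code above, it draws the bars once and reuses their handles, since there are no labels to shorten here.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Wide panel for the bars, narrow panel reserved for the legend.
fig, (ax1, ax2) = plt.subplots(
    1, 2, figsize=(10, 7), gridspec_kw={"width_ratios": [9, 1]}
)

# Made-up relative abundances for three samples (illustration only).
samples = ["S1", "S2", "S3"]
abundances = {
    "Genus A": [0.5, 0.2, 0.3],
    "Genus B": [0.3, 0.5, 0.3],
    "Others": [0.2, 0.3, 0.4],
}

# Stack one bar segment per genus on top of the previous ones.
bottom = [0.0, 0.0, 0.0]
for genus, values in abundances.items():
    ax1.bar(samples, values, bottom=bottom, label=genus)
    bottom = [b + v for b, v in zip(bottom, values)]

# Take the legend entries from the bar panel and draw them in the empty panel.
handles, labels = ax1.get_legend_handles_labels()
ax2.legend(handles, labels, loc="center left")
ax2.axis("off")

fig.tight_layout()
```

Keeping the legend in its own axes means a long list of taxa cannot overlap the bars, and the width_ratios value controls how much room the legend gets.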

As for your second question on how to show low-abundance taxa in your dataset (e.g. Salmonella), I would use taxa_abundance_box_plot instead of taxa_abundance_bar_plot:

taxa_names = [
    'Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Salmonella',
    'Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus',
    'Bacteria;Firmicutes;Bacilli;Lactobacillales;Carnobacteriaceae;Trichococcus',
    'Bacteria;Proteobacteria;Epsilonproteobacteria;Campylobacterales;Campylobacteraceae;Arcobacter',
]

dokdo.taxa_abundance_box_plot(
    qzv_file,
    level=6,
    taxa_names=taxa_names,
    show_others=False,
    pretty_taxa=True,
    figsize=(8, 7)
)
plt.tight_layout()

[Figure 2: box plot of relative abundance for the selected genera]
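For the "keep everything above 0.1% and group the rest" idea from the original question, a rough pandas sketch is shown below. It is not a Dokdo function: the helper name collapse_rare, the abundance values, and the genus RareGenus are all made up for illustration, and it assumes a table of relative abundances with samples as rows and genera as columns.

```python
import pandas as pd

# Made-up relative abundances (rows: samples, columns: genera); each row sums to 1.
rel = pd.DataFrame(
    {
        "Lactobacillus": [0.600, 0.550],
        "Arcobacter": [0.398, 0.4485],
        "Salmonella": [0.0015, 0.0010],
        "RareGenus": [0.0005, 0.0005],
    },
    index=["sample1", "sample2"],
)


def collapse_rare(df, threshold=0.001):
    """Keep genera whose mean relative abundance is >= threshold (0.001 = 0.1%)
    and sum all remaining genera into a single 'Others' column."""
    keep = df.columns[df.mean(axis=0) >= threshold]
    out = df[keep].copy()
    out["Others"] = df.drop(columns=keep).sum(axis=1)
    return out


collapsed = collapse_rare(rel, threshold=0.001)
print(collapsed)
```

Using the mean across samples is only one possible rule; filtering on the maximum or on prevalence would keep a different set of genera, so the choice should match the question being asked.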

Please note that to reproduce the results above, you need to install the latest development version of Dokdo:

$ git clone https://github.com/sbslee/dokdo
$ cd dokdo
$ git checkout 1.17.0-dev
$ pip install .
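After installing, one way to confirm which version Python actually sees is the standard library's importlib.metadata. This is a small sketch; the helper name installed_version is hypothetical, and the exact version string reported by the development branch is not something stated here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed Dokdo version, or None if Dokdo is not installed
# in the current environment.
print(installed_version("dokdo"))
```

If this prints an older version than expected, the package was probably installed into a different conda environment than the one running the notebook.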

Let me know if you have further questions.

yonghyun09 commented 1 year ago

@sbslee

OMG, thank you so much! If possible, I'd like to give you GitHub stars like a tennis ball machine.

I will probably have more visualization questions as I go through the analysis, but the version you provided makes it easy to get started. Thank you very much.

If you don't mind, may I ask an additional question on a different topic? I rarely get the chance to ask a bioinformatics expert like you, so I would appreciate any advice.

The lab I belong to is currently setting up microbiome research and is having difficulty purchasing a computer for bioinformatics (BI) analysis.

After our first 16S rRNA sequencing run, which was outsourced to Macrogen, I ran the QIIME 2 analysis in a virtual machine on a Windows computer with 16 GB of RAM (10 GB allocated to VirtualBox). However, the DADA2 and classifier steps could not run because of insufficient performance, so those steps were completed with the cooperation of another research institute. The job that would not run on my Windows virtual machine finished in about 8 hours on that institute's iMac (8 GB of RAM).

I'm currently considering buying a personal computer for BI analysis because the lab computer setup is being delayed, and I am looking at laptops such as the ones below (the budget is around 2 million won or less).

I plan to purchase a computer with suitable specifications (running macOS or Linux), and I would like to ask whether you could recommend one of the above or another model.

As mentioned, I am leaning toward a laptop because a desktop would be inconvenient as a personal computer, and the lab desktop will be set up eventually anyway. Our laboratory does not have a server, so server-based analysis will be limited.

I would be very grateful if you could give me some advice on these difficulties.

Thank you very much.

sbslee commented 1 year ago

@yonghyun09,

Glad to hear that you find my work useful.

Quick disclaimer: Since you mentioned Macrogen, I actually work there :)

As for recommending a laptop for bioinformatics, there are so many considerations that I don't think I can give you a definitive answer. I will only say this: I have been using macOS machines as my personal computers for my entire BI career. This includes the MacBook Air (both Intel and Apple chips), MacBook Pro (both Intel and Apple chips), and Mac mini (Apple chip). I like macOS machines for BI analysis because macOS is a Unix operating system and most BI software runs on it.

Hope this helps.

yonghyun09 commented 1 year ago

@sbslee

Thank you. I think I can choose a computer based on the advice you gave.

Actually, it was nice to see in your profile that you work for Macrogen! :) I hope I can write a good thesis using the Dokdo package you provided. I will make good use of it. Thank you very much.