sbslee / dokdo

A Python package for microbiome sequencing analysis with QIIME 2
https://dokdo.readthedocs.io
MIT License
42 stars 12 forks source link

Select samples for PCoA #49

Closed rupesh-sinha closed 1 year ago

rupesh-sinha commented 1 year ago

Hi sbslee, I was using dokdo for plotting PCoA as below:

import dokdo
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

qza_file = '/home/rupesh/rupesh/qiime2/new_qiime/uchime-dn-out_wdout_clustering/table-nc-wobl-nmcl.qza'
metadata_file = '/home/rupesh/rupesh/qiime2/new_qiime/sample_metada.tsv'

pcoa_results = dokdo.ordinate(qza_file)

dokdo.beta_2d_plot(
    pcoa_results,
    hue='matrix',
    metadata=metadata_file,
    figsize=(8, 8)
)

plt.tight_layout()

which is working fine. But, I was wondering if we can have function like 'where' or 'include_samples' or 'exclude_samples' or something like that to control the samples to be presented on the plot.

Thanks

sbslee commented 1 year ago

@rupesh-sinha,

Thank you for the feature request!

The reason I didn't include options like where to the beta_2d_plot function is because I was afraid the option might be abused by users, both intentionally and unintentionally. As you may already know, when presenting a PCoA plot you are supposed to show all samples that were used to calculate the original distance matrix. If a given PCoA plot is showing only a subset of samples, then it should not be used to make any definitive conclusions. For example, when you try to subset some samples in an emperor QZV file using QIIME 2 View, it will still do it for you but with a warning: "WARNING: hiding samples in an ordination can be misleading". Below you will see that I removed "left palm" samples (blue). The warning is displayed on the top left corner.

Screen Shot 2022-12-01 at 3 45 37 PM

Therefore, if the goal is to see how PCoA looks in the absence of certain samples, then you should calculate a new distance matrix and plot it again. This can be easily done within Dokdo and an example of this is illustrated in the ordinate function's documentation (compare the first and second figures). I will show the code example below:

import dokdo
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

qza_file = '/Users/sbslee/Desktop/dokdo/data/moving-pictures-tutorial/table.qza'
metadata_file = '/Users/sbslee/Desktop/dokdo/data/moving-pictures-tutorial/sample-metadata.tsv'

pcoa_results = dokdo.ordinate(qza_file)

dokdo.beta_2d_plot(
    pcoa_results,
    metadata=metadata_file,
    hue='body-site',
    figsize=(8, 8)
)

plt.tight_layout()

ordinate-1

from qiime2 import Metadata

mf = dokdo.get_mf(metadata_file)
mf = mf[mf['body-site'].isin(['gut', 'left palm'])]

pcoa_results = dokdo.ordinate(qza_file, metadata=Metadata(mf))

dokdo.beta_2d_plot(
    pcoa_results,
    metadata=metadata_file,
    hue='body-site',
    figsize=(8, 8)
)

plt.tight_layout()

ordinate-2

I think there could still be an argument to be made in the favor of adding options like where to the beta_2d_plot function. For example, the user may want to explore the PCoA space in greater depth. However, if that's the case the user should use QIIME 2 View because that's why the platform exists in the first place.

Hope this helps!

rupesh-sinha commented 1 year ago

Thanks @sbslee

This works perfectly. I understand the concern for hiding some samples and was looking for something like this to recompute the distance matrix for the selected samples. Thanks for your excellent support with dokdo!

When I recompute using the function you suggested in the second part of the script I get a plot with warning message like this: /home/rupesh/anaconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/sklearn/metrics/pairwise.py:1776: DataConversionWarning: Data was converted to boolean for metric jaccard warnings.warn(msg, DataConversionWarning)

What does this actually mean? Again thanks for your support.

sbslee commented 1 year ago

@rupesh-sinha,

Glad to know that Dokdo is helpful to you!

As for the warning, you don't have to worry about it. It simply means that numeric data was converted to boolean (e.g. 442 --> True, and 0 --> False). This conversion was needed because you are using the Jaccard distance as distance metrics, which computes distance between samples based on the presence/absence of microbiome instead of using their abundance. Hope this helps.