qiime2 / q2-metadata

BSD 3-Clause "New" or "Revised" License

new method: `explore`: plot sample metadata categories/values #14

Open nbokulich opened 6 years ago

nbokulich commented 6 years ago

Proposed Behavior

Example and idea provided by @elong0527; issue moved from q2-longitudinal:

image

x-axis = time (or another continuous metadata column). Possibly also support categorical columns?

y-axis = subject ID (e.g., to support plotting individuals that are sampled repeatedly over time). This was originally planned for q2-longitudinal but should be generalized for non-longitudinal sampling designs. Perhaps the y-axis should be an optional parameter (if True, plot a scatter plot; if False, plot a barplot?)

Points colored by group category (should accept categorical or continuous metadata, infer the type, and color-code accordingly)
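The "infer type and color-code accordingly" step could look roughly like the sketch below. It is only an illustration under stated assumptions: `infer_color_values` is a hypothetical helper name, not something that exists in q2-metadata, and it treats a column as continuous only when every non-missing value parses as a number.

```python
import pandas as pd


def infer_color_values(column: pd.Series):
    """Classify a metadata column and return values usable for coloring.

    Hypothetical helper (not part of q2-metadata). Returns
    ("continuous", floats) when every non-missing value parses as a number,
    otherwise ("categorical", integer codes with one code per category).
    """
    non_missing = column.dropna()
    as_numbers = pd.to_numeric(non_missing, errors="coerce")
    if len(non_missing) > 0 and as_numbers.notna().all():
        return "continuous", pd.to_numeric(column, errors="coerce")
    # Missing values receive the code -1.
    return "categorical", column.astype("category").cat.codes


# Numbers stored as strings are still recognized as continuous.
kind_a, colors_a = infer_color_values(pd.Series(["1", "2.5", "10"]))
# Free-text labels fall back to one integer code per category.
kind_b, colors_b = infer_color_values(pd.Series(["bd", "fd", "bd"]))
```

Continuous values could then be passed to the `c` argument of `plt.scatter` together with a colormap, while the categorical codes could index a discrete palette.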

Questions

Could we also add a parameter to change the size or shape of points based on other optional metadata category inputs?

elong0527 commented 6 years ago

Below is an implementation of the scatterplot. I am not sure which file I should save the function in, so I am keeping the code here :)

One thing I am not sure about is how QIIME 2 exports figures. @nbokulich could you help me with that? Thanks!

import pandas as pd
import qiime2
from matplotlib import pyplot as plt

# Assumption: these helpers come from q2-longitudinal, where this code was
# first drafted. Adjust the import once the function moves to q2-metadata.
from q2_longitudinal._utilities import (_load_metadata,
                                        _validate_input_columns)


def design_plot(metadata: qiime2.Metadata,
                individual_id_column: str,
                individual_time_column: str,
                individual_group_column: str,
                fig_width: int,
                fig_height: int):

    # Load and prep metadata. No feature table is passed to this function,
    # so the table-based superset check and filtering are left out.
    sample_md = _load_metadata(metadata)

    # Validate the input columns.
    # (How can I ensure that the time column is int/numeric?)
    _validate_input_columns(sample_md, individual_id_column, None, None, None)
    _validate_input_columns(sample_md, individual_time_column, None, None, None)
    _validate_input_columns(sample_md, individual_group_column, None, None, None)

    _design_plot(sample_md, individual_id_column, individual_time_column,
                 individual_group_column, fig_width, fig_height)


def _design_plot(sample_md,
                 individual_id_column,
                 individual_time_column,
                 individual_group_column,
                 fig_width,
                 fig_height):
    '''Function to create study design plot.
    sample_md: pd.DataFrame
        Sample metadata
    individual_id_column: str
        Metadata column containing IDs for individual subjects
    individual_time_column: str
        Metadata column containing sample collection time for individual subjects
    individual_group_column: str
        Metadata column containing group indicator of individual subjects
    fig_width: int
        Figure Width
    fig_height: int
        Figure Height
    '''

    # Work on a renamed copy so the columns can be accessed by fixed names.
    sample_md = sample_md.rename(columns={individual_id_column: 'id',
                                          individual_time_column: 'time',
                                          individual_group_column: 'group'})

    # Integer y position for each subject.
    sample_md["id_loc"] = sample_md["id"].astype('category').cat.codes
    # Keep for potential operation of the label
    sample_md["id_label"] = sample_md["id"]

    u_group = sample_md["group"].unique()
    n_group = len(u_group)
    # One row per subject, used to label the y-axis ticks.
    sample_md_meta = sample_md[["id", "id_loc", "id_label"]]
    sample_md_meta = sample_md_meta.drop_duplicates().reset_index(drop=True)

    plt.figure(figsize=(fig_width, fig_height))

    # One scatter call per group so each group gets its own color and
    # legend entry.
    for grp in u_group:
        _md = sample_md[sample_md.group == grp]
        plt.scatter(_md.time, _md.id_loc, label=grp)

    plt.xlabel(individual_time_column)
    plt.yticks(sample_md_meta["id_loc"], sample_md_meta["id_label"])
    plt.ylabel(individual_id_column)
    # Place the legend below the axes, with one column per group.
    plt.legend(loc=9, bbox_to_anchor=(0.5, -0.1), ncol=n_group)


# Test
sample_md_fp = "ecam_map_maturity.txt"
# pd.DataFrame.from_csv has been removed from pandas; read_csv with
# index_col=0 is its documented replacement.
sample_md = pd.read_csv(sample_md_fp, sep='\t', index_col=0)
_design_plot(sample_md, "studyid", "month", "diet_3", 6, 8)
plt.show()

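On the question of how QIIME 2 exports figures: as far as I understand, a visualizer function receives an `output_dir: str` as its first argument and is expected to write an `index.html` (plus any image files) into that directory, which the framework then packages into the visualization. The snippet below is a minimal sketch of that convention only. `save_plot_as_visualization` and the file name `design_plot.png` are hypothetical, and real plugins usually render `index.html` from a template rather than writing it by hand.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render without needing a display
from matplotlib import pyplot as plt


def save_plot_as_visualization(output_dir: str, fig) -> None:
    """Write a figure and a minimal index.html into output_dir.

    Hypothetical helper illustrating the visualizer convention.
    """
    fig.savefig(os.path.join(output_dir, "design_plot.png"),
                bbox_inches="tight")
    with open(os.path.join(output_dir, "index.html"), "w") as fh:
        fh.write('<html><body><img src="design_plot.png"></body></html>\n')


# Usage with a throwaway directory standing in for QIIME 2's output_dir.
with tempfile.TemporaryDirectory() as tmp:
    fig, ax = plt.subplots(figsize=(6, 8))
    ax.scatter([0, 1, 2], [0, 0, 1])
    save_plot_as_visualization(tmp, fig)
    written = sorted(os.listdir(tmp))
plt.close(fig)
```

With this shape, `_design_plot` would return or expose its figure instead of relying on `plt.show()`, and the registered visualizer would pass it along with `output_dir`.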
nbokulich commented 6 years ago

Thanks @elong0527! I think for now the best thing to do is to add these functions to my fork of q2-metadata. That way we can work together on this (e.g., I can review what you have put together and add a visualization template that displays the plots) before making a pull request into the main repository. @jairideout does this sound like a good plan?

@elong0527 could you please add these functions to a new file named `_explore.py` in this directory and make a pull request into my branch? Do not add the test that you wrote. We will work on tests later, after we figure out which test data, etc., we will use.

Also, now that this action is in q2-metadata instead of q2-longitudinal, we will probably want to make it usable on categorical data as well as numerical data. See the notes that I made in the first post in this thread. We should also test whether these scatter plots can still be made with categorical data on the x-axis.
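One way the categorical x-axis case might be handled is sketched below. This is an assumption-laden illustration rather than a decided design: `axis_positions` is a hypothetical helper that passes numeric columns through unchanged and maps any other column to integer codes with one labelled tick per category. The same `is_numeric_dtype` check is also one possible answer to the earlier question about confirming that the time column is numeric.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def axis_positions(column: pd.Series):
    """Return (positions, tick_positions, tick_labels) for one plot axis.

    Hypothetical helper. Numeric columns are returned as-is with no custom
    ticks (both tick values are None). Other columns are converted to
    integer codes, with one tick per category in sorted order.
    """
    if is_numeric_dtype(column):
        return column, None, None
    categorical = column.astype("category")
    labels = list(categorical.cat.categories)
    return categorical.cat.codes, list(range(len(labels))), labels


# Categorical column: codes follow the sorted category order.
cat_pos, cat_ticks, cat_labels = axis_positions(
    pd.Series(["week2", "week1", "week2"]))
# Numeric column: values are used directly.
num_pos, num_ticks, num_labels = axis_positions(pd.Series([0, 6, 12]))
```

When ticks are returned, they could be applied with `plt.xticks(tick_positions, tick_labels)`, mirroring how the subject IDs are already placed on the y-axis with `plt.yticks`. Note that `is_numeric_dtype` also treats boolean columns as numeric, which may or may not be wanted here.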

If you have any questions on how to make a pull request into my fork, etc., please just email me directly.

jairideout commented 6 years ago

> @jairideout does this sound like a good plan?

Sounds perfect!