perslab / CELLEX

CELLEX (CELL-type EXpression-specificity)
GNU General Public License v3.0

Metadata_class use #30

Closed andreanuzzo closed 3 years ago

andreanuzzo commented 3 years ago

Hi,

Quick question: what is the purpose of metadata_class in the vignette? I assume it's needed to compute ESµ between conditions? Its use is not explained in the documentation, in the publication itself, or in the longer CELLECT tutorial.

tstannius commented 3 years ago

Hi Andrea,

Thank you for your question.

I assume you are referring to the metadata variable in the snippet below, which is passed as the annotation argument when creating a new ESObject.

import numpy as np
import pandas as pd
import cellex

data = pd.read_csv("./data.csv", index_col=0)
metadata = pd.read_csv("./metadata.csv", index_col=0) # this variable

eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)
eso.compute(verbose=True)
eso.results["esmu"].to_csv("mydataset.esmu.csv.gz")

The metadata is essentially a map from each cell id to a user-defined group: a condition, a cell type, a cluster id, or any other grouping.

cell_id  cell_type
cell_1   type_A
...      ...
cell_9   type_C

The metadata is, as you say, used to compute ESµ. To be specific, the metadata is used to group single cells and compute summary statistics for the group, which are then used by the various Expression Specificity metrics (or differential expression metrics) to calculate expression specificity for each gene in the group. The different ES metrics are later summarized in the ESµ metric.
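As a rough illustration of the grouping step described above, the sketch below groups cells by their annotation and computes one summary statistic (the mean) per gene and group. This is not CELLEX's internal code: the toy values, the choice of the mean, and the genes-as-rows layout are assumptions made for the example.

```python
import pandas as pd

# Toy expression matrix (assumed layout: genes as rows, cells as columns).
data = pd.DataFrame(
    {"cell_1": [5.0, 0.0], "cell_2": [3.0, 1.0], "cell_3": [0.0, 4.0]},
    index=["gene_A", "gene_B"],
)

# Annotation: one row per cell, indexed by cell id, one column with the group label.
metadata = pd.DataFrame(
    {"cell_type": ["type_A", "type_A", "type_C"]},
    index=["cell_1", "cell_2", "cell_3"],
)

# Align the labels to the cell order in `data`, then average each gene within each group.
labels = metadata.loc[data.columns, "cell_type"]
group_means = data.T.groupby(labels).mean().T

print(group_means)
```

Per-group summaries of this kind are what the individual ES metrics would then work from; the actual statistics CELLEX computes may differ.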

I hope that answered your question. Let me know if there's anything I can clarify.

Best, Tobi

tstannius commented 3 years ago

Closing, as the question has been answered.

If you feel this is not the case, feel free to re-open.

andreanuzzo commented 3 years ago

Hi Tobias,

No, I wasn't referring to that, but to the metadata_class variable that appears in the tutorial, i.e. cell 5 here:

import loompy
import pandas as pd

# pathData, nameId, nameAnno and nameClass are assumed to be defined in earlier notebook cells
with loompy.connect(pathData) as ds:
    rows = ds.row_attrs["Gene"]
    cols = ds.col_attrs[nameId]
    # our data: genes as rows, cells as columns
    data = pd.DataFrame(ds[:, :], index=rows, columns=cols)
    # the type annotation for individual cells
    metadata = pd.DataFrame(data={"cell_type": ds.col_attrs[nameAnno]}, index=ds.col_attrs[nameId])
    # the class annotation, indexed by cell type
    metadata_class = pd.DataFrame(data={"cell_class": ds.col_attrs[nameClass]}, index=ds.col_attrs[nameAnno])

That variable is assigned but not used anywhere. Does it mean CELLEX is able to determine ESµ specificities for other groupings, e.g. disease-related ESµ for each cell line?

tstannius commented 3 years ago

Hi Andrea,

Thanks for clarifying! I am not entirely familiar with this notebook, as it was developed by an MSc student in Pers Lab.

I agree that the metadata_class variable does not appear to be used anywhere. My guess is that the student used it in place of metadata for another kind of analysis.

> Does it mean CELLEX is able to determine ESµ specificities for other groupings, e.g. disease-related ESµ for each cell line?

Yes - you can specify any grouping you like that you are interested in investigating!

andreanuzzo commented 3 years ago

Thanks Tobias!

I assume that to do this type of grouping I have to create combined dummy variables myself, e.g. by concatenating cell-type names and disease status (B_cell_healthy, B_cell_diseased, Plasma_healthy, ...), since it doesn't seem that the eso.compute() function can accept a list of metadata columns. Am I correct?

tstannius commented 3 years ago

Yes, this is correct and sounds like a good approach.

Then you should get:

cell_id  cell_type
cell_1   B_cell_healthy
...      ...
cell_3   B_cell_diseased
...      ...
cell_9   Plasma_healthy
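A minimal pandas sketch of how such combined labels could be built. The two-column annotation frame and the column name disease_status are hypothetical and used only for illustration; only the final single-column layout follows the table above.

```python
import pandas as pd

# Hypothetical per-cell annotation with two grouping columns (names are illustrative).
annotation = pd.DataFrame(
    {
        "cell_type": ["B_cell", "B_cell", "Plasma"],
        "disease_status": ["healthy", "diseased", "healthy"],
    },
    index=pd.Index(["cell_1", "cell_3", "cell_9"], name="cell_id"),
)

# Concatenate the two columns into one label per cell and keep a single-column frame.
combined = annotation["cell_type"] + "_" + annotation["disease_status"]
metadata = combined.to_frame("cell_type")

print(metadata)
```

The resulting metadata frame has one label per cell and could then be passed as the annotation argument to cellex.ESObject, as in the first snippet of this thread.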