simonwm / tacco

TACCO: Transfer of Annotations to Cells and their COmbinations
BSD 3-Clause "New" or "Revised" License

Mapping gene programs #13

Closed: kmakino14 closed this issue 7 months ago

kmakino14 commented 9 months ago

Dear all,

First of all, thank you for developing such a useful analysis tool.

I would like to map a specific gene program with TACCO, as in your paper (https://www.biorxiv.org/content/10.1101/2022.10.02.508492v1.full).

Could you please let me know how to supply the gene program as a reference? Should I input a programs x genes matrix?

Thanks!

Best, KM

simonwm commented 8 months ago

Dear KM,

thank you for your interest in TACCO! And sorry for the slow response.

Yes, you can input a gene-by-program matrix (genes in the rows, programs in the columns).

TACCO's annotate function operates on reference profiles, which are stored in a .varm slot of the reference AnnData. If the profiles are not there from the beginning, they are generated from the observations, e.g. from .X. If they are already there, they are used as they are; annotate does not care where they come from. This makes it possible to supply arbitrary reference profiles, such as gene programs instead of cell types. To supply profiles directly in the reference, one can, for example, create a fresh, empty AnnData of the correct shape and populate a .varm slot:

import anndata as ad
import pandas as pd

# assuming you have a profiles_dataframe with your profiles in the columns and genes in the rows

# create an AnnData of shape (0, n_genes) with the genes as .var.index, but otherwise empty
profile_reference = ad.AnnData(var=profiles_dataframe[[]])

# store the profiles under the .varm key 'profiles'
profile_reference.varm['profiles'] = profiles_dataframe

This profile_reference can then be passed as the reference in calls to the annotate function.
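For illustration, here is a minimal sketch of how a profiles_dataframe in the expected orientation could be assembled from gene programs given as gene weights. The program names, gene names, and weights are made up, and filling absent genes with 0 is an assumption about what a sensible profile looks like, not something the thread specifies.

```python
import pandas as pd

# Hypothetical gene programs: program name -> {gene: weight}
gene_programs = {
    "program_A": {"GeneA": 3.0, "GeneB": 1.0},
    "program_B": {"GeneB": 2.0, "GeneC": 2.0},
}

# A dict of dicts gives genes in the rows and programs in the columns.
# Genes that are absent from a program get weight 0 (an assumption).
profiles_dataframe = pd.DataFrame(gene_programs).fillna(0.0)

# Basic sanity checks before handing the profiles to anything else
assert profiles_dataframe.index.is_unique, "gene names must be unique"
assert (profiles_dataframe >= 0).all().all(), "weights should be non-negative"

print(profiles_dataframe)
```

With profile_reference built from this dataframe as shown above, a call along the lines of tc.tl.annotate(adata, profile_reference, annotation_key='profiles') should pick up the side-loaded profiles, where adata is your own dataset and tc is the imported tacco package. Treat the exact arguments as an assumption and check them against the annotate documentation.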

In the docs of the annotate function, this feature is a little hidden in the description of the annotation_key parameter. This is because the "side-loading" of profiles is a less standard use case and can lead to hiccups later in the processing. For example, some annotation methods might require, or work better with, the expression data itself in .X, either for calculating the prior for the frequencies of the profiles in the data or for calculating a standard deviation of the profiles. This could also be accommodated by populating .X, instead of a .varm key, with "cells" sampled per profile according to the expected standard deviation and the expected relative frequencies. But this becomes increasingly hacky and is therefore not recommended.
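Purely to make the "sampled cells" idea concrete, here is a rough sketch of how such pseudo-cells could be generated with NumPy. Everything in it is an assumption: the profiles, the expected frequencies, the sequencing depth, and in particular the multinomial noise model, which is a simple stand-in and does not let you set the standard deviation explicitly. It is not TACCO code and, as said above, not a recommended workflow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical profiles: rows are genes, columns are programs
profiles = np.array([
    [3.0, 0.0],
    [1.0, 2.0],
    [0.0, 2.0],
])
program_names = ["program_A", "program_B"]

# Assumed relative frequencies of the programs in the data
expected_frequencies = np.array([0.75, 0.25])
n_cells = 200          # total number of pseudo-cells to generate
counts_per_cell = 500  # assumed total counts per pseudo-cell

# Number of pseudo-cells per program, proportional to the frequencies
cells_per_program = [int(round(f * n_cells)) for f in expected_frequencies]

# Turn each profile into a probability distribution over genes
gene_probabilities = profiles / profiles.sum(axis=0)

# Draw multinomial counts for each pseudo-cell from its program's distribution
blocks, labels = [], []
for j, name in enumerate(program_names):
    blocks.append(
        rng.multinomial(counts_per_cell, gene_probabilities[:, j],
                        size=cells_per_program[j])
    )
    labels.extend([name] * cells_per_program[j])

X = np.vstack(blocks)  # pseudo-count matrix of shape (n_cells, n_genes)
```

The matrix X would then go into .X of the reference AnnData, with labels stored as an .obs column that serves as the annotation.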

I hope this helps!