scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License
535 stars 150 forks source link

Additional modes (surface proteins, methylation, …) #237

Closed gokceneraslan closed 1 year ago

gokceneraslan commented 4 years ago

In cell hashing or CITE-seq datasets, an additional count matrix is produced and it stores how many ADT/HTO barcodes (see this for details https://genomebiology.biomedcentral.com/track/pdf/10.1186/s13059-018-1603-1) a cell has. Then depending on the experiment type we use these barcodes to either demultiplex cells into their original "sources" (HTO case) or to quantify protein expression (ADT case). Here is the PBMC HTO file from Seurat tutorial (rows are sample names i.e. HTO barcode IDs, columns are cells):

image

Link to file: https://www.dropbox.com/sh/c5gcjm35nglmvcv/AABGz9VO6gX9bVr5R2qahTZha?dl=0&preview=pbmc_hto_mtx.rds

Right now, we can store these counts in adata.obsm as @wflynny suggested, however there is no place to store barcode strings (or protein names or sample names depending on what barcodes represent) in adata.obsm. One hack would be to store them in adata.uns but that'd be very ugly. Alternatively one can store everything in adata.obs but that'd also pollute the obs and ignore the multivariate nature of the barcodes.

What would be a good solution here? Would it make sense to add "column names" to an adata.obsm? Since they're currently stored as numpy arrays this seems infeasible but how much effort is it to store them as dataframes (like obs) instead of matrices? Alternatively, sc.get might allow us to access a group of columns in adata.obs for convenience.

Let me know what you think. (Btw, related discussion: https://github.com/theislab/scanpy/issues/351)

gokceneraslan commented 4 years ago

Maybe it's important to separate ADT (cite-seq) and HTO (hashing) cases because ADT is more in line with multimodal analysis which also includes protein+rna or atac+rna (e.g. https://www.nature.com/articles/s41587-019-0290-0). Scanpy's behaviour for multimodality is undefined since computational pipelines are not established. But the hashing case is very relevant for Scanpy because it enables the analysis of multiplexed samples.

flying-sheep commented 4 years ago

In git master, pandas DataFrames are allowed as obsm entries. Check out #228 for a discussion.

I’m inclined to allow DataFrames there as long as they’re homogeneous, (inhomogeneous ones can be represented as hierarchical column index in obs, right?)

gokceneraslan commented 4 years ago

Oh ok, I didn't know about the discussion at all :) That's super cool, I wrote a comment there.

fidelram commented 4 years ago

@gokceneraslan I like your idea of storing HTO/ADT into .obsm as this seems to almost ready to be used. However, as you point out, there are different use cases that we may want to consider to see if .obsm can accommodate all the functionality required.

In general, what seems to be missing in scanpy is a placeholder for matrices that contain the same number of cells but different number of variables. Currently, the layers address the case of same cells, same genes. The missing relationship should be similar the the one between adata and adata.raw but extended to multiple cases, for example adata.citeseq, adata.atac etc. In Seurat, I think this is implemented as assays.

Under the above scenario, I could envision that something like this should be possible (extrapolating from current adata.raw functionality in which adata.obs is shared):

sc.tl.leiden(adata.citeseq, key_added='citeseq_clustering')

This should use the matrix and neighbors information contained in adata.citeseq, which itself is an AnnData object (as is already the case for adata.raw), and update adata.obs by adding the citeseq_clusterin column.

Implementing something like this does not look like a low hanging fruit but given the code already available for adata.raw maybe the possibility of such extension can be investigated.

flying-sheep commented 4 years ago

I’m currently thinking about multimodal data. I think adding them to the count matrix is a bad idea. Instead I’m thinking about this:

  1. Unify X and layers. X will become property(lambda self: self.layers[None])
  2. We create wrappers with another primary mode like adata_prot = adata.mode('prot')
  3. Operations on that wrapper will operate on the layers of the active mode.

Questions

Alternatives:

ivirshup commented 4 years ago

I like the ideas in this discussion! While I've definitely got some thoughts on this, I wanted to point to how other people were dealing with this problem – so we can think about the pros and cons of their approaches:

Brief, general questions:

As is done with layers in parts, all functions gain a parameter to specify which layer to operate on.

Why do you think this would be too painful? I've played around with this a bit, and think it's promising. In particular I think it reduces a lot of complexity by establishing some shared code for argument handling.

flying-sheep commented 4 years ago

About naming: Seurat calls its different modes “assays” SCE calls its layers “assays”. Confusing! I don’t want us to contribute to confusion, but why does nobody call this the way to store multimodal data “modes” outside of this issue? altExperiment is cumbersome, sidecar is cute.

Should we account for relationships across modalities? E.g. connect the transcript expression to the protein?

Good question. I’d assume the relationship here may even be a many-to-many graph for each pair of modes, right?

Why do you think this would be too painful?

We’d have to modify every single function, which is reason enough. If we do it my way, 3rd party code will just work instead of us having to bug them to change it. Also it’s easier to miss a mode='CITE' deep in the param list than using the wrong object as first parameter.

fidelram commented 4 years ago

@flying-sheep I support your arguments against something like layers.

I don't understand exactly what you mean with making AnnData stateful. I guess that by setting adata.mode = 'prot' your suggestion is that this will set adata.X to contain 'prot' data (thus changing the shape) and move whatever was in adata.X, adata.var, adata.varm etc. to for example a RNA container? If that is the case, I also don't like it, especially because the mode is hidden away.

With respect to your preferred suggestion, why do you call it a 'wrapper'? I think you have something quite clear in mind already on how to implement this. I like the idea of using the word mode. That would be my preferred naming.

With respect to your question about what is global and what is mode specific I would say that anything related to obs and obsm is global because that's the axis shared by all modes. Otherwise, each mode should have its own .uns, .var, .var_names

Under your proposal, the following statements should be equivalent?

sc.tl.leiden(adata.mode['prot'], key_added='prot_clustering')
# now adata.obs has 'prot_clustering'
adata_prot = adata.mode('prot')
sc.tl.leiden(adata_prot, key_added='prot_clustering')
# now adata.obs has 'prot_clustering'
flying-sheep commented 4 years ago

With respect to your preferred suggestion, why do you call it a 'wrapper'?

The idea is that it isn’t an AnnData object, it just behaves like one. All operations regarding obs go directly to a parent AnnData object, with the difference that shape, X, layers, var{,m,p}, are for a different mode.

Our API is already pretty big so I don’t think we should make different modes’ .var, … available directly via something like adata.mode_var['prot] or so. I think adata.mode['prot'].var is clean enough.

Under your proposal, the following statements should be equivalent?

Hmm, That’s the same code, except that one uses __getitem__ and the other __call__. I’m open on which one to use, but it shouldn’t be both. I was also thinking about only allowing the following, but I think it might be cumbersome? Maybe it’s clearer :thinking:

with adata.mode('prot') as adata_prot:
    sc.tl.leiden(adata_prot, key_added='prot_clustering')

That would force the use of alternative modes to be temporary, which would be clean for operations, but cumbersome to retrieve data, as aforementioned adata.mode['prot'].var wouldn’t work anymore.

fidelram commented 4 years ago

Thanks for the reply. Sounds good to me. Any idea on how difficult/long it would be to implement this?

flying-sheep commented 4 years ago

Thank you for the discussion :smile:

It probably won’t take that long with a plan: We just need to

  1. create a list of public APIs for AnnData
  2. create a class (called, say ModeView) that implements them by deferring to self._parent (or self._modeless since _parent is already taken)
  3. Create a new AlignedMapping aligned to the obs axis (almost like obsm’s class, only using another _view_class)
  4. Write tests that make sure the right properties are shared and the other ones aren’t
  5. Write tests about subsetting the original and the ModeView.

The only question left: Currently, AnnData objects can be “actualized” and “slice views”. This is because slice views can be actualized without creating a new object. I’m not sure how to cleanly solve this with a second kind of view (mode views).

fidelram commented 4 years ago

@flying-sheep With respect to the open question. If we start by the principle that modes are anndata objects that share .obs, those actualizations and slices that involved .obs should propagate to all other modes while anything to do with .var only affects the current mode.

ivirshup commented 4 years ago

Sorry to have missed this discussion! I managed to injure myself and was out of action for most of the last couple weeks 🤕

Would it be fair to say what @flying-sheep is proposing is pretty similar to SingleCellExperiments AltExp? My understanding: This would define a new object which has one master obs axis, but allow multiple options for var axes. You can then pick an axis and get a ModeView which would behave a lot like a AnnData object, but would allow updating the original.

I have a few api questions.

It would be interesting, and powerful, if this could be extended to allow multiple obs "modes" (i.e. multiple datasets) along with multiple modalities. Similar to performing a joint clustering on cells across modalities, can we find a consistent regulatory networks across experiments? Any thoughts on this?

flying-sheep commented 4 years ago

Oh no, I didn’t know you hurt yourself! I hope you are feeling better now! What happened?

Those are good questions and made me think up a few more.

Additional questions:

jlrestrepol commented 4 years ago

This discussion is much appreciated! I'm working in CITE-Seq and up to now we have been storing the proteomic data in .obsm and the protein names in .uns and as it was pointed out here is messy and ugly. I ended up creating a new adata object with proteomic data and handling it independently from the genetic data. Working like this makes it harder to do a combined analysis like plotting marker genes and marker proteins of the same cell type. This would be solved having both adata objects as a part of a bigger abstract object and sharing .obs and .obsm.

Also, allowing Data Frames in .obsm is very useful in our case, we can easily plot the complete information in the same figure, however we are still not able to calculate embeddings or other representations.

ivirshup commented 4 years ago

@jlrestrepol

Thanks for the feedback. What you're doing sounds a lot like what I've been doing on the master branch. Once data frames are in obsm, I would recommend still using two anndatas for all preprocessing, then something like rna.obsm["surface_markers"] = prot.to_df() to help with plotting.

@flying-sheep, a bit more feedback and ideas.

Oh no, I didn’t know you hurt yourself! I hope you are feeling better now! What happened?

Really boring accident, just ended up with some stitches and a broken nose. Worst part was that I couldn't wear my glasses so couldn't do much.

We have to think of a rigorous way to disambiguate here if there’s names appearing in different modes or so.

Yeah, there will definitely be name collisions here. I think my idea for accesor objects could help here. I think I'd definitely like them to include a shorthand for access. I just wish we could define our own string literals to make this more concise.

The inter-mode graph came up already. I think we need that. Maybe when supplying a 2D index?

How indices for multiple labelled dimensions is an interesting problem. I'm aware of a few different approaches being taken by packages like Xarray, DimensionalArrays.jl, and AxisArrays.jl. I'm going to look into this more.

This would only allow one graph per pair, but that’s probably all we need.

I think we have to allow multiple. An object with .layers is more useful than one with just .X, right?

What happens with modes if you transpose your AnnData object if we don’t have obs modes? Will it just raise an error?

I'm don't think transpose is well defined here. Even if there are obs "modes", It's not obvious to me what a transpose means in this context. Personally I don't like that anndata has a transpose right now. For one since it totally breaks .raw.

flying-sheep commented 4 years ago

I think we have to allow multiple. An object with .layers is more useful than one with just .X, right?

So if we do that, ad.mode['x', 'y'] returns a AlignedArrays object aligned to ad.mode['x'].var_names and ad.mode['y'].var_names? So you can do ad.mode[None, 'prot']['splice'] and get a graph relating transcripts (ad.var_names) to proteins (ad.mode['prot'].var_names)?

fidelram commented 4 years ago

is anyone from epi-scanpy following this issue?

flying-sheep commented 4 years ago

ping @DaneseAnna

DaneseAnna commented 4 years ago

Ok, let's me catch up! ping @kridsadakorn

flying-sheep commented 4 years ago

Hi, I had a fruitful discussion with @DaneseAnna, @b-schubert, @le-ander and a few more

LuckyMD commented 4 years ago

I reckon a really nice workaround for the mode-specific obsm question, would be to just map adata.mode.obsm['X_pca'] to adata.obsm['X_pca_{mode_name}']. Then you have an easy way to access everything in the same place, and you have a useful shorthand for ease-of-use.

flying-sheep commented 4 years ago

Yes, that’s the first idea I mentioned to solve this :wink:

LuckyMD commented 4 years ago

ah, okay. I understood your suggestion as an either/or.

ivirshup commented 4 years ago

What to do if multiple modes pass quality filtering differently? ... a obs-mask per mode that is only respected by some algorithms?

I think this is the way to go. Deciding which algorithm uses which mask can be done via default based on the mode being accessed, or a keyword argument like obs_mask=.... I'd argue this should happen with var as well to generalize "highly_variable".

A possibility would be a ChainMap(mode.obsm, mode.parent_adata.obsm), but that might be to complex.

I think something like a ChainMap would be the right way to do this, but the ChainMap might not have the exact update mechanics we want.

flying-sheep commented 4 years ago

Yeah, it’ll be quite complex to understand, not so much to implement. A good repr would help:

>>> adata.mode['prot'].obsm
AxisArrays aligned to the .obs axis with
        own keys: X_pca, X_tsne
  inherited keys: X_diffmap
   shadowed keys: X_pca (both owned and inherited, owned version preferred)
adamgayoso commented 4 years ago

This looks interesting, are there any updates on implementing this? With totalVI we opted to put the protein expression in adata.obsm["protein_expression"] and names in .uns. It's not a very elegant solution.

scottgigante commented 4 years ago

Bumping this. Would be extremely useful to have a unified API for this.

dweemx commented 3 years ago

Hi, AnnData is a great format to work with! We would like to use AnnData as our default data format. However this multi-modal feature is quite an important feature that we would require. I was wondering whether there is any ETA for having this feature implemented AnnData ?

grst commented 3 years ago

Hi all,

interesting discussion! I like the adata.mode["prot"].{X,var,...} API and also its context-manager variant.

As far as I understood, you are assuming that obs is shared between all modes. However, when working with immune receptor (IR) data, it would be very "clean" to have the IR-related obs columns in a separate "mode" than the transcriptome-related columns. Both modes would have the same n_obs, though.

In scirpy, we currently store all IR-related information (such as VDJ-genes) in obs for easy plotting with scanpy. However, this clutters obs with many automatically generated columns (see example below). Therefore, for this usecase it would be nice to have separate adata.obs and adata.mode["ir"].obs.

In this context, parsing strings for explicit access to the different modes while plotting also sounds like a great idea:

sc.pl.umap(adata, color=["patient", "ir:has_ir"])

>>> adata.obs.columns
Index(['cluster_orig', 'patient', 'sample', 'source', 'clonotype_orig',
       'multi_chain', 'IR_VJ_1_locus', 'IR_VJ_2_locus', 'IR_VDJ_1_locus',
       'IR_VDJ_2_locus', 'IR_VJ_1_cdr3', 'IR_VJ_2_cdr3', 'IR_VDJ_1_cdr3',
       'IR_VDJ_2_cdr3', 'IR_VJ_1_junction_ins', 'IR_VJ_2_junction_ins',
       'IR_VDJ_1_junction_ins', 'IR_VDJ_2_junction_ins', 'IR_VJ_1_expr',
       'IR_VJ_2_expr', 'IR_VDJ_1_expr', 'IR_VDJ_2_expr', 'IR_VJ_1_v_gene',
       'IR_VJ_2_v_gene', 'IR_VDJ_1_v_gene', 'IR_VDJ_2_v_gene',
       'IR_VJ_1_d_gene', 'IR_VJ_2_d_gene', 'IR_VDJ_1_d_gene',
       'IR_VDJ_2_d_gene', 'IR_VJ_1_j_gene', 'IR_VJ_2_j_gene',
       'IR_VDJ_1_j_gene', 'IR_VDJ_2_j_gene', 'IR_VJ_1_c_gene',
       'IR_VJ_2_c_gene', 'IR_VDJ_1_c_gene', 'IR_VDJ_2_c_gene',
       'IR_VJ_1_cdr3_nt', 'IR_VJ_2_cdr3_nt', 'IR_VDJ_1_cdr3_nt',
       'IR_VDJ_2_cdr3_nt', 'has_ir', 'batch'],
      dtype='object')
gokceneraslan commented 3 years ago

Just heard about dandelion (https://github.com/zktuong/dandelion/) on Twitter, looping in @zktuong too since he possibly thought about dandelion-scanpy integration.

zktuong commented 3 years ago

Hi, +1 from me to the modes API as well and I can see how I can fit dandelion's slots to that system. Being a complete novice, I went about the long way to create a different class object to handle the bits I wanted to acheive with dandelion and then just transfer to the .obs slot afterwards. I have the same issue as with @grst with the whole chunk of data that comes from immune receptors, and the format i'm adhering to is the one imposed by the Adaptive Immune Receptor Repertoire community standards for interoperability and consistency, but it may some times contain 100s of columns and not every column is going to be useful for every step. So a second data frame within the mode would be great i think.

dawe commented 3 years ago

Hi all, adding modes would be great, I stand for it as well!

  • What to do if multiple modes pass quality filtering differently? E.g. we lose cells x, y, z in the proteome but cells z, a, b in the transcriptome – x, y are however fine for transcriptome analysis and a, b for proteome analysis.

I wouldn't enforce cells to be shared across modalities. The set of cell IDs can be equal, intersect but in principle they can be completely disjoint (say a scRNA-seq experiment + scATAC-seq experiment on two splits of the same sample). Each modality will have their own filters. Of course, one may add a function to retain only shared cells. Moreover, as there will be more than 2 modalities, there will be no guarantee that a cell will survive in all of them

We need some way to allow keeping them. How? all-NA rows? a obs-mask per mode that is only respected by some algorithms? I don’t think .raw is the answer.

  • People want mode-specific .obsm. (E.g. a PCA for each mode). We could handle that by automatically appending f'_{mode_name}' to each basis (e.g. 'X_pca_prot') or by giving every mode its own .obs{,m,p}.

Wouldn't be better to think to AnnData as a hierarchical container in which each mode is a AnnData in the way you intend it today? I believe hdf5 supports that, also in that case you don't need to fill with NA or to name things differently. You only need some high-level function that is able to retrieve data from a specific mode (pretty much like you do with layers today)

An example for a multi-modal algorithm would be Stegle Labs’ MOFA. Another example is e.g. plotting stuff, e.g. transcript against protein: sc.pl.scatter(adata, x='CD8A', y='prot:CD8')

Nice! If modes will be available, MOFA+ could enter in the sc.external tools:

from mofapy2.run.entry_point import entry_point
import scanpy as sc

shared_cells = sc.tl.get_shared_cells(adata, modes=['rna', 'meth'])

ent = entry_point()
D = [adata.mode['rna'], adata.mode['meth']]
M = 2
K = 20
N = [len(shared_cells)]
G = 1

data_mat = [[None for g in range(G)] for m in range(M)]

data_mat[0][0] = adata.mode['rna'].raw.X.toarray()
data_mat[1][0] = adata_mode['meth']raw.X.toarray()

ent.set_data_options(…)

ent.set_data_matrix(data_mat, 
    likelihoods = …,
    views_names=adata.mode_keys(), 
    samples_names=[shared_cells],
    features_names=[adata.mode['rna'].var_names, adata.mode['meth'].var_names] 
)

ent.set_model_options(…)
ent.set_train_options(…)

ent.build()
ent.run()

#assuming if mode is not specified, is referred to object shared across modalities
adata.obsm['X_mofa'] = ent['expectations']['Z']['group0'].value
rcannood commented 3 years ago

@flying-sheep @ivirshup Any updates on the mode API? :eyes:

gtca commented 3 years ago

Since this issue is coming up in relevant discussions, I'd mention — to the benefit of the reader — a recently introduced multimodal datatype, MuData, that builds on AnnData for individual modalities and has an AnnData-like interface itself (e.g. .obs, .var, .obsm attributes, etc.). It does store modalities in its .mod attribute (e.g. adata = mdata.mod["rna"]) and should address quite a few points from this discussion.

This format is currently implemented as a part of the muon framework for multimodal omics data analysis. For the CITE-seq analysis itself, which this thread has started with, there are a few relevant notebooks for muon.

With respect to the anndata development, this raises a question if we might want to rather encourage the use of MuData for multimodal datasets rather than rewriting AnnData to handle multimodal designs.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!

ivirshup commented 1 year ago

Quite the issue!

While there were a ton of valuable ideas brought up in this thread, I'm going to mark this as completed since we now have MuData.