Closed gokceneraslan closed 1 year ago
Maybe it's important to separate ADT (cite-seq) and HTO (hashing) cases because ADT is more in line with multimodal analysis which also includes protein+rna or atac+rna (e.g. https://www.nature.com/articles/s41587-019-0290-0). Scanpy's behaviour for multimodality is undefined since computational pipelines are not established. But the hashing case is very relevant for Scanpy because it enables the analysis of multiplexed samples.
In git master, pandas DataFrames are allowed as obsm entries. Check out #228 for a discussion.
I’m inclined to allow DataFrames there as long as they’re homogeneous, (inhomogeneous ones can be represented as hierarchical column index in obs
, right?)
Oh ok, I didn't know about the discussion at all :) That's super cool, I wrote a comment there.
@gokceneraslan I like your idea of storing HTO/ADT into .obsm
as this seems to almost ready to be used. However, as you point out, there are different use cases that we may want to consider to see if .obsm
can accommodate all the functionality required.
.obsm
will not be useful for this as the current methods need the data in adata.X
..obsm
, and the methods to do the demultiplexing can be directed to find the data there.In general, what seems to be missing in scanpy is a placeholder for matrices that contain the same number of cells but different number of variables. Currently, the layers
address the case of same cells, same genes. The missing relationship should be similar the the one between adata
and adata.raw
but extended to multiple cases, for example adata.citeseq
, adata.atac
etc. In Seurat, I think this is implemented as assays
.
Under the above scenario, I could envision that something like this should be possible (extrapolating from current adata.raw
functionality in which adata.obs
is shared):
sc.tl.leiden(adata.citeseq, key_added='citeseq_clustering')
This should use the matrix and neighbors information contained in adata.citeseq
, which itself is an AnnData object (as is already the case for adata.raw
), and update adata.obs
by adding the citeseq_clusterin
column.
Implementing something like this does not look like a low hanging fruit but given the code already available for adata.raw
maybe the possibility of such extension can be investigated.
I’m currently thinking about multimodal data. I think adding them to the count matrix is a bad idea. Instead I’m thinking about this:
X
will become property(lambda self: self.layers[None])
adata_prot = adata.mode('prot')
.uns[mode][...]
?As is done with layers in parts, all functions gain a parameter to specify which layer to operate on.
Cumbersome and needs a lot of changes. No.
Make AnnData stateful: adata.mode = 'prot'
Flaky, hard to debug. An assignment that changes the shape? Weird! The wrapper/view idea seems like a straight upgrade!
I like the ideas in this discussion! While I've definitely got some thoughts on this, I wanted to point to how other people were dealing with this problem – so we can think about the pros and cons of their approaches:
scater
: https://github.com/davismcc/scater/pull/93Brief, general questions:
As is done with layers in parts, all functions gain a parameter to specify which layer to operate on.
Why do you think this would be too painful? I've played around with this a bit, and think it's promising. In particular I think it reduces a lot of complexity by establishing some shared code for argument handling.
About naming: Seurat calls its different modes “assays” SCE calls its layers “assays”. Confusing! I don’t want us to contribute to confusion, but why does nobody call this the way to store multimodal data “modes” outside of this issue? altExperiment
is cumbersome, sidecar
is cute.
Should we account for relationships across modalities? E.g. connect the transcript expression to the protein?
Good question. I’d assume the relationship here may even be a many-to-many graph for each pair of modes, right?
Why do you think this would be too painful?
We’d have to modify every single function, which is reason enough. If we do it my way, 3rd party code will just work instead of us having to bug them to change it. Also it’s easier to miss a mode='CITE'
deep in the param list than using the wrong object as first parameter.
@flying-sheep I support your arguments against something like layers.
I don't understand exactly what you mean with making AnnData stateful. I guess that by setting adata.mode = 'prot'
your suggestion is that this will set adata.X
to contain 'prot' data (thus changing the shape) and move whatever was in adata.X
, adata.var
, adata.varm
etc. to for example a RNA
container? If that is the case, I also don't like it, especially because the mode is hidden away.
With respect to your preferred suggestion, why do you call it a 'wrapper'? I think you have something quite clear in mind already on how to implement this. I like the idea of using the word mode
. That would be my preferred naming.
With respect to your question about what is global and what is mode specific I would say that anything related to obs
and obsm
is global because that's the axis shared by all modes. Otherwise, each mode should have its own .uns
, .var
, .var_names
Under your proposal, the following statements should be equivalent?
sc.tl.leiden(adata.mode['prot'], key_added='prot_clustering')
# now adata.obs has 'prot_clustering'
adata_prot = adata.mode('prot')
sc.tl.leiden(adata_prot, key_added='prot_clustering')
# now adata.obs has 'prot_clustering'
With respect to your preferred suggestion, why do you call it a 'wrapper'?
The idea is that it isn’t an AnnData
object, it just behaves like one. All operations regarding obs
go directly to a parent AnnData
object, with the difference that shape
, X
, layers
, var{,m,p}
, are for a different mode.
Our API is already pretty big so I don’t think we should make different modes’ .var
, … available directly via something like adata.mode_var['prot]
or so. I think adata.mode['prot'].var
is clean enough.
Under your proposal, the following statements should be equivalent?
Hmm, That’s the same code, except that one uses __getitem__
and the other __call__
. I’m open on which one to use, but it shouldn’t be both. I was also thinking about only allowing the following, but I think it might be cumbersome? Maybe it’s clearer :thinking:
with adata.mode('prot') as adata_prot:
sc.tl.leiden(adata_prot, key_added='prot_clustering')
That would force the use of alternative modes to be temporary, which would be clean for operations, but cumbersome to retrieve data, as aforementioned adata.mode['prot'].var
wouldn’t work anymore.
Thanks for the reply. Sounds good to me. Any idea on how difficult/long it would be to implement this?
Thank you for the discussion :smile:
It probably won’t take that long with a plan: We just need to
ModeView
) that implements them by deferring to self._parent
(or self._modeless
since _parent
is already taken)AlignedMapping
aligned to the obs
axis (almost like obsm
’s class, only using another _view_class
)ModeView
.The only question left: Currently, AnnData objects can be “actualized” and “slice views”. This is because slice views can be actualized without creating a new object. I’m not sure how to cleanly solve this with a second kind of view (mode views).
@flying-sheep With respect to the open question. If we start by the principle that modes are anndata objects that share .obs
, those actualizations and slices that involved .obs
should propagate to all other modes while anything to do with .var
only affects the current mode.
Sorry to have missed this discussion! I managed to injure myself and was out of action for most of the last couple weeks 🤕
Would it be fair to say what @flying-sheep is proposing is pretty similar to SingleCellExperiments
AltExp
? My understanding: This would define a new object which has one master obs
axis, but allow multiple options for var
axes. You can then pick an axis and get a ModeView
which would behave a lot like a AnnData object, but would allow updating the original.
I have a few api questions.
leiden
or are these networks just objects in obsp
now?It would be interesting, and powerful, if this could be extended to allow multiple obs
"modes" (i.e. multiple datasets) along with multiple modalities. Similar to performing a joint clustering on cells across modalities, can we find a consistent regulatory networks across experiments? Any thoughts on this?
Oh no, I didn’t know you hurt yourself! I hope you are feeling better now! What happened?
Those are good questions and made me think up a few more.
Good point about mixing modes in plots. I think since plotting functions already search .obs.columns
and .var_names
, it would make sense they also search the modes.
We have to think of a rigorous way to disambiguate here if there’s names appearing in different modes or so.
The inter-mode graph came up already. I think we need that. Maybe when supplying a 2D index? ad.modes[None, 'prot']
? This would only allow one graph per pair, but that’s probably all we need. (And one could still do a 2D structured array here to store multiple edge properties)
Algorithms operating on multiple modes have to be mode-aware. If the result can easily stored in obs*
(shared between modes), there’s no problem right?
Additional questions:
obs*
entries that are specific to some var
, like size factors derived from the transcriptome?AnnData
object if we don’t have obs modes? Will it just raise an error?ad.modes['prot'] = AnnData(...)
and raise an error if obs
there is anything but empty.This discussion is much appreciated! I'm working in CITE-Seq and up to now we have been storing the proteomic data in .obsm and the protein names in .uns and as it was pointed out here is messy and ugly. I ended up creating a new adata object with proteomic data and handling it independently from the genetic data. Working like this makes it harder to do a combined analysis like plotting marker genes and marker proteins of the same cell type. This would be solved having both adata objects as a part of a bigger abstract object and sharing .obs and .obsm.
Also, allowing Data Frames in .obsm is very useful in our case, we can easily plot the complete information in the same figure, however we are still not able to calculate embeddings or other representations.
@jlrestrepol
Thanks for the feedback. What you're doing sounds a lot like what I've been doing on the master branch. Once data frames are in obsm, I would recommend still using two anndatas for all preprocessing, then something like rna.obsm["surface_markers"] = prot.to_df()
to help with plotting.
@flying-sheep, a bit more feedback and ideas.
Oh no, I didn’t know you hurt yourself! I hope you are feeling better now! What happened?
Really boring accident, just ended up with some stitches and a broken nose. Worst part was that I couldn't wear my glasses so couldn't do much.
We have to think of a rigorous way to disambiguate here if there’s names appearing in different modes or so.
Yeah, there will definitely be name collisions here. I think my idea for accesor objects could help here. I think I'd definitely like them to include a shorthand for access. I just wish we could define our own string literals to make this more concise.
The inter-mode graph came up already. I think we need that. Maybe when supplying a 2D index?
How indices for multiple labelled dimensions is an interesting problem. I'm aware of a few different approaches being taken by packages like Xarray, DimensionalArrays.jl, and AxisArrays.jl. I'm going to look into this more.
This would only allow one graph per pair, but that’s probably all we need.
I think we have to allow multiple. An object with .layers
is more useful than one with just .X
, right?
What happens with modes if you transpose your AnnData object if we don’t have obs modes? Will it just raise an error?
I'm don't think transpose is well defined here. Even if there are obs
"modes", It's not obvious to me what a transpose means in this context. Personally I don't like that anndata has a transpose right now. For one since it totally breaks .raw
.
I think we have to allow multiple. An object with .layers is more useful than one with just .X, right?
So if we do that, ad.mode['x', 'y']
returns a AlignedArrays
object aligned to ad.mode['x'].var_names
and ad.mode['y'].var_names
? So you can do ad.mode[None, 'prot']['splice']
and get a graph relating transcripts (ad.var_names
) to proteins (ad.mode['prot'].var_names
)?
is anyone from epi-scanpy following this issue?
ping @DaneseAnna
Ok, let's me catch up! ping @kridsadakorn
Hi, I had a fruitful discussion with @DaneseAnna, @b-schubert, @le-ander and a few more
What to do if multiple modes pass quality filtering differently? E.g. we lose cells x, y, z in the proteome but cells z, a, b in the transcriptome – x, y are however fine for transcriptome analysis and a, b for proteome analysis.
We need some way to allow keeping them. How? all-NA rows? a obs-mask per mode that is only respected by some algorithms? I don’t think .raw
is the answer.
People want mode-specific .obsm
. (E.g. a PCA for each mode). We could handle that by automatically appending f'_{mode_name}'
to each basis (e.g. 'X_pca_prot'
) or by giving every mode its own .obs{,m,p}
.
A possibility would be a ChainMap(mode.obsm, mode.parent_adata.obsm)
, but that might be to complex.
While the whole idea about this is that AnnData-eating algorithms should be able to work on modes transparently, we need to thing about cross-mode APIs.
An example for a multi-modal algorithm would be Stegle Labs’ MOFA. Another example is e.g. plotting stuff, e.g. transcript against protein: sc.pl.scatter(adata, x='CD8A', y='prot:CD8')
I reckon a really nice workaround for the mode-specific obsm
question, would be to just map adata.mode.obsm['X_pca']
to adata.obsm['X_pca_{mode_name}']
. Then you have an easy way to access everything in the same place, and you have a useful shorthand for ease-of-use.
Yes, that’s the first idea I mentioned to solve this :wink:
ah, okay. I understood your suggestion as an either/or.
What to do if multiple modes pass quality filtering differently? ... a obs-mask per mode that is only respected by some algorithms?
I think this is the way to go. Deciding which algorithm uses which mask can be done via default based on the mode being accessed, or a keyword argument like obs_mask=...
. I'd argue this should happen with var
as well to generalize "highly_variable"
.
A possibility would be a ChainMap(mode.obsm, mode.parent_adata.obsm), but that might be to complex.
I think something like a ChainMap
would be the right way to do this, but the ChainMap
might not have the exact update mechanics we want.
Yeah, it’ll be quite complex to understand, not so much to implement. A good repr would help:
>>> adata.mode['prot'].obsm
AxisArrays aligned to the .obs axis with
own keys: X_pca, X_tsne
inherited keys: X_diffmap
shadowed keys: X_pca (both owned and inherited, owned version preferred)
This looks interesting, are there any updates on implementing this? With totalVI we opted to put the protein expression in adata.obsm["protein_expression"]
and names in .uns
. It's not a very elegant solution.
Bumping this. Would be extremely useful to have a unified API for this.
Hi, AnnData
is a great format to work with! We would like to use AnnData
as our default data format. However this multi-modal feature is quite an important feature that we would require.
I was wondering whether there is any ETA for having this feature implemented AnnData
?
Hi all,
interesting discussion! I like the adata.mode["prot"].{X,var,...}
API and also its context-manager variant.
As far as I understood, you are assuming that obs
is shared between all modes. However, when working with immune receptor (IR) data, it would be very "clean" to have the IR-related obs
columns in a separate "mode" than the transcriptome-related columns. Both modes would have the same n_obs
, though.
In scirpy, we currently store all IR-related information (such as VDJ-genes) in obs
for
easy plotting with scanpy. However, this clutters obs
with many automatically generated columns (see example below). Therefore, for this usecase it would be nice to have separate adata.obs
and adata.mode["ir"].obs
.
In this context, parsing strings for explicit access to the different modes while plotting also sounds like a great idea:
sc.pl.umap(adata, color=["patient", "ir:has_ir"])
>>> adata.obs.columns
Index(['cluster_orig', 'patient', 'sample', 'source', 'clonotype_orig',
'multi_chain', 'IR_VJ_1_locus', 'IR_VJ_2_locus', 'IR_VDJ_1_locus',
'IR_VDJ_2_locus', 'IR_VJ_1_cdr3', 'IR_VJ_2_cdr3', 'IR_VDJ_1_cdr3',
'IR_VDJ_2_cdr3', 'IR_VJ_1_junction_ins', 'IR_VJ_2_junction_ins',
'IR_VDJ_1_junction_ins', 'IR_VDJ_2_junction_ins', 'IR_VJ_1_expr',
'IR_VJ_2_expr', 'IR_VDJ_1_expr', 'IR_VDJ_2_expr', 'IR_VJ_1_v_gene',
'IR_VJ_2_v_gene', 'IR_VDJ_1_v_gene', 'IR_VDJ_2_v_gene',
'IR_VJ_1_d_gene', 'IR_VJ_2_d_gene', 'IR_VDJ_1_d_gene',
'IR_VDJ_2_d_gene', 'IR_VJ_1_j_gene', 'IR_VJ_2_j_gene',
'IR_VDJ_1_j_gene', 'IR_VDJ_2_j_gene', 'IR_VJ_1_c_gene',
'IR_VJ_2_c_gene', 'IR_VDJ_1_c_gene', 'IR_VDJ_2_c_gene',
'IR_VJ_1_cdr3_nt', 'IR_VJ_2_cdr3_nt', 'IR_VDJ_1_cdr3_nt',
'IR_VDJ_2_cdr3_nt', 'has_ir', 'batch'],
dtype='object')
Just heard about dandelion (https://github.com/zktuong/dandelion/) on Twitter, looping in @zktuong too since he possibly thought about dandelion-scanpy integration.
Hi, +1 from me to the modes API as well and I can see how I can fit dandelion's slots to that system. Being a complete novice, I went about the long way to create a different class object to handle the bits I wanted to acheive with dandelion and then just transfer to the .obs
slot afterwards. I have the same issue as with @grst with the whole chunk of data that comes from immune receptors, and the format i'm adhering to is the one imposed by the Adaptive Immune Receptor Repertoire community standards for interoperability and consistency, but it may some times contain 100s of columns and not every column is going to be useful for every step. So a second data frame within the mode would be great i think.
Hi all, adding modes would be great, I stand for it as well!
- What to do if multiple modes pass quality filtering differently? E.g. we lose cells x, y, z in the proteome but cells z, a, b in the transcriptome – x, y are however fine for transcriptome analysis and a, b for proteome analysis.
I wouldn't enforce cells to be shared across modalities. The set of cell IDs can be equal, intersect but in principle they can be completely disjoint (say a scRNA-seq experiment + scATAC-seq experiment on two splits of the same sample). Each modality will have their own filters. Of course, one may add a function to retain only shared cells. Moreover, as there will be more than 2 modalities, there will be no guarantee that a cell will survive in all of them
We need some way to allow keeping them. How? all-NA rows? a obs-mask per mode that is only respected by some algorithms? I don’t think
.raw
is the answer.
- People want mode-specific
.obsm
. (E.g. a PCA for each mode). We could handle that by automatically appendingf'_{mode_name}'
to each basis (e.g.'X_pca_prot'
) or by giving every mode its own.obs{,m,p}
.
Wouldn't be better to think to AnnData as a hierarchical container in which each mode is a AnnData in the way you intend it today? I believe hdf5 supports that, also in that case you don't need to fill with NA or to name things differently. You only need some high-level function that is able to retrieve data from a specific mode (pretty much like you do with layers today)
An example for a multi-modal algorithm would be Stegle Labs’ MOFA. Another example is e.g. plotting stuff, e.g. transcript against protein:
sc.pl.scatter(adata, x='CD8A', y='prot:CD8')
Nice! If modes will be available, MOFA+ could enter in the sc.external tools:
from mofapy2.run.entry_point import entry_point
import scanpy as sc
shared_cells = sc.tl.get_shared_cells(adata, modes=['rna', 'meth'])
ent = entry_point()
D = [adata.mode['rna'], adata.mode['meth']]
M = 2
K = 20
N = [len(shared_cells)]
G = 1
data_mat = [[None for g in range(G)] for m in range(M)]
data_mat[0][0] = adata.mode['rna'].raw.X.toarray()
data_mat[1][0] = adata_mode['meth']raw.X.toarray()
ent.set_data_options(…)
ent.set_data_matrix(data_mat,
likelihoods = …,
views_names=adata.mode_keys(),
samples_names=[shared_cells],
features_names=[adata.mode['rna'].var_names, adata.mode['meth'].var_names]
)
ent.set_model_options(…)
ent.set_train_options(…)
ent.build()
ent.run()
#assuming if mode is not specified, is referred to object shared across modalities
adata.obsm['X_mofa'] = ent['expectations']['Z']['group0'].value
@flying-sheep @ivirshup Any updates on the mode API? :eyes:
Since this issue is coming up in relevant discussions, I'd mention — to the benefit of the reader — a recently introduced multimodal datatype, MuData
, that builds on AnnData
for individual modalities and has an AnnData
-like interface itself (e.g. .obs
, .var
, .obsm
attributes, etc.). It does store modalities in its .mod
attribute (e.g. adata = mdata.mod["rna"]
) and should address quite a few points from this discussion.
This format is currently implemented as a part of the muon
framework for multimodal omics data analysis.
For the CITE-seq analysis itself, which this thread has started with, there are a few relevant notebooks for muon
.
With respect to the anndata
development, this raises a question if we might want to rather encourage the use of MuData
for multimodal datasets rather than rewriting AnnData
to handle multimodal designs.
This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!
Quite the issue!
While there were a ton of valuable ideas brought up in this thread, I'm going to mark this as completed since we now have MuData.
In cell hashing or CITE-seq datasets, an additional count matrix is produced and it stores how many ADT/HTO barcodes (see this for details https://genomebiology.biomedcentral.com/track/pdf/10.1186/s13059-018-1603-1) a cell has. Then depending on the experiment type we use these barcodes to either demultiplex cells into their original "sources" (HTO case) or to quantify protein expression (ADT case). Here is the PBMC HTO file from Seurat tutorial (rows are sample names i.e. HTO barcode IDs, columns are cells):
Link to file: https://www.dropbox.com/sh/c5gcjm35nglmvcv/AABGz9VO6gX9bVr5R2qahTZha?dl=0&preview=pbmc_hto_mtx.rds
Right now, we can store these counts in
adata.obsm
as @wflynny suggested, however there is no place to store barcode strings (or protein names or sample names depending on what barcodes represent) inadata.obsm
. One hack would be to store them in adata.uns but that'd be very ugly. Alternatively one can store everything inadata.obs
but that'd also pollute the obs and ignore the multivariate nature of the barcodes.What would be a good solution here? Would it make sense to add "column names" to an
adata.obsm
? Since they're currently stored as numpy arrays this seems infeasible but how much effort is it to store them as dataframes (like obs) instead of matrices? Alternatively,sc.get
might allow us to access a group of columns inadata.obs
for convenience.Let me know what you think. (Btw, related discussion: https://github.com/theislab/scanpy/issues/351)