Closed grst closed 1 year ago
Hi All, I am active in the AIRR Standards group, can try and answer questions. We are looking at integrating scirpy into the iReceptor Gateway (gateway.ireceptor.org) to do analysis on AIRR Cell data - so have a vested interest... 8-)
The AIRR Cell schema was just announce at the AIRR meeting this week (https://www.antibodysociety.org/the-airr-community/meetings/airr-community-meeting-vi-exploring-new-frontiers/), with a tagged (v1.4) release expected on github in the next few weeks. This is an "Experimental" schema, meaning that we think it is pretty stable, but will probably need some minor changes before it is finalized, probably in an AIRR Standard v2.0 release sometime in the "near" future.
The Cell schema is intended to represent a Cell
as an object. It is a light weight object, and contains a cell_id
that is used to cross reference information in other collections such as Rearrangement
and CellExpression
. So if you have a cell_id
from a Cell
you can find all of the Rearrangements
for the paired chains associated with the cell in an AIRR TSV file by searching for that same cell_id
. Similarly you can find all of the CellExpression
information by searching a CellExpression file for that same cell_id
.
Here are the current docs on the master branch, which will be tagged with a v1.4 release soon.
One of the outstanding things we need to resolve in the AIRR Standard is what if any "efficient" file format should be supported. Currently the default standard for all AIRR objects is a JSON representation of the standard objects. For Rearrangement
there is a defined TSV file. Such a file format does not yet exist for Cell
and more importantly CellExpression
. We could probably use some advice from this community...
We have an about to be release version of our AIRR repository (https://github.com/sfu-ireceptor/turnkey-service-php/tree/production-v4) that can store Cell and CellExpression data, but currently what you get out is a very big, verbose JSON file. It works, but...
BTW, I think it would be interesting to decorate a cell's adata.obs with AIRR Repertoire
metadata. That is, it would be quite easy (I think) to decorate the adata.obs with all sorts of subject/sample metadata for each cell such as subject.disease.disease_daignosis
, subject.age_min
, subject.sex
, sample.tissue
etc so that one can use these criteria for coloring, group, etc as well. It seems to me that these fields in adata.obs would be more valuable than the most of the Rearrangement
fields. I could see wanting some Rearrangement
fields, maybe VDJ calls, CDR3 length, locus for the paired chains, but it seems like most of the Rearrangement
fields would not be very interesting from a slicing/coloring/visualization point of view.
Hi @bcorrie,
thanks for reaching out, I'm definitely interested in working with the AIRR community to make this as interoperable as possible.
Your description of the Cell
schema is really helpful, before I wasn't sure it was to replace the Rearrangement
schema, but now I understood it's built on top of it.
Actually scirpy internally uses something very very similar to the Cell schema already. Every file format read into scirpy is first represented as a list of AirrCell
objects that have a cell_id
, a list of Rearrangement
entries and arbitrary cell-level attributes that are added as columns to adata.obs
. Once the Cell
schema is final, I could simply adjust my AirrCell
implementation to match the cell-schema, which would make reading and writing AIRR-compliant data trivial.
The internal representation is one side of the medal, the other is representing the AIRR data as part of either an AnnData
or MuData
container, which allows to link the data to other modalities, allows efficient file storage in h5ad
/h5mu
and enables the interaction with scanpy or other scverse tools. The current way is to dump everything in .obs
which is not ideal.
Such a file format does not yet exist for Cell and more importantly CellExpression. We could probably use some advice from this community...
Well, how about AnnData
? If you are looking for a lower level of abstraction, I think hdf5, zarr or tiledb can handle these sparse data efficiently, but @ivirshup or @gtca are the experts here. Is it really necessary to define yet another format for single-cell expression?
Note that reading/writing gene expression formats would be out-of-scope for scirpy, this should be handled by scanpy or AnnData, or possibly by a future scverse IO package.
It's also worth noting here that the CZI is currently looking into standardizing Feature Observation Matrices (FOM) for single-cell data. Maybe it would be worth for the AIRR community and them getting in touch for talking about a AIRR-specific extension. Their public presence is still very sparse, but maybe @ivirshup can tell more about this initiative and make the contact.
Best, Gregor
Such a file format does not yet exist for Cell and more importantly CellExpression. We could probably use some advice from this community...
Well, how about
AnnData
? If you are looking for a lower level of abstraction, I think hdf5, zarr or tiledb can handle these sparse data efficiently, but @ivirshup or @gtca are the experts here. Is it really necessary to define yet another format for single-cell expression?
Yes, AnnData seems logical and along the lines of what I was thinking... Definitely do not want YAF (yet another format) 8-)
Like you, this is likely out of scope for the AIRR Community - at least in the sense that the obvious thing for us to do would be to have the AIRR python library (https://github.com/airr-community/airr-standards/tree/master/lang/python) use scanpy to read and write these formats.
The CZI FOM is interesting as well...
Actually scirpy internally uses something very very similar to the Cell schema already. Every file format read into scirpy is first represented as a list of
AirrCell
objects that have acell_id
, a list ofRearrangement
entries and arbitrary cell-level attributes that are added as columns toadata.obs
. Once theCell
schema is final, I could simply adjust myAirrCell
implementation to match the cell-schema, which would make reading and writing AIRR-compliant data trivial.
Very nice - one would hope that these would be closely aligned, and hopefully we can now make sure they are completely aligned 8-)
I've been playing around with the draft implementation of awkward array support in AnnData (https://github.com/scverse/anndata/pull/647). This allows to store in .obsm
(or in the future maybe also in .obs
, see the comments there) ragged arrays of mixed types for each cell -- exactely what we need to represent the 1:n relationship of cells with chains.
An "awkward array" can be directly created from a list of AirrRearrangement dictionaries, e.g.
import awkward as ak
import scirpy as ir
airr_cells = ir.io.to_airr_cells(adata)
chains_as_awkarr = ak.Array([cell.chains for cell in airr_cells])
# an array of length 3000 (= number of cells), variable width (= number of chains) and
# AIRR Rearrangement fields on the 3rd dimension
> chains_as_awkarr
<Array [[{c_call: 'TRBC2', ... v_cigar: None}]] type='3000 * var * {"c_call": op...'>
To get, e.g. all junction_aa
sequences of the first chain of each cell, this awkward array can be sliced:
> chains_as_awkarr[:, 0, "junction_aa"]
<Array ['CASSLMRLAGDTQYF', ... 'CATEEYGNKLVF'] type='3000 * string'>
It is therefore straightforward and computationally efficient to retrieve individual elements.
In summary: We simply take AIRR compliant data, put it in an efficient data structure and store it an AnnData. Not a lot new to define from our side
The list of chains does not know the concept of primary and secondary chains. This can be solved by adding an index array:
# the index arrays point to the element in the list of chains
primary_vj = [0, 0, 1, 2, 0, 1, ...]
primary_vdj = [1, 3, 5, 3, 2, ...]
The awkward array can then be sliced
> chains_as_awkarr[:, primary_vj, "junction_aa"]
to retrieve e,g. the junction AA sequences for the primary VJ chains.
The current solution to put everything in obs was amongst others a pragmatic solution to make all AIRR data immediately available to scanpy plotting functions. This is not possible anymore with the proposed new implementation. Arguably, most of those columns are used very rarely for those purposes anyway.
If they are, here are some solutions to do so
ir.get
function to retrieve data from the awkward array (see also #184). The result can be re-assigned to adata.obs
for plotting and/or groupingir.pl.umap
, where color
is retrieved from the awkward array instead of .obs
. The current draft implementation only supports awkward arrays in .obsm
. IMO it would be completely fine to have an .obsm["airr"]
. In the future, it may be possible to have a column .obs["airr"]
.
It was also previously suggested to use mudata as AIRR data can be considered a modality. Since we don't have any data for .X
and .var
, I don't see a clear advantage of using mudata, though.
I'm looking forward to feedback, especially from people that develop ecosystem packages for AIRR analysis (@zktuong, ..., ?)
@bcorrie, this should make it possible that we merely define a mapping how the AIRR standard is represented as AnnData object rather than converting between a scirpy
and an AIRR
representation. Need to hash that out in more detail, but roughly
Cell
metadata -> adata.obs
Rearrangement
chains -> adata.obsm["airr"]
(as awkward array)CellExpression
-> adata.X
or mudata (in case of CITE-seq) The CZI FOM is interesting as well...
@bcorrie from discussion with the SOMA team, the intent was to keep anndata and SOMA (formerly FOM) pretty consistent with each other. Ideally they are the same thing eventually.
Though I haven't heard much on SOMA development recently, so hopefully we are still on the same page here.
Where to store the data
@grst, I've recently heard people wanting to treat their re-arrangment data as a separate modality. E.g. have an AnnData with just re-arrangement info which could then be merged with other modalities. I think the main point here was to avoid having to arbitrarily associate AIRR data with another modality.
I think @bio-la and @crichgriffin could elaborate on this.
I've recently heard people wanting to treat their re-arrangment data as a separate modality.
Wouldn't it anyway work with both anndata and mudata to use .obsm["airr"]
.
# anndata
ir.tl.something(adata)
# anndata, override obsm_key
ir.tl.something(adata, airr_key="ir")
# mudata
ir.tl.something(mdata.mod["AIRR"])
Or would you prefer something like
# mudata only, use `.mod["AIRR"]` by default
ir.tl.something(mdata)
# ... or override default modality
ir.tl.something(mdata, mod="myairr")
I don't remember the details, but one of the issues was with subsetting. Wanting to filter AIRR obs, but not the RNA.
I think it would work with AnnData right now, I think the issue was more about want to use scirpy with a MuData
where RNA and AIRR were in separate modalities? I think this sorta worked, but there were issues with mudata operations.
In the context of multimodal data and using MuData, I like the idea of keeping the AIRR data separate from any specific modality. My current workflow is to store the scirpy anndata object as a separate modality in muon, where I also have rna and adt data, as separate modalities. But I agree that it doesn't make a lot of sense, since the scirpy anndata object is mostly empty.
I agree with @grst that storing airr data could potentially work in both AnnData and MuData by using .obsm["airr"]
except (iirc) in AnnData .obsm
row dimensions match .obs
row dimensions, and I assume the same applies to MuData.
What would you do in the instances of cells not having AIRR data? E.g. in PBMC data, monocytes would not have airr data, but you would want to keep them in your AnnData/MuData.
What would you do in the instances of cells not having AIRR data?
fill the corresponding rows with NaN
s? This is how scirpy also handles this currently.
Although this is a good point in favor of mudata. It would also make ir.pp.merge_with_ir
superfluous.
I think it should be easy to support both. Just need to think what to advertise as default in the tutorial.
all of this sounds awesome. Just have some naive question about how well/fast does it deal with situation when there's >100k+ cells? would chains_as_awkarr
take a while to retrieve the entries?
That's a fair point, I need to try this out, but it should be very fast. Awkward array claims to be similarly scalable as numpy arrays.
Hi All, sorry for the lack of input - both fighting "the big C" and travelling to meetings (I don't recommend trying to both at the same time) the last several weeks.
I am not an AnnData expert - yet - but my observations thus far.
We are experimenting with Conga and it annotates each AnnData cell with heavy and light chain VDJ/CDR3 in the .obs of the object. Other tools like CellTypist also populate the .obs with other data.
This seems messy, and .obsm seems like a good option for adding specific annotations to a cell object on a per "tool" basis. So there might be a 'AIRR' obsm object, but also a 'Conga' and 'CellTypist' object. This would separate these cell annotations cleanly - would that be considered good practice. It seems to make sense to me that the community might encourage this???
2. Getting values for plotting
The current solution to put everything in obs was amongst others a pragmatic solution to make all AIRR data immediately available to scanpy plotting functions. This is not possible anymore with the proposed new implementation. Arguably, most of those columns are used very rarely for those purposes anyway.
Yes, it occurred to me that plotting might be challenging using obsm. I personally think this would be quite important.
I would anticipate using the obsm['airr'] to store not only rearrangement annotations, but also what we call repertoire annotations (by repertoire AIRR means study/subject/sample metadata). For example, it is quite easy for us to generate a pool of cell data from many subjects in a study, possibly across disease conditions (healthy, mild covid, severe covid). We would pool that data and create a single h5ad file. We would want the 'disease_state' and 'subject_id' to be in the AIRR obsm object so we could visualize cells based on these fields. Same for the AIRR 'tissue' field and a range of others potentially.
It seems like a bad idea to have this difficult.
A quick example - this is from a combined Conga/CellTypist analysis, with the .obs annotated with Conga and Cell Typist fields. This is trivial to visualize VDJ annotations per cell next to CellTypist majortiy voting - at the same time - because the columns are added to .obs. If they were in .obsm and the visualizations could not make use of this easily, this would be challenging...
'vb', 'jb', and 'va' are from Conga, while 'majority_voting' is from CellTypist.
I just threw these data sets together today to try and explore this data for our early experiments with CellTypist. So please be patient with the data oddities with TR call against projected B-cells. I have yet to validate anything from this image in terms of it making sense - and any and all errors are mine 8-)
The main point is that these experimentations are easy because the visualizations are easy.
@bcorrie, this should make it possible that we merely define a mapping how the AIRR standard is represented as AnnData object rather than converting between a
scirpy
and anAIRR
representation. Need to hash that out in more detail, but roughly
Cell
metadata ->adata.obs
Rearrangement
chains ->adata.obsm["airr"]
(as awkward array)CellExpression
->adata.X
or mudata (in case of CITE-seq)
This seems to make sense to me as a first pass, although I would probably suggest that adata.obsm['airr'] might contain more than just info about rearrangement chains (see https://github.com/scverse/scirpy/issues/327#issuecomment-1172887702) as AIRR Repertoire metadata (disease_state, tissue, age, sex, ...) on a cell basis will often be very valuable.
One could have adata.obsm['airr_rearrangement'] and adata.obsm['airr_repertoire'] - but maybe too clunky? Another related link that cells potentially have is to "Receptors" - which are known B/T cell receptors that have a specific antigen/epitope specificity (think of the B and T cell specificity info in IEDB). Again, maybe worth capturing.
For what it's worth, on the R side we've been using MultiAssayExperiments with a "rearrangement" experiment that includes the AIRR Rearrangement data for storing multimodal single-cell data that includes AIRR data, GEX, CITE-seq, and/or whatever people dream up. The AIRR assay data (equivalent to .X
) is stored as a BumpyMatrix, specifically a BumpyDataFrameMatrix
. We have a primary key (row names) derived from the locus
field to easily separate out IGH/TRB, but I don't think that's necessary. Sample/cell level metadata goes into the colData (.obs
) as is typical.
If I'm understanding the awkward array correctly, and I may not be, this would be the same as using the "record" array implementation to populate .X
with an awkward array in a mudata object. A pandas DataFrame with multi-indexing seems like the most natural fit for working cellular Rearrangement data (eg, key on something like ['cell_id', 'locus', 'sequence_id']
) and it looks like the conversion from awkward array to multi-indexed DataFrame is trivial. Unfortunately, my python is a bit rusty these days, so I could be misunderstanding.
PS: "We" is not the AIRR Standards WG in this case. I don't think we should have an official opinion on implementation.
.obs
vs. .obsm
vs. .X
)The AIRR assay data (equivalent to .X) is stored as a BumpyMatrix, specifically a BumpyDataFrameMatrix. We have a primary key (row names) derived from the locus field to easily separate out IGH/TRB, but I don't think that's necessary. Sample/cell level metadata goes into the colData (.obs) as is typical.
@javh this sounds pretty much like what I would imagine! Interesting to have the BumpyMatrix in .X
.
In combination with mudata this actually makes a lot of sense and would be consistent to other modalities.
I would probably suggest that adata.obsm['airr'] might contain more than just info about rearrangement chains (see https://github.com/scverse/scirpy/issues/327#issuecomment-1172887702) as AIRR Repertoire metadata (disease_state, tissue, age, sex, ...) on a cell basis will often be very valuable.
Other tools like CellTypist also populate the .obs with other data. This seems messy, and .obsm seems like a good option for adding specific annotations to a cell object on a per "tool" basis. So there might be a 'AIRR' obsm object, but also a 'Conga' and 'CellTypist' object. This would separate these cell annotations cleanly - would that be considered good practice.
@bcorrie, I'll try to summarize here a bit the (typical) purpose of the different AnnData slots.
.obs
has been designed to hold all cell-level metadata. I would stick to that and add the repertoire fields (disease state, tissue, etc.) to .obs
to make them readily available for plotting and other scanpy functions. So far, AnnData does not distinguish between columns added by different tools. I like the idea of having more structure in .obs
, but that's something we would need to discuss on the AnnData side. .obsm
has been designed to hold data matrices aligned to .obs
, e.g. UMAP or PCA. One could argue that Rearrangement data is not a data matrix either. An alternative would be to add an "awkward array column" to adata.obs
:
adata.obs["airr_rearrangement"] = ak.Array(...)
Yet another option would be to put the awkward array in .X
as @javh suggests.
The thing here is that having awkward columns requires some additional effort on the AnnData side and it is unclear how long this would take (see https://github.com/scverse/anndata/pull/647). Awkward arrays in obsm
already work, but the PR needs to be wrapped up. Awkward arrays in X
are not implemented either, but proably easier to fix than the pandas extension.
If they were in .obsm and the visualizations could not make use of this easily, this would be challenging...
I agree that cell-type, patient info etc. must be directly accessible for plotting (which they would be if they are in .obs
).
One can argue how useful a UMAP plot with V/D/J genes really is, similar to other information from the rearrangement schema. I would envisage something like this:
# if one really wants to plot rearrangement fields, add them to adata.obs explicitly
adata.obs["my_v_gene_col"] = ir.get("v_call", receptor_arm="VJ", dual_ir="primary")
sc.pl.umap(adata, color="my_v_gene_col")
That would be easy enough IMO and makes it explicit which "chains" to choose from the possibly multiple chains that are available for each cell.
@javh in R, did you use any specific package for immune receptor analysis?
@grst, I use the Immcantation packages, because I'm biased like that. :)
Regarding cell metadata in .osb
, that seems cleaner to me as well. You're going to want to be able to slice and copy cell metadata (1) across modalities (eg, cell types from the gene expression object, clonal clusters from rearrangement, etc) and (2) from the top level .obs
if you go the mudata route. Same for .osbm
data from other modalities (eg, look at clones on the UMAP from gene expression, correlate gene expression with mutation frequency / selection pressure from rearrangement, etc). "Look in .obs
for every modality" seems simplest.
@grst, I use the Immcantation packages, because I'm biased like that. :)
I'm asking because I was wondering if we can extend this effort beyond the scverse world and have a 1:1 mapping to a MultiAssayExperiment
representation.
@grst From the MuData perspective, it is more general than MultiAssayExperiment
. So it should be possible to align with how the data is stored there!
After the input from @javh and the discussion in https://github.com/scverse/anndata/pull/647, it seems cleanest to me to go the mudata
route with AIRR data stored in .X
. Prerequisite is that anndata will get awkward array support for .X
with two aligned, fixed-length axes.
.X
would then look something like this:
axis 0
holds observations (=cells) (fixed-length)axis 1
holds AIRR rearrengement variables (e.g. v_call
, duplicate_count
, ...) (fixed-length)axis 2
holds 0-n chains (variable-length)Data could be retrieved as follows:
mdata.mod["AIRR"][:, "junction_aa"].X
For what it's worth, on the R side we've been using MultiAssayExperiments with a "rearrangement" experiment that includes the AIRR Rearrangement data for storing multimodal single-cell data that includes AIRR data, GEX, CITE-seq, and/or whatever people dream up. The AIRR assay data (equivalent to .X) is stored as a BumpyMatrix, specifically a BumpyDataFrameMatrix
@javh, do you happen to have an example where this is implemented in R? Specifically the AIRR assay data + bumpy matrix part?
We are working on exporting AIRR Cell/Expression data in an anndata file, with AIRR metadata features as annotations. So not the Rearrangement fields, but rather the Repertoire metadata fields - so we know which cells have which subject/disease/etc characteristics.
I am not an expert on anndata and mudata so need to wrap my head around what we are talking about.
In the following:
axis 0 holds observations (=cells) (fixed-length) axis 1 holds AIRR rearrengement variables (e.g. v_call, duplicate_count, ...) (fixed-length)
I think these fixed lengths are the same, correct? That is for every entry on axis 0 there is an entry on axis 1? Or that
length(mdata.mod["AIRR"].obs) == length(mdata.mod["Cell"].obs)
If want all of the AIRR Data on dimension 1 I do the following since dimension 1 has a name AIRR
:
mdata.mod["AIRR"].X
And I get the original observations (cells) on dimension 0 in a similar way (and it would have a name as well)?
mdata.mod["GEX"].X
Would you suggest that these modality names are standardized for the different modalities of data, or are the arbitrary and up to the user?
In the following:
axis 0 holds observations (=cells) (fixed-length) axis 1 holds AIRR rearrengement variables (e.g. v_call, duplicate_count, ...) (fixed-length)
I think there is some confusion here. These axes refer to the axes of the matrix holding the AIRR data, not the obs_names
axes of the modalities. E.g. mdata.mod["AIRR"].X[axis_0, axis_1]
@bcorrie, from when you mentioned:
The AIRR assay data (equivalent to .X) is stored as a BumpyMatrix, specifically a BumpyDataFrameMatrix
Could you link to a vignette or some code where you are using a bumpy matrix to work with AIRR data?
axis 0 holds observations (=cells) (fixed-length) axis 1 holds AIRR rearrengement variables (e.g. v_call, duplicate_count, ...) (fixed-length)
I think these fixed lengths are the same, correct? That is for every entry on axis 0 there is an entry on axis 1? Or that length(mdata.mod["AIRR"].obs) == length(mdata.mod["Cell"].obs)
What I meant with axes here are the dimensions of one single anndata object, in this case the one holding AIRR data.
When talking about a mudata object with multiple modalities e.g. mdata.mod["AIRR"]
, mdata.mod["GEX"]
, the number of observations is not necessarily the same. For instance, GEX
could contain all sorts of immune cells, while AIRR data is only available for B/T cells. However, cells that are shared between the modalities would have the same identifier in .obs_names
.
Would you suggest that these modality names are standardized for the different modalities of data, or are the arbitrary and up to the user?
From the scirpy side, I would not require the modality to have a specific name, but AIRR
sounds like a reasonable convention. If this is becoming part of some AIRR standard, then it's anyway up to you, but I guess standardized modality names would make sense there.
EDIT: just saw Isaac beat me to an answer by half a minute :sweat_smile:
@grst, yep, here's an example using one of 10x's older public PBMC data sets (attached):
library(SummarizedExperiment)
library(BumpyMatrix)
library(airr)
# Load data into a BumpyMatrix
df <- read_rearrangement("pbmc3_ig_db-pass.tsv.gz")
locus <- factor(df$locus, levels=unique(df$locus))
bumpy <- splitAsBumpyMatrix(DataFrame(df), row=locus, column=df$cell_id, sparse=TRUE)
# Make SummarizedExperiment
se <- SummarizedExperiment(list(airr=bumpy))
colData(se)$Cell <- colnames(bumpy)
colData(se)$Sample <- "A"
# Extract BumpyMatrix assay from SummarizedExperiment
adata <- assay(se, "airr")
# All records for a locus
adata["IGH", ]
# All junctions for a locus
adata["IGH", , "junction"]
# All records for a cell
adata[, "AAACCTGAGTCGATAA-1"]
# All records for a cell and locus
adata["IGH", "AAACCTGAGTCGATAA-1"]
# V and J genes for a given cell and locus
adata["IGH", "AAACCTGAGTCGATAA-1", c("v_call", "j_call")]
# Back to DataFrame
x <- unlist(adata)
(I'm sure we could work up something more formal, but this what I had on hand.)
@grst, since it took me a sec, here's a quick way to install:
mamba create -n r-airr bioconductor-summarizedexperiment bioconductor-bumpymatrix r-tidyverse
conda activate r-airr
r # start an r session
install.packages("airr")
In something like data shape, the dimensions are: Regular(len(locus)) * Regular(len(cell)) * var * Record([sequence_id, sequence, rev_comp, ...])
.
Thanks @javh! This is again slightly different to what I had in mind. You introduce a separate dimension for locus
, whereas I would have gone for putting all chains into a single dimension, and store separate indices that allow to retrieve VJ (i.e. TRA/TRG/IGK/IGL) or VDJ (i.e. TRB/TRD/IGH) loci. Both is possible and maybe worth another discussion.
What I didn't understand yet: Is this format something you use in-house only or is it part of the AIRR stndard, some other community standard or any R package?
After discussion with @ivirshup, I'm still a bit on the fence if the awkward array with AIRR data should rather go into adata.X
or into adata.obsm["airr"]
.
Store as an awkward array, rely on adata.var
for the field names. i.e.
Advantages
mudata["airr"].X
adata.var_names
(without having to touch an awkward array)Disadvantages
This is what the array would look like:
[
# cell0: 2 chains
# locus , junction_aa
[["TRA", "TRB"], ["CADASGT...", "CTFDD..."]],
# cell1: 1 chain
[["IGH"], ["CDGFFA..."]],
# cell2: 0 chains
[[], []]
]
Store as an awkward array of records:
Record
(similar to python dict))Advantages:
This is what the array would look like:
[
# cell0: 2 chains
[{"locus": "TRA", "junction_aa": "CADASGT..."}, {"locus": "TRB", "junction_aa": "CTFDD..."}],
# cell1: 1 chain
[{"locus": "IGH", "junction_aa": "CDGFFA..."}],
# cell2: 0 chains
[],
]
Implementation-wise and for the end-user the difference between the two should be limited:
In both cases, there would be an array in obsm that allows to retrieve the VJ/VDJ chains:
[
"VJ_1": [0, None, None],
"VDJ_1": [1, 0, None],
]
Having an additional locus
or receptor_arm
dimension as proposed by @javh would be an alternative to this, and could be adapted to both version.
With the new datastructure, there will be a new module to easily retrieve the data roughly like this:
ir.get.airr(adata, "junction_aa", "VJ_1")
See also the discussion at #184.
To broaden the scope even further, looping more people into the discussion:
@ncborcherding is maintaining scRepertoire, the scirpy-equivalent in the Seurat world. Nick, maybe you could give us your perspective on how you store AIRR data in Seurat-objects?
@kmayerb and @phbradley, with TCRdist3 and conga you maintain other immune-cell receptor analysis packages in the python universe that are reasonably popular and complementary to scirpy. In an ideal world we would be using the same data structure for better interoperability. Maybe you are interested in joining this discussion.
I'm aware that this is quite a thread to read through, but the top-level comment describes what this is all about, and in the comment just before are the latest proposals of how this could be implemented.
@bcorrie, from when you mentioned:
The AIRR assay data (equivalent to .X) is stored as a BumpyMatrix, specifically a BumpyDataFrameMatrix
Could you link to a vignette or some code where you are using a bumpy matrix to work with AIRR data?
I think that was @javh that was talking about "bumpy matrix" - ahh, I see he answered already...
@kmayerb and @phbradley, with TCRdist3 and conga you maintain other immune-cell receptor analysis packages in the python universe that are reasonably popular and complementary to scirpy. In an ideal world we would be using the same data structure for better interoperability. Maybe you are interested in joining this discussion.
One of the tools we are integrating with (and therefore exporting to) is Conga - that is exactly our use case 8-)
@grst said:
You introduce a separate dimension for
locus
, whereas I would have gone for putting all chains into a single dimension, and store separate indices that allow to retrieve VJ (i.e. TRA/TRG/IGK/IGL) or VDJ (i.e. TRB/TRD/IGH) loci. Both is possible and maybe worth another discussion.
If I had my druthers, I would also have that index be the chain type instead of the locus, because that more naturally fits with how you're likely to work with the data (avoiding the awkwardness of always indexing on c("IGK", "IGL")
). Which just means doing this instead:
df <- read_rearrangement("~/Downloads/pbmc3_ig_db-pass.tsv.gz") %>%
mutate(chain=alakazam::getChain(locus))
chain <- factor(df$chain, levels=c("VH", "VL"))
bumpy <- splitAsBumpyMatrix(DataFrame(df), row=chain, column=df$cell_id, sparse=TRUE)
However, this has nomenclature problems and is very B cell centric. There's no comparable terminology to VH
and VL
for T cells, which instead uses alpha-beta and gamma-delta. Hence, there's no field in the AIRR Rearrangement standard that defines standard nomenclature for an abstract chain type. That's trivial to overcome. But, that does mean whatever your solution, it won't necessarily be consistent across tools. Eg, I have minor reservations about VDJ
vs VJ
as the chain type. We decided to use VH
and VL
in alakazam, which is also fine as a key, but still incorrect terminology in the case of TCRs.
I'm also not certain in makes sense to combine analyses for TRA and TRG.
Hence, locus
ended up being the index. Not ideal, because downstream analysis often starts with adding the chain
field. But, it doesn't suffer from any wrongness upon import either.
What I didn't understand yet: Is this format something you use in-house only or is it part of the AIRR stndard, some other community standard or any R package?
In-house only. It doesn't appear in a public package (that I'm aware of). And it is not part of the AIRR Standards efforts or the AIRR Data Commons stuff @bcorrie is talking about. So there's no need to conform to this data structure if you have something better in mind.
Very interesting discussion and thanks for looping me in!
As of right now, scRepertoire is appending a portion of the filtered TCR/BCR alignments to the meta data of a single cell object (either Seurat or SingleCellExperiment). It is not necessarily ideal as it does not preserve AIRR format or other information that users have been requesting, like cdr1/2 sequences.
In terms of implementing a change for consistency - from the R side of things, SingleCellExperiment functions as a SummarizeExperiment and is similar to the python equivalent. However, I don't think Seurat data format would be compatible with the array you are proposing. But I can do some investigating
I decided to go with the adata.obsm
variant described in https://github.com/scverse/scirpy/issues/327#issuecomment-1238067096, storing all chains in a single dimension rather than making an additional dimension for loci.
This makes IO agnostic of loci, which aren't standardized or even a mandatory field in rearrangement data.
Calling "primary" and "secondary" chains would then happen in a separate step, which would also allow to implement different strategies for chain ranking in the future.
If anyone has reservations against this approach, now would be a good time to speak up, otherwise it might be too late.
I'm finally making some progress on the new datastructure (#356). Here is how it is currently working out:
IO functions read in data into an awkward array in adata.obsm["airr"]
. At this point no filtering happens, all information from an AIRR rearrangement file is preserved.
# adata.obsm["airr"]
[
# cell0: 2 chains
[{"locus": "TRA", "junction_aa": "CADASGT..."}, {"locus": "TRB", "junction_aa": "CTFDD..."}],
# cell1: 1 chain
[{"locus": "IGH", "junction_aa": "CDGFFA..."}],
# cell2: 0 chains
[],
]
The new function scirpy.pp.index_chains
is called. It adds another awkward array to adata.obsm["chain_indices"]
. This function applies what is described as the scirpy receptor model, i.e. it flags multichain cells, and calls primary and secondary VJ and VDJ cells.
# adata.obsm["chain_indices"]
[
{"VJ": [0], "VDJ": [1], "multichain": False}, # single pair
{"VJ": [0, 2], "VDJ": [1,3], "multichain": False}, # dual IR
]
The beauty of this is that this can easily be adapted to other receptor models. For instance, the index_chains
function can be called with productive=False
to keep non-productive chains. It would also be thinkable to develop a completely different receptor model with an arbitrary number of chains per obs
. This could be relevant for spatial or bulk TCR data. This would require new analysis functions in scirpy (or another library), but it all fits into the same data structure.
Get AIRR data by slicing the adata.obsm["airr"]
with the indices in adata.obsm["chain_indices"]
. For convenience, there is now a helper function that allows to request one, or multiple columns from the AIRR data, e.g.
>>> scirpy.get.airr(adata, "locus", "VJ_1")
AAACCTGAGAGTGAGA-1 TRA
AAACCTGAGGCATTGG-1 TRA
AAACCTGCACCTCGTT-1 None
...
Work with MuData. MuData will be the new recommended way of combining gene expression and AIRR data.
mdata = md.MuData({"airr": adata_airr, 'gex': adata_gex})
ir.tl.chain_qc(mdata["airr"])
I still need to iron out a few details, and most importantly, update the documentation and tutorial to reflect all these changes. There will be an automatic check and a conversion function to transfer data from the previous format into the new data structure.
The new datastructure is now available in the v0.13.0rc1
pre-release :tada:!
I'm happy about everyone who tries it out and gives feedback!
You can install it using
pip install --upgrade --pre scirpy
Please also check out
@grst this is amazing thank you so much for implementing this ❤️ I can see in the tutorials how I can access Rearrangement
fields which is excellent. What about Subject
or Sample
fields from AIRR such as the age
of donor/patient or disease_diagnosis
etc; metadata that would be the same for a subset or all of the cells under obs
Happy to open a PR to make this happen 🙂
I agree - great to see this moving forward - thanks @grst and others.
@gszep we talked about Repertoire metadata earlier (https://github.com/scverse/scirpy/issues/327#issuecomment-1172887702) but I think currently scverse libraries only load AIRR Rearrangements. Would love to hear that I am wrong, but I believe that to be the case.
There is an example in the docs (https://scverse.org/scirpy/tags/v0.13.0rc1/tutorials/tutorial_io.html#Combining-multiple-samples) where multiple samples are combined, and the equivalent of the AIRR sample_id is set for multiple samples in the obs data. In this case the obs_name is used to assign a sample obs.
So the trick would be to load in an AIRR Repertoire file and process it to do something similar based on the repertoire_id in the Rearrangement data.
One day soon I hope to get to doing something like this for the Single Cell downloads from the iReceptor Gateway. Ideally one could request an h5ad download of multiple samples and get a single h5ad file with all the GEX, Cell, Rearrangement, and Repertoire metadata embedded in that single h5ad file. Currently when you do a download you get 4 separate files, one for each type of data.
Thank you for your reply.
I am still learning the best practices for anndata. Say we have a usecase of 20 repertoires each with 100,000 cells or so.
In this case, would it make sense to have 20 Anndata objects and in each anndata object there would be the 100,000 'Rearrangement' rows stored under the airr
path and the AIRR Repertoire
stored in uns
or similar?
I am unfortunately not an anndata/scirpy expert either, I come from the AIRR side. My simple answer would be it depends on if you want to compare the repertoires. If yes, then having them in a single object is probably best (as in the example - you have to worry about batch effects). That way you can slice and dice across any of the metadata fields from the Repertoire.
I don't yet have any experience with how well scirpy and related tools will scale when you are throwing together data sets of this size. Any scirpy experts want to comment.
Thanks. Scaling is my main concern. If I want to filter my Repertoires before using their Rearrangements I'm not sure it's efficient to use a single object. I guess one might be able to use the concatenate feature which will put multiple AnnData objects into a single object if you require it. It would be great if someone from the scirpy side could offer a more informed view than mine!
Yes, I would guess keeping them separate until you want to compare them, then concatenate them as required for specific comparisons. Still an interesting scalability question as to how many samples can you practically compare in a single scirpy object.
Now that scirpy is part of scverse, we could think of an improved data structure for scAIRR data. See also the discussion at https://github.com/theislab/scanpy/issues/1387.
The challenge with scAIRR data is that
1
cell can haven
chains. Up to four of them are biologically meaningful but there could be more for technical reasons.The current pragmatic solution is to store all fields in
adata.obs
.adata.obs
. Also serializing excess chains is not really elegant.New options are
adata.obs
.The new representation should also aim at being a community standard for the scverse ecosystem and should build upon the AIRR rearrangement standard. Ideally, we could get additional stakeholders onboard, including conga, dandelion, tcrdist3 and possibly members of the AIRR community.