scverse / scirpy

A scanpy extension to analyse single-cell TCR and BCR data.
https://scirpy.scverse.org/en/latest/
BSD 3-Clause "New" or "Revised" License
205 stars 32 forks source link

scverse datastructure for AIRR data #327

Closed grst closed 1 year ago

grst commented 2 years ago

Now that scirpy is part of scverse, we could think of an improved data structure for scAIRR data. See also the discussion at https://github.com/theislab/scanpy/issues/1387.

The challenge with scAIRR data is that

The current pragmatic solution is to store all fields in adata.obs.

New options are

The new representation should also aim at being a community standard for the scverse ecosystem and should build upon the AIRR rearrangement standard. Ideally, we could get additional stakeholders onboard, including conga, dandelion, tcrdist3 and possibly members of the AIRR community.

bcorrie commented 2 years ago

Hi All, I am active in the AIRR Standards group, can try and answer questions. We are looking at integrating scirpy into the iReceptor Gateway (gateway.ireceptor.org) to do analysis on AIRR Cell data - so have a vested interest... 8-)

The AIRR Cell schema was just announce at the AIRR meeting this week (https://www.antibodysociety.org/the-airr-community/meetings/airr-community-meeting-vi-exploring-new-frontiers/), with a tagged (v1.4) release expected on github in the next few weeks. This is an "Experimental" schema, meaning that we think it is pretty stable, but will probably need some minor changes before it is finalized, probably in an AIRR Standard v2.0 release sometime in the "near" future.

bcorrie commented 2 years ago

The Cell schema is intended to represent a Cell as an object. It is a light weight object, and contains a cell_id that is used to cross reference information in other collections such as Rearrangement and CellExpression. So if you have a cell_id from a Cell you can find all of the Rearrangements for the paired chains associated with the cell in an AIRR TSV file by searching for that same cell_id. Similarly you can find all of the CellExpression information by searching a CellExpression file for that same cell_id.

bcorrie commented 2 years ago

Here are the current docs on the master branch, which will be tagged with a v1.4 release soon.

https://docs.airr-community.org/en/latest/datarep/cell.html

bcorrie commented 2 years ago

One of the outstanding things we need to resolve in the AIRR Standard is what if any "efficient" file format should be supported. Currently the default standard for all AIRR objects is a JSON representation of the standard objects. For Rearrangement there is a defined TSV file. Such a file format does not yet exist for Cell and more importantly CellExpression. We could probably use some advice from this community...

We have an about to be release version of our AIRR repository (https://github.com/sfu-ireceptor/turnkey-service-php/tree/production-v4) that can store Cell and CellExpression data, but currently what you get out is a very big, verbose JSON file. It works, but...

bcorrie commented 2 years ago

BTW, I think it would be interesting to decorate a cell's adata.obs with AIRR Repertoire metadata. That is, it would be quite easy (I think) to decorate the adata.obs with all sorts of subject/sample metadata for each cell such as subject.disease.disease_daignosis, subject.age_min, subject.sex, sample.tissue etc so that one can use these criteria for coloring, group, etc as well. It seems to me that these fields in adata.obs would be more valuable than the most of the Rearrangement fields. I could see wanting some Rearrangement fields, maybe VDJ calls, CDR3 length, locus for the paired chains, but it seems like most of the Rearrangement fields would not be very interesting from a slicing/coloring/visualization point of view.

grst commented 2 years ago

Hi @bcorrie,

thanks for reaching out, I'm definitely interested in working with the AIRR community to make this as interoperable as possible. Your description of the Cell schema is really helpful, before I wasn't sure it was to replace the Rearrangement schema, but now I understood it's built on top of it.

Actually scirpy internally uses something very very similar to the Cell schema already. Every file format read into scirpy is first represented as a list of AirrCell objects that have a cell_id, a list of Rearrangement entries and arbitrary cell-level attributes that are added as columns to adata.obs. Once the Cell schema is final, I could simply adjust my AirrCell implementation to match the cell-schema, which would make reading and writing AIRR-compliant data trivial.

The internal representation is one side of the medal, the other is representing the AIRR data as part of either an AnnData or MuData container, which allows to link the data to other modalities, allows efficient file storage in h5ad/h5mu and enables the interaction with scanpy or other scverse tools. The current way is to dump everything in .obs which is not ideal.


Such a file format does not yet exist for Cell and more importantly CellExpression. We could probably use some advice from this community...

Well, how about AnnData? If you are looking for a lower level of abstraction, I think hdf5, zarr or tiledb can handle these sparse data efficiently, but @ivirshup or @gtca are the experts here. Is it really necessary to define yet another format for single-cell expression?

Note that reading/writing gene expression formats would be out-of-scope for scirpy, this should be handled by scanpy or AnnData, or possibly by a future scverse IO package.

It's also worth noting here that the CZI is currently looking into standardizing Feature Observation Matrices (FOM) for single-cell data. Maybe it would be worth for the AIRR community and them getting in touch for talking about a AIRR-specific extension. Their public presence is still very sparse, but maybe @ivirshup can tell more about this initiative and make the contact.

Best, Gregor

bcorrie commented 2 years ago

Such a file format does not yet exist for Cell and more importantly CellExpression. We could probably use some advice from this community...

Well, how about AnnData? If you are looking for a lower level of abstraction, I think hdf5, zarr or tiledb can handle these sparse data efficiently, but @ivirshup or @gtca are the experts here. Is it really necessary to define yet another format for single-cell expression?

Yes, AnnData seems logical and along the lines of what I was thinking... Definitely do not want YAF (yet another format) 8-)

Like you, this is likely out of scope for the AIRR Community - at least in the sense that the obvious thing for us to do would be to have the AIRR python library (https://github.com/airr-community/airr-standards/tree/master/lang/python) use scanpy to read and write these formats.

The CZI FOM is interesting as well...

bcorrie commented 2 years ago

Actually scirpy internally uses something very very similar to the Cell schema already. Every file format read into scirpy is first represented as a list of AirrCell objects that have a cell_id, a list of Rearrangement entries and arbitrary cell-level attributes that are added as columns to adata.obs. Once the Cell schema is final, I could simply adjust my AirrCell implementation to match the cell-schema, which would make reading and writing AIRR-compliant data trivial.

Very nice - one would hope that these would be closely aligned, and hopefully we can now make sure they are completely aligned 8-)

grst commented 2 years ago

Draft proposal for an awkward-array-based data structure

I've been playing around with the draft implementation of awkward array support in AnnData (https://github.com/scverse/anndata/pull/647). This allows to store in .obsm (or in the future maybe also in .obs, see the comments there) ragged arrays of mixed types for each cell -- exactely what we need to represent the 1:n relationship of cells with chains.

An "awkward array" can be directly created from a list of AirrRearrangement dictionaries, e.g.

import awkward as ak
import scirpy as ir
airr_cells = ir.io.to_airr_cells(adata)
chains_as_awkarr = ak.Array([cell.chains for cell in airr_cells])
# an array of length 3000 (= number of cells), variable width (= number of chains) and 
# AIRR Rearrangement fields on the 3rd dimension
> chains_as_awkarr
<Array [[{c_call: 'TRBC2', ... v_cigar: None}]] type='3000 * var * {"c_call": op...'>

To get, e.g. all junction_aa sequences of the first chain of each cell, this awkward array can be sliced:

> chains_as_awkarr[:, 0, "junction_aa"]
<Array ['CASSLMRLAGDTQYF', ... 'CATEEYGNKLVF'] type='3000 * string'>

It is therefore straightforward and computationally efficient to retrieve individual elements.

In summary: We simply take AIRR compliant data, put it in an efficient data structure and store it an AnnData. Not a lot new to define from our side

Caveats and possible solutions

1. Primary and Secondary chains

The list of chains does not know the concept of primary and secondary chains. This can be solved by adding an index array:

# the index arrays point to the element in the list of chains
primary_vj = [0, 0, 1, 2, 0, 1, ...]
primary_vdj = [1, 3, 5, 3, 2, ...]

The awkward array can then be sliced

> chains_as_awkarr[:, primary_vj, "junction_aa"]

to retrieve e,g. the junction AA sequences for the primary VJ chains.

2. Getting values for plotting

The current solution to put everything in obs was amongst others a pragmatic solution to make all AIRR data immediately available to scanpy plotting functions. This is not possible anymore with the proposed new implementation. Arguably, most of those columns are used very rarely for those purposes anyway.

If they are, here are some solutions to do so

3. Where to store the data

The current draft implementation only supports awkward arrays in .obsm. IMO it would be completely fine to have an .obsm["airr"]. In the future, it may be possible to have a column .obs["airr"].

It was also previously suggested to use mudata as AIRR data can be considered a modality. Since we don't have any data for .X and .var, I don't see a clear advantage of using mudata, though.


I'm looking forward to feedback, especially from people that develop ecosystem packages for AIRR analysis (@zktuong, ..., ?)

grst commented 2 years ago

@bcorrie, this should make it possible that we merely define a mapping how the AIRR standard is represented as AnnData object rather than converting between a scirpy and an AIRR representation. Need to hash that out in more detail, but roughly

ivirshup commented 2 years ago

The CZI FOM is interesting as well...

@bcorrie from discussion with the SOMA team, the intent was to keep anndata and SOMA (formerly FOM) pretty consistent with each other. Ideally they are the same thing eventually.

Though I haven't heard much on SOMA development recently, so hopefully we are still on the same page here.

Where to store the data

@grst, I've recently heard people wanting to treat their re-arrangment data as a separate modality. E.g. have an AnnData with just re-arrangement info which could then be merged with other modalities. I think the main point here was to avoid having to arbitrarily associate AIRR data with another modality.

I think @bio-la and @crichgriffin could elaborate on this.

grst commented 2 years ago

I've recently heard people wanting to treat their re-arrangment data as a separate modality.

Wouldn't it anyway work with both anndata and mudata to use .obsm["airr"].

# anndata
ir.tl.something(adata)
# anndata, override obsm_key
ir.tl.something(adata, airr_key="ir")
# mudata
ir.tl.something(mdata.mod["AIRR"])

Or would you prefer something like

# mudata only, use `.mod["AIRR"]` by default
ir.tl.something(mdata)
# ... or override default modality
ir.tl.something(mdata, mod="myairr")
ivirshup commented 2 years ago

I don't remember the details, but one of the issues was with subsetting. Wanting to filter AIRR obs, but not the RNA.

I think it would work with AnnData right now, I think the issue was more about want to use scirpy with a MuData where RNA and AIRR were in separate modalities? I think this sorta worked, but there were issues with mudata operations.

crichgriffin commented 2 years ago

In the context of multimodal data and using MuData, I like the idea of keeping the AIRR data separate from any specific modality. My current workflow is to store the scirpy anndata object as a separate modality in muon, where I also have rna and adt data, as separate modalities. But I agree that it doesn't make a lot of sense, since the scirpy anndata object is mostly empty.

I agree with @grst that storing airr data could potentially work in both AnnData and MuData by using .obsm["airr"] except (iirc) in AnnData .obsm row dimensions match .obs row dimensions, and I assume the same applies to MuData. What would you do in the instances of cells not having AIRR data? E.g. in PBMC data, monocytes would not have airr data, but you would want to keep them in your AnnData/MuData.

grst commented 2 years ago

What would you do in the instances of cells not having AIRR data?

fill the corresponding rows with NaNs? This is how scirpy also handles this currently. Although this is a good point in favor of mudata. It would also make ir.pp.merge_with_ir superfluous.

I think it should be easy to support both. Just need to think what to advertise as default in the tutorial.

zktuong commented 2 years ago

all of this sounds awesome. Just have some naive question about how well/fast does it deal with situation when there's >100k+ cells? would chains_as_awkarr take a while to retrieve the entries?

grst commented 2 years ago

That's a fair point, I need to try this out, but it should be very fast. Awkward array claims to be similarly scalable as numpy arrays.

bcorrie commented 2 years ago

Hi All, sorry for the lack of input - both fighting "the big C" and travelling to meetings (I don't recommend trying to both at the same time) the last several weeks.

I am not an AnnData expert - yet - but my observations thus far.

We are experimenting with Conga and it annotates each AnnData cell with heavy and light chain VDJ/CDR3 in the .obs of the object. Other tools like CellTypist also populate the .obs with other data.

This seems messy, and .obsm seems like a good option for adding specific annotations to a cell object on a per "tool" basis. So there might be a 'AIRR' obsm object, but also a 'Conga' and 'CellTypist' object. This would separate these cell annotations cleanly - would that be considered good practice. It seems to make sense to me that the community might encourage this???

bcorrie commented 2 years ago

2. Getting values for plotting

The current solution to put everything in obs was amongst others a pragmatic solution to make all AIRR data immediately available to scanpy plotting functions. This is not possible anymore with the proposed new implementation. Arguably, most of those columns are used very rarely for those purposes anyway.

Yes, it occurred to me that plotting might be challenging using obsm. I personally think this would be quite important.

I would anticipate using the obsm['airr'] to store not only rearrangement annotations, but also what we call repertoire annotations (by repertoire AIRR means study/subject/sample metadata). For example, it is quite easy for us to generate a pool of cell data from many subjects in a study, possibly across disease conditions (healthy, mild covid, severe covid). We would pool that data and create a single h5ad file. We would want the 'disease_state' and 'subject_id' to be in the AIRR obsm object so we could visualize cells based on these fields. Same for the AIRR 'tissue' field and a range of others potentially.

It seems like a bad idea to have this difficult.

bcorrie commented 2 years ago

A quick example - this is from a combined Conga/CellTypist analysis, with the .obs annotated with Conga and Cell Typist fields. This is trivial to visualize VDJ annotations per cell next to CellTypist majortiy voting - at the same time - because the columns are added to .obs. If they were in .obsm and the visualizations could not make use of this easily, this would be challenging...

image

'vb', 'jb', and 'va' are from Conga, while 'majority_voting' is from CellTypist.

I just threw these data sets together today to try and explore this data for our early experiments with CellTypist. So please be patient with the data oddities with TR call against projected B-cells. I have yet to validate anything from this image in terms of it making sense - and any and all errors are mine 8-)

The main point is that these experimentations are easy because the visualizations are easy.

bcorrie commented 2 years ago

@bcorrie, this should make it possible that we merely define a mapping how the AIRR standard is represented as AnnData object rather than converting between a scirpy and an AIRR representation. Need to hash that out in more detail, but roughly

  • Cell metadata -> adata.obs
  • Rearrangement chains -> adata.obsm["airr"] (as awkward array)
  • CellExpression -> adata.X or mudata (in case of CITE-seq)

This seems to make sense to me as a first pass, although I would probably suggest that adata.obsm['airr'] might contain more than just info about rearrangement chains (see https://github.com/scverse/scirpy/issues/327#issuecomment-1172887702) as AIRR Repertoire metadata (disease_state, tissue, age, sex, ...) on a cell basis will often be very valuable.

One could have adata.obsm['airr_rearrangement'] and adata.obsm['airr_repertoire'] - but maybe too clunky? Another related link that cells potentially have is to "Receptors" - which are known B/T cell receptors that have a specific antigen/epitope specificity (think of the B and T cell specificity info in IEDB). Again, maybe worth capturing.

javh commented 2 years ago

For what it's worth, on the R side we've been using MultiAssayExperiments with a "rearrangement" experiment that includes the AIRR Rearrangement data for storing multimodal single-cell data that includes AIRR data, GEX, CITE-seq, and/or whatever people dream up. The AIRR assay data (equivalent to .X) is stored as a BumpyMatrix, specifically a BumpyDataFrameMatrix. We have a primary key (row names) derived from the locus field to easily separate out IGH/TRB, but I don't think that's necessary. Sample/cell level metadata goes into the colData (.obs) as is typical.

If I'm understanding the awkward array correctly, and I may not be, this would be the same as using the "record" array implementation to populate .X with an awkward array in a mudata object. A pandas DataFrame with multi-indexing seems like the most natural fit for working cellular Rearrangement data (eg, key on something like ['cell_id', 'locus', 'sequence_id']) and it looks like the conversion from awkward array to multi-indexed DataFrame is trivial. Unfortunately, my python is a bit rusty these days, so I could be misunderstanding.

PS: "We" is not the AIRR Standards WG in this case. I don't think we should have an official opinion on implementation.

grst commented 2 years ago

where to store the data (.obs vs. .obsm vs. .X)

The AIRR assay data (equivalent to .X) is stored as a BumpyMatrix, specifically a BumpyDataFrameMatrix. We have a primary key (row names) derived from the locus field to easily separate out IGH/TRB, but I don't think that's necessary. Sample/cell level metadata goes into the colData (.obs) as is typical.

@javh this sounds pretty much like what I would imagine! Interesting to have the BumpyMatrix in .X. In combination with mudata this actually makes a lot of sense and would be consistent to other modalities.

I would probably suggest that adata.obsm['airr'] might contain more than just info about rearrangement chains (see https://github.com/scverse/scirpy/issues/327#issuecomment-1172887702) as AIRR Repertoire metadata (disease_state, tissue, age, sex, ...) on a cell basis will often be very valuable.

Other tools like CellTypist also populate the .obs with other data. This seems messy, and .obsm seems like a good option for adding specific annotations to a cell object on a per "tool" basis. So there might be a 'AIRR' obsm object, but also a 'Conga' and 'CellTypist' object. This would separate these cell annotations cleanly - would that be considered good practice.

@bcorrie, I'll try to summarize here a bit the (typical) purpose of the different AnnData slots.


accessing data for plotting

If they were in .obsm and the visualizations could not make use of this easily, this would be challenging...

I agree that cell-type, patient info etc. must be directly accessible for plotting (which they would be if they are in .obs). One can argue how useful a UMAP plot with V/D/J genes really is, similar to other information from the rearrangement schema. I would envisage something like this:

# if one really wants to plot rearrangement fields, add them to adata.obs explicitly
adata.obs["my_v_gene_col"] = ir.get("v_call", receptor_arm="VJ", dual_ir="primary")
sc.pl.umap(adata, color="my_v_gene_col")

That would be easy enough IMO and makes it explicit which "chains" to choose from the possibly multiple chains that are available for each cell.

grst commented 2 years ago

@javh in R, did you use any specific package for immune receptor analysis?

javh commented 2 years ago

@grst, I use the Immcantation packages, because I'm biased like that. :)

Regarding cell metadata in .osb, that seems cleaner to me as well. You're going to want to be able to slice and copy cell metadata (1) across modalities (eg, cell types from the gene expression object, clonal clusters from rearrangement, etc) and (2) from the top level .obs if you go the mudata route. Same for .osbm data from other modalities (eg, look at clones on the UMAP from gene expression, correlate gene expression with mutation frequency / selection pressure from rearrangement, etc). "Look in .obs for every modality" seems simplest.

grst commented 2 years ago

@grst, I use the Immcantation packages, because I'm biased like that. :)

I'm asking because I was wondering if we can extend this effort beyond the scverse world and have a 1:1 mapping to a MultiAssayExperiment representation.

gtca commented 2 years ago

@grst From the MuData perspective, it is more general than MultiAssayExperiment. So it should be possible to align with how the data is stored there!

grst commented 2 years ago

After the input from @javh and the discussion in https://github.com/scverse/anndata/pull/647, it seems cleanest to me to go the mudata route with AIRR data stored in .X. Prerequisite is that anndata will get awkward array support for .X with two aligned, fixed-length axes.

.X would then look something like this:

data sketch

Data could be retrieved as follows:

mdata.mod["AIRR"][:, "junction_aa"].X
grst commented 1 year ago

For what it's worth, on the R side we've been using MultiAssayExperiments with a "rearrangement" experiment that includes the AIRR Rearrangement data for storing multimodal single-cell data that includes AIRR data, GEX, CITE-seq, and/or whatever people dream up. The AIRR assay data (equivalent to .X) is stored as a BumpyMatrix, specifically a BumpyDataFrameMatrix

@javh, do you happen to have an example where this is implemented in R? Specifically the AIRR assay data + bumpy matrix part?

bcorrie commented 1 year ago

We are working on exporting AIRR Cell/Expression data in an anndata file, with AIRR metadata features as annotations. So not the Rearrangement fields, but rather the Repertoire metadata fields - so we know which cells have which subject/disease/etc characteristics.

I am not an expert on anndata and mudata so need to wrap my head around what we are talking about.

In the following:

axis 0 holds observations (=cells) (fixed-length) axis 1 holds AIRR rearrengement variables (e.g. v_call, duplicate_count, ...) (fixed-length)

I think these fixed lengths are the same, correct? That is for every entry on axis 0 there is an entry on axis 1? Or that length(mdata.mod["AIRR"].obs) == length(mdata.mod["Cell"].obs)

If want all of the AIRR Data on dimension 1 I do the following since dimension 1 has a name AIRR:

mdata.mod["AIRR"].X

And I get the original observations (cells) on dimension 0 in a similar way (and it would have a name as well)?

mdata.mod["GEX"].X

Would you suggest that these modality names are standardized for the different modalities of data, or are the arbitrary and up to the user?

ivirshup commented 1 year ago

In the following:

axis 0 holds observations (=cells) (fixed-length) axis 1 holds AIRR rearrengement variables (e.g. v_call, duplicate_count, ...) (fixed-length)

I think there is some confusion here. These axes refer to the axes of the matrix holding the AIRR data, not the obs_names axes of the modalities. E.g. mdata.mod["AIRR"].X[axis_0, axis_1]


@bcorrie, from when you mentioned:

The AIRR assay data (equivalent to .X) is stored as a BumpyMatrix, specifically a BumpyDataFrameMatrix

Could you link to a vignette or some code where you are using a bumpy matrix to work with AIRR data?

grst commented 1 year ago

axis 0 holds observations (=cells) (fixed-length) axis 1 holds AIRR rearrengement variables (e.g. v_call, duplicate_count, ...) (fixed-length)

I think these fixed lengths are the same, correct? That is for every entry on axis 0 there is an entry on axis 1? Or that length(mdata.mod["AIRR"].obs) == length(mdata.mod["Cell"].obs)

What I meant with axes here are the dimensions of one single anndata object, in this case the one holding AIRR data.

When talking about a mudata object with multiple modalities e.g. mdata.mod["AIRR"], mdata.mod["GEX"], the number of observations is not necessarily the same. For instance, GEX could contain all sorts of immune cells, while AIRR data is only available for B/T cells. However, cells that are shared between the modalities would have the same identifier in .obs_names.

Would you suggest that these modality names are standardized for the different modalities of data, or are the arbitrary and up to the user?

From the scirpy side, I would not require the modality to have a specific name, but AIRR sounds like a reasonable convention. If this is becoming part of some AIRR standard, then it's anyway up to you, but I guess standardized modality names would make sense there.

EDIT: just saw Isaac beat me to an answer by half a minute :sweat_smile:

javh commented 1 year ago

@grst, yep, here's an example using one of 10x's older public PBMC data sets (attached):

library(SummarizedExperiment)
library(BumpyMatrix)
library(airr)

# Load data into a BumpyMatrix
df <- read_rearrangement("pbmc3_ig_db-pass.tsv.gz")
locus <- factor(df$locus, levels=unique(df$locus))
bumpy <- splitAsBumpyMatrix(DataFrame(df), row=locus, column=df$cell_id, sparse=TRUE)

# Make SummarizedExperiment
se <- SummarizedExperiment(list(airr=bumpy))
colData(se)$Cell <- colnames(bumpy)
colData(se)$Sample <- "A"

# Extract BumpyMatrix assay from SummarizedExperiment
adata <- assay(se, "airr")

# All records for a locus
adata["IGH", ]

# All junctions for a locus
adata["IGH", , "junction"]

# All records for a cell
adata[, "AAACCTGAGTCGATAA-1"]

# All records for a cell and locus
adata["IGH", "AAACCTGAGTCGATAA-1"]

# V and J genes for a given cell and locus
adata["IGH", "AAACCTGAGTCGATAA-1", c("v_call", "j_call")]

# Back to DataFrame
x <- unlist(adata)

pbmc3_ig_db-pass.tsv.gz

(I'm sure we could work up something more formal, but this what I had on hand.)

ivirshup commented 1 year ago

@grst, since it took me a sec, here's a quick way to install:

mamba create -n r-airr bioconductor-summarizedexperiment bioconductor-bumpymatrix r-tidyverse
conda activate r-airr

r # start an r session
install.packages("airr")

In something like data shape, the dimensions are: Regular(len(locus)) * Regular(len(cell)) * var * Record([sequence_id, sequence, rev_comp, ...]).

grst commented 1 year ago

Thanks @javh! This is again slightly different to what I had in mind. You introduce a separate dimension for locus, whereas I would have gone for putting all chains into a single dimension, and store separate indices that allow to retrieve VJ (i.e. TRA/TRG/IGK/IGL) or VDJ (i.e. TRB/TRD/IGH) loci. Both is possible and maybe worth another discussion.

What I didn't understand yet: Is this format something you use in-house only or is it part of the AIRR stndard, some other community standard or any R package?


After discussion with @ivirshup, I'm still a bit on the fence if the awkward array with AIRR data should rather go into adata.X or into adata.obsm["airr"].

adata.X

Store as an awkward array, rely on adata.var for the field names. i.e.

Advantages

Disadvantages

This is what the array would look like:

[
    # cell0: 2 chains
    # locus        , junction_aa
    [["TRA", "TRB"], ["CADASGT...", "CTFDD..."]],
    # cell1: 1 chain
    [["IGH"], ["CDGFFA..."]],
    # cell2: 0 chains
    [[], []]
]

adata.obsm

Store as an awkward array of records:

Advantages:

This is what the array would look like:

[
    # cell0: 2 chains
    [{"locus": "TRA", "junction_aa": "CADASGT..."}, {"locus": "TRB", "junction_aa": "CTFDD..."}],
    # cell1: 1 chain
    [{"locus": "IGH", "junction_aa": "CDGFFA..."}],
    # cell2: 0 chains
    [],
]

Consequences

Implementation-wise and for the end-user the difference between the two should be limited:

grst commented 1 year ago

To broaden the scope even further, looping more people into the discussion:

@ncborcherding is maintaining scRepertoire, the scirpy-equivalent in the Seurat world. Nick, maybe you could give us your perspective on how you store AIRR data in Seurat-objects?

@kmayerb and @phbradley, with TCRdist3 and conga you maintain other immune-cell receptor analysis packages in the python universe that are reasonably popular and complementary to scirpy. In an ideal world we would be using the same data structure for better interoperability. Maybe you are interested in joining this discussion.

I'm aware that this is quite a thread to read through, but the top-level comment describes what this is all about, and in the comment just before are the latest proposals of how this could be implemented.

bcorrie commented 1 year ago

@bcorrie, from when you mentioned:

The AIRR assay data (equivalent to .X) is stored as a BumpyMatrix, specifically a BumpyDataFrameMatrix

Could you link to a vignette or some code where you are using a bumpy matrix to work with AIRR data?

I think that was @javh that was talking about "bumpy matrix" - ahh, I see he answered already...

bcorrie commented 1 year ago

@kmayerb and @phbradley, with TCRdist3 and conga you maintain other immune-cell receptor analysis packages in the python universe that are reasonably popular and complementary to scirpy. In an ideal world we would be using the same data structure for better interoperability. Maybe you are interested in joining this discussion.

One of the tools we are integrating with (and therefore exporting to) is Conga - that is exactly our use case 8-)

javh commented 1 year ago

@grst said:

You introduce a separate dimension for locus, whereas I would have gone for putting all chains into a single dimension, and store separate indices that allow to retrieve VJ (i.e. TRA/TRG/IGK/IGL) or VDJ (i.e. TRB/TRD/IGH) loci. Both is possible and maybe worth another discussion.

If I had my druthers, I would also have that index be the chain type instead of the locus, because that more naturally fits with how you're likely to work with the data (avoiding the awkwardness of always indexing on c("IGK", "IGL")). Which just means doing this instead:

df <- read_rearrangement("~/Downloads/pbmc3_ig_db-pass.tsv.gz") %>%
    mutate(chain=alakazam::getChain(locus))
chain <- factor(df$chain, levels=c("VH", "VL"))
bumpy <- splitAsBumpyMatrix(DataFrame(df), row=chain, column=df$cell_id, sparse=TRUE)

However, this has nomenclature problems and is very B cell centric. There's no comparable terminology to VH and VL for T cells, which instead uses alpha-beta and gamma-delta. Hence, there's no field in the AIRR Rearrangement standard that defines standard nomenclature for an abstract chain type. That's trivial to overcome. But, that does mean whatever your solution, it won't necessarily be consistent across tools. Eg, I have minor reservations about VDJ vs VJ as the chain type. We decided to use VH and VL in alakazam, which is also fine as a key, but still incorrect terminology in the case of TCRs.

I'm also not certain in makes sense to combine analyses for TRA and TRG.

Hence, locus ended up being the index. Not ideal, because downstream analysis often starts with adding the chain field. But, it doesn't suffer from any wrongness upon import either.

What I didn't understand yet: Is this format something you use in-house only or is it part of the AIRR stndard, some other community standard or any R package?

In-house only. It doesn't appear in a public package (that I'm aware of). And it is not part of the AIRR Standards efforts or the AIRR Data Commons stuff @bcorrie is talking about. So there's no need to conform to this data structure if you have something better in mind.

ncborcherding commented 1 year ago

Very interesting discussion and thanks for looping me in!

As of right now, scRepertoire is appending a portion of the filtered TCR/BCR alignments to the meta data of a single cell object (either Seurat or SingleCellExperiment). It is not necessarily ideal as it does not preserve AIRR format or other information that users have been requesting, like cdr1/2 sequences.

In terms of implementing a change for consistency - from the R side of things, SingleCellExperiment functions as a SummarizeExperiment and is similar to the python equivalent. However, I don't think Seurat data format would be compatible with the array you are proposing. But I can do some investigating

grst commented 1 year ago

I decided to go with the adata.obsm variant described in https://github.com/scverse/scirpy/issues/327#issuecomment-1238067096, storing all chains in a single dimension rather than making an additional dimension for loci. This makes IO agnostic of loci, which aren't standardized or even a mandatory field in rearrangement data. Calling "primary" and "secondary" chains would then happen in a separate step, which would also allow to implement different strategies for chain ranking in the future.

If anyone has reservations against this approach, now would be a good time to speak up, otherwise it might be too late.

grst commented 1 year ago

I'm finally making some progress on the new datastructure (#356). Here is how it is currently working out:

  1. IO functions read in data into an awkward array in adata.obsm["airr"]. At this point no filtering happens, all information from an AIRR rearrangement file is preserved.

    # adata.obsm["airr"]
    [
       # cell0: 2 chains
       [{"locus": "TRA", "junction_aa": "CADASGT..."}, {"locus": "TRB", "junction_aa": "CTFDD..."}],
       # cell1: 1 chain
       [{"locus": "IGH", "junction_aa": "CDGFFA..."}],
       # cell2: 0 chains
       [],
    ]
  2. The new function scirpy.pp.index_chains is called. It adds another awkward array to adata.obsm["chain_indices"]. This function applies what is described as the scirpy receptor model, i.e. it flags multichain cells, and calls primary and secondary VJ and VDJ cells.

    # adata.obsm["chain_indices"]
    [
       {"VJ": [0], "VDJ": [1], "multichain": False}, # single pair
       {"VJ": [0, 2], "VDJ": [1,3], "multichain": False}, # dual IR
    ]

    The beauty of this is that this can easily be adapted to other receptor models. For instance, the index_chains function can be called with productive=False to keep non-productive chains. It would also be thinkable to develop a completely different receptor model with an arbitrary number of chains per obs. This could be relevant for spatial or bulk TCR data. This would require new analysis functions in scirpy (or another library), but it all fits into the same data structure.

  3. Get AIRR data by slicing the adata.obsm["airr"] with the indices in adata.obsm["chain_indices"]. For convenience, there is now a helper function that allows to request one, or multiple columns from the AIRR data, e.g.

    >>> scirpy.get.airr(adata, "locus", "VJ_1")
    AAACCTGAGAGTGAGA-1     TRA 
    AAACCTGAGGCATTGG-1     TRA
    AAACCTGCACCTCGTT-1    None
    ...
  4. Work with MuData. MuData will be the new recommended way of combining gene expression and AIRR data.

    mdata = md.MuData({"airr": adata_airr, 'gex': adata_gex})
    ir.tl.chain_qc(mdata["airr"])

I still need to iron out a few details, and most importantly, update the documentation and tutorial to reflect all these changes. There will be an automatic check and a conversion function to transfer data from the previous format into the new data structure.

grst commented 1 year ago

The new datastructure is now available in the v0.13.0rc1 pre-release :tada:! I'm happy about everyone who tries it out and gives feedback!

You can install it using

pip install --upgrade --pre scirpy

Please also check out

gszep commented 1 year ago

@grst this is amazing thank you so much for implementing this ❤️ I can see in the tutorials how I can access Rearrangement fields which is excellent. What about Subject or Sample fields from AIRR such as the age of donor/patient or disease_diagnosis etc; metadata that would be the same for a subset or all of the cells under obs

Happy to open a PR to make this happen 🙂

bcorrie commented 1 year ago

I agree - great to see this moving forward - thanks @grst and others.

@gszep we talked about Repertoire metadata earlier (https://github.com/scverse/scirpy/issues/327#issuecomment-1172887702) but I think currently scverse libraries only load AIRR Rearrangements. Would love to hear that I am wrong, but I believe that to be the case.

bcorrie commented 1 year ago

There is an example in the docs (https://scverse.org/scirpy/tags/v0.13.0rc1/tutorials/tutorial_io.html#Combining-multiple-samples) where multiple samples are combined, and the equivalent of the AIRR sample_id is set for multiple samples in the obs data. In this case the obs_name is used to assign a sample obs.

So the trick would be to load in an AIRR Repertoire file and process it to do something similar based on the repertoire_id in the Rearrangement data.

One day soon I hope to get to doing something like this for the Single Cell downloads from the iReceptor Gateway. Ideally one could request an h5ad download of multiple samples and get a single h5ad file with all the GEX, Cell, Rearrangement, and Repertoire metadata embedded in that single h5ad file. Currently when you do a download you get 4 separate files, one for each type of data.

jday1 commented 1 year ago

Thank you for your reply.

I am still learning the best practices for anndata. Say we have a usecase of 20 repertoires each with 100,000 cells or so.

In this case, would it make sense to have 20 Anndata objects and in each anndata object there would be the 100,000 'Rearrangement' rows stored under the airr path and the AIRR Repertoire stored in uns or similar?

bcorrie commented 1 year ago

I am unfortunately not an anndata/scirpy expert either, I come from the AIRR side. My simple answer would be it depends on if you want to compare the repertoires. If yes, then having them in a single object is probably best (as in the example - you have to worry about batch effects). That way you can slice and dice across any of the metadata fields from the Repertoire.

I don't yet have any experience with how well scirpy and related tools will scale when you are throwing together data sets of this size. Any scirpy experts want to comment.

jday1 commented 1 year ago

Thanks. Scaling is my main concern. If I want to filter my Repertoires before using their Rearrangements I'm not sure it's efficient to use a single object. I guess one might be able to use the concatenate feature which will put multiple AnnData objects into a single object if you require it. It would be great if someone from the scirpy side could offer a more informed view than mine!

bcorrie commented 1 year ago

Yes, I would guess keeping them separate until you want to compare them, then concatenate them as required for specific comparisons. Still an interesting scalability question as to how many samples can you practically compare in a single scirpy object.