scverse datastructure for AIRR data

grst commented 2 years ago

Now that scirpy is part of scverse, we could think of an improved data structure for scAIRR data. See also the discussion at https://github.com/theislab/scanpy/issues/1387.

The challenge with scAIRR data is that

1 cell can have n chains. Up to four of them are biologically meaningful but there could be more for technical reasons.
Each chain has a lot of fields. See the AIRR rearrangement standard.

The current pragmatic solution is to store all fields in adata.obs.

All columns from the airr rearrangement schema are repeated four times
Excess chains are serialized into JSON and stored in an extra column. These chains are not used by scirpy, but enable lossless conversions.
The downside is that there can easily be 100+ columns in adata.obs. Also serializing excess chains is not really elegant.
The advantage is that it works really well with scanpy, i.e. any AIRR variable can immediately be used for grouping, plotting etc.

New options are

mudata. AIRR data could be saved as a separate modality. Even if we keep the current reprepsentation of a wide data frame, it would at least not clutter the rest of adata.obs.
awkward array support. Allows storing an arbitrary number of values per row. See https://github.com/theislab/anndata/pull/647

The new representation should also aim at being a community standard for the scverse ecosystem and should build upon the AIRR rearrangement standard. Ideally, we could get additional stakeholders onboard, including conga, dandelion, tcrdist3 and possibly members of the AIRR community.

[x] what's the state of the AIRR single-cell schema? And what are its advantates over the rearrangement schema.

zktuong commented 1 year ago

i can help answer some of these questions (disclaimer - i still need time to familiarise myself with the new scirpy data structure but i'm familiar with the other main bits):

@gszep we talked about Repertoire metadata earlier (https://github.com/scverse/scirpy/issues/327#issuecomment-1172887702) but I think currently scverse libraries only load AIRR Rearrangements. Would love to hear that I am wrong, but I believe that to be the case.

The main thing is that the anndata.obs slot index is cell_id, whereas the airr data's index is sequence_id. So scirpy stores the airr data separately in the .uns and uses the awkward arrays to both store and retrieve the information. Although i believe the awkward arrays can theoretically hold any information in the AIRR row for each contig. So you could store those information if you have it, just that it won't be retrieved automatically.

I am still learning the best practices for anndata. Say we have a usecase of 20 repertoires each with 100,000 cells or so.

In this case, would it make sense to have 20 Anndata objects and in each anndata object there would be the 100,000 'Rearrangement' rows stored under the airr path and the AIRR Repertoire stored in uns or similar?

Yes. the thing to look out for in the airr table is that your sequence_id are unique. cell_id can be repeated in the airr table but i think internally the individual contigs will still be ranked by umi count to sort them from largest to smallest, so that the highest expressing pairs of heavy/light or long/short contigs would be the main bcr/tcr for that cell.

Yes, I would guess keeping them separate until you want to compare them, then concatenate them as required for specific comparisons. Still an interesting scalability question as to how many samples can you practically compare in a single scirpy object.

The io step and retrieval steps should be fast i believe?

gszep commented 1 year ago

The HDF5 file format was designed to have lazy constant time random access to any contiguous slices of your data. If these advantages are exposed in anndata.AnnData class then we shouldn't have to worry about running out of memory and can concatenate and compare as much as we want. However as far as I can tell:

Lazy concatentation of different files is still an experimental feature available via anndata.experimental.AnnCollection and there are issues being tracked for this https://github.com/scverse/anndata/issues/793
Lazy loading of partial data within a single file is supported for .X (maybe also .layers and .obsm?) using the backed="r" mode but it appears that .obs and .var are still all loaded into memory. I've create an issue to suggest supporting lazy dataframes https://github.com/scverse/anndata/issues/981
After the above two issues are solved, the lazy versions of plotting and analysis functions must be written. Hopefully some of them will work out of the box if the lazy dataframe closesly matches the pandas api.

grst commented 1 year ago

Thanks for all your comments!

What about Subject or Sample fields from AIRR such as the age of donor/patient or disease_diagnosis etc; metadata that would be the same for a subset or all of the cells under obs

These are usually just stored as additional columns in .obs. Represented as categoricals, this is quite memory-efficient.

Ideally one could request an h5ad download of multiple samples and get a single h5ad file with all the GEX, Cell, Rearrangement, and Repertoire metadata embedded in that single h5ad file. Currently when you do a download you get 4 separate files, one for each type of data.

I'm definitely open to add more reader functions to load other AIRR schemas than Rearrangement. Could you maybe point me to an example dataset that has all four data types? Then I'll take a look how to best represent them as AnnData.

I'll separately comment on scalability later.

grst commented 1 year ago

Regarding scalability:

Different steps in the scirpy pipeline are subject to different limitations. My goal is to enable improve scalability of scirpy such that analysis of (few) millions of cells is conveniently possible on a single workstation (e.g. >200GB RAM, >30 cores). This is tracked here: https://github.com/scverse/scirpy/issues/370.

For anything >10M cells, I believe we need to move to out-of-memory and out-of-core approaches. Solutions for this are still being figured out on the AnnData side as @gszep has pointed out. AnnData's current "backed mode" does not support layers and obsm. I believe random access of serialized awkward arrays requires some additional effort, but will be possible.

To get a better idea of current limitations of scalability, I'll share here my experiments with omniscope's longitudonal COVID19 dataset with 8M TCR-beta chains (and no gene expression data):

reading the rearrangement files: ~15min (can be further optimized & parallelized)
the index_chains function: 2:30h (I think I can speed this up significantly using numba, #386).
retrieving a data frame of all "junction_aa" sequences: 13s
```
ir.get.airr(adata, "junction_aa")
```
subsetting anndata and retrieving 10k "junction_aa": 34ms
```
ir.get.airr(adata[200000:210000], "junction_aa")
```
size of AnnData object on-disk (uncompressed): 23GB

The real bottlenecks of the scirpy workflow are further downstream:

levenshtein or alignment distance take ~24h on 44 cpus
clonotype calling takes ~24h on 1 cpu, parallelization as currently implemented is not effective.

bcorrie commented 1 year ago

I'm definitely open to add more reader functions to load other AIRR schemas than Rearrangement. Could you maybe point me to an example dataset that has all four data types? Then I'll take a look how to best represent them as AnnData.

Any chance you have an iReceptor Gateway account 8-)

I just did a download of the data from one subject from a cancer study from Yost et al (http://doi.org/DOI:10.1038/s41591-019-0522-3). The download contains three Repertoires from a single subject at different time points, two pre-treatment and one post-treatment. The data will have a rearrangement file, a cell file (AIRR JSON format), and a GEX file (AIRR JSON format) (as well as some other files). Each will contain all of the data of one type from all three Repertoires. That is the rearrangement file will contain rearrangements from all three repertoires (actually six repertoires in the rearrangement case, as they are split into TRA and TRB for each time point). It is a 10X study with ~4500 Cells across all three time points.

The ZIP file is 180MB so probably want to share it outside of github.

grst commented 1 year ago

Happy to announce that the new datastructure is now rolled out as part of scirpy v0.13. Please check out the release notes.

javh commented 1 year ago

Excellent!

scverse / scirpy

scverse datastructure for AIRR data #327