parashardhapola / scarf

Toolkit for highly memory efficient analysis of single-cell RNA-Seq, scATAC-Seq and CITE-Seq data. Analyze atlas scale datasets with millions of cells on laptop.
http://scarf.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Constructing a scarf Zarr dataset from scratch #54

Closed jamestwebber closed 3 years ago

jamestwebber commented 3 years ago

Hello there. I've been thinking a lot lately about Dask and scRNAseq analysis so I'm excited to give this a try! We have a ton of data to analyze and most tools aren't able to scale well enough, so having something that can distribute the work is great.

The first step I'm figuring out is how to get the data into the right format. I already have a very large dataset (10X snRNAseq) in Zarr format, stored as a cell-by-gene matrix of UMI counts, so I would prefer not to go through the CR import process but just tweak the layout to be compatible with Scarf. Is that reasonable?

I think I can figure this out from the docs but I thought I'd open an issue in case there's an easy answer. I am guessing it's just a matter of naming things correctly.

parashardhapola commented 3 years ago

Hi @jamestwebber,

This is absolutely possible. One question first: is there any reason other than the time it takes that you would like to avoid the CR import functions?

If you haven't already, I would suggest reading this vignette about Zarr organization.

So, a minimal Zarr hierarchy in Scarf looks like this (this data has 892 cells and 36601 features):

/
├── RNA
│   ├── counts (892, 36601) uint32
│   └── featureData
│       ├── I (36601,) bool
│       ├── ids (36601,) <U15
│       └── names (36601,) <U17
└── cellData
    ├── I (892,) bool
    ├── ids (892,) <U18
    └── names (892,) <U18

The top folders RNA and cellData are simply Zarr groups. You must add the following two attributes to the RNA group like this:

z.RNA.attrs["is_assay"] = True
z.RNA.attrs["misc"] = {}

where z is the Zarr object.

The counts entry under the RNA group is the actual chunked Zarr dataset, with cells as rows and features as columns. You can copy your Zarr matrix chunks into this folder. Please note that counts itself is an array, not a group.

RNA contains a Zarr group called featureData, which must contain at least these three Zarr datasets: I, ids and names. You can create the group and these three datasets like this:

from scarf.writers import create_zarr_obj_array

g = z.create_group("RNA/featureData")
create_zarr_obj_array(g, "ids", feat_ids)
create_zarr_obj_array(g, "names", feat_names)
create_zarr_obj_array(g, "I", [True for _ in range(len(feat_ids))], "bool")

You will need feat_ids and feat_names which are basically arrays containing unique IDs and names for the features. You can use feat_ids as feat_names as well.

The last step is to populate the cellData group with three datasets: I, ids and names.

g = z.create_group("cellData")
create_zarr_obj_array(g, "ids", cell_ids)
create_zarr_obj_array(g, "names", cell_ids)
create_zarr_obj_array(g, "I", [True for _ in range(len(cell_ids))], "bool")

cell_ids are usually the cell barcodes.

With this you basically have the Zarr file ready to be loaded into Scarf.
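Before loading, it can help to sanity-check the layout against the hierarchy described above. The helper below is a hypothetical sketch, not part of Scarf: `check_scarf_layout` is a name invented here, and it only relies on `in`, `[...]`, `.attrs` and `.shape`, which Zarr groups and arrays provide.

```python
def check_scarf_layout(root, assay="RNA"):
    """Return a list of problems; an empty list means the layout looks right.

    Hypothetical helper (not a Scarf function). `root` is the opened Zarr
    group, or any object offering `in`, `[...]`, `.attrs` and `.shape`.
    """
    problems = []

    # The two top-level groups must exist before anything else is checked.
    for name in (assay, "cellData"):
        if name not in root:
            problems.append(f"missing group: {name}")
    if problems:
        return problems

    assay_group, cell_group = root[assay], root["cellData"]

    # The assay group needs the two attributes mentioned earlier.
    for key in ("is_assay", "misc"):
        if key not in assay_group.attrs:
            problems.append(f"missing attribute on {assay}: {key}")

    # counts and featureData must both live under the assay group.
    for name in ("counts", "featureData"):
        if name not in assay_group:
            problems.append(f"missing: {assay}/{name}")
    if problems:
        return problems

    # counts is (cells, features); every metadata array must match one axis.
    n_cells, n_feats = assay_group["counts"].shape
    checks = (
        ("cellData", cell_group, n_cells),
        (f"{assay}/featureData", assay_group["featureData"], n_feats),
    )
    for label, group, expected in checks:
        for name in ("I", "ids", "names"):
            if name not in group:
                problems.append(f"missing: {label}/{name}")
            elif tuple(group[name].shape) != (expected,):
                problems.append(
                    f"{label}/{name} has shape {tuple(group[name].shape)}, "
                    f"expected ({expected},)"
                )
    return problems
```

Calling `check_scarf_layout(z)` on the finished hierarchy should return an empty list; any strings it returns name the piece that is missing or mis-sized.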

Please let me know if something is not clear and needs to be explained better.