Closed jamestwebber closed 3 years ago
Hi @jamestwebber,
This is absolutely possible. First, a question: is there any reason other than time consumption that you would like to avoid the CR import functions?
If you haven't already, I would suggest you read this vignette about Zarr organization.
So, a minimal Zarr hierarchy in Scarf looks like this (this data has 892 cells and 36601 features):

```
/
├── RNA
│   ├── counts (892, 36601) uint32
│   └── featureData
│       ├── I (36601,) bool
│       ├── ids (36601,) <U15
│       └── names (36601,) <U17
└── cellData
    ├── I (892,) bool
    ├── ids (892,) <U18
    └── names (892,) <U18
```
The top-level folders `RNA` and `cellData` are simply Zarr groups. You must add the following two attributes to the `RNA` group, like this:
```python
z.RNA.attrs["is_assay"] = True
z.RNA.attrs["misc"] = {}
```

where `z` is the Zarr object.
`counts` under the `RNA` group is the actual chunked Zarr dataset. You can copy your Zarr matrix chunks into this folder. Please note that `counts` itself is not a group.
`RNA` contains a Zarr group called `featureData`, which must contain at least these three Zarr datasets: `I`, `ids`, and `names`.
You can create these three datasets and the group itself like this:
```python
from scarf.writers import create_zarr_obj_array

g = z.create_group("RNA/featureData")
create_zarr_obj_array(g, "ids", feat_ids)
create_zarr_obj_array(g, "names", feat_names)
create_zarr_obj_array(g, "I", [True for _ in range(len(feat_ids))], "bool")
```
You will need `feat_ids` and `feat_names`, which are arrays containing unique IDs and names for the features. You can use `feat_ids` as `feat_names` as well.
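If you don't already have these in memory, one common source is a CellRanger-style `features.tsv` (ID, name, feature type per line). A plain-Python sketch, where the filename and contents are toy placeholders:

```python
# Write a tiny toy features.tsv so the sketch is self-contained.
with open("features.tsv", "w") as fh:
    fh.write("ENSG00000001\tGeneA\tGene Expression\n")
    fh.write("ENSG00000002\tGeneB\tGene Expression\n")

# Parse feature IDs and names from the tab-separated file.
feat_ids, feat_names = [], []
with open("features.tsv") as fh:
    for line in fh:
        fields = line.rstrip("\n").split("\t")
        feat_ids.append(fields[0])    # unique feature ID
        feat_names.append(fields[1])  # human-readable gene name
```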
The last step is to populate the `cellData` group with three datasets: `I`, `ids`, and `names`:
```python
g = z.create_group("cellData")
create_zarr_obj_array(g, "ids", cell_ids)
create_zarr_obj_array(g, "names", cell_ids)
create_zarr_obj_array(g, "I", [True for _ in range(len(cell_ids))], "bool")
```
`cell_ids` are usually the cell barcodes.
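As with the feature IDs above, you will want the ids to be unique. If your barcodes can repeat (e.g. across lanes), a plain-Python suffixing sketch (the barcodes below are made up):

```python
from collections import Counter

# Toy barcodes (hypothetical); the third duplicates the first.
cell_ids = ["AAACCTG-1", "AAACGGA-1", "AAACCTG-1"]

# Suffix repeated barcodes so that every id is unique.
seen = Counter()
unique_ids = []
for barcode in cell_ids:
    seen[barcode] += 1
    unique_ids.append(barcode if seen[barcode] == 1 else f"{barcode}_{seen[barcode] - 1}")
```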
With this, the Zarr file is essentially ready to be loaded into Scarf.
Please let me know if something is not clear and needs to be explained better.
Hello there. I've been thinking a lot lately about Dask and scRNAseq analysis, so I'm excited to give this a try! We have a ton of data to analyze, and most tools aren't able to scale well enough, so having something that can distribute the work is great.
The first step I'm figuring out is how to get the data into the right format. I already have a very large dataset (10X snRNAseq) in Zarr format, stored as a cell-by-gene matrix of UMI counts, so I would prefer not to go through the CR import process but just tweak the layout to be compatible with Scarf. Is that reasonable?
I think I can figure this out from the docs but I thought I'd open an issue in case there's an easy answer. I am guessing it's just a matter of naming things correctly.