[H5AD Upload] Plan H5AD upload epic

gerbeldo commented 10 months ago

https://docs.google.com/document/d/1r1AWYk_zvhsWuNYfg0lf5jkHKh49uAnrtmeVZ7uff5M/edit?usp=sharing

alexvpickering commented 9 months ago

Thanks @ogibson for the well-written design document. Can you provide some details about the conversation that led to deciding on initial compatibility with nf-core Alevin-Fry (the linked conversation is private)?

My main concern is the initial focus on supporting the output of nf-core Alevin-Fry. This seems like a very specific choice that may only benefit a few users. Would it not make more sense to support the most commonly used format from scanpy (what you would get if you follow their tutorials) or the format that is used by CellXGene?

It seems like nf-core Alevin-Fry may already produce 10X h5 files. The code that you linked to first reads in either the 10x h5 or mtx files and then just dumps the count matrices into an h5ad file. Why not just ask these users to upload the 10x h5 or mtx files?

As I understand it, Alevin-Fry does not do any downstream processing (doublet detection, integration, embedding, clustering, etc.) other than maybe inflection-point-based filtering. In the design document, you specifically mention that the interesting thing about H5AD support is to allow the user to explore already processed datasets (so that they can run DE analysis, make plots, etc. on their pre-calculated embedding/clusters). I agree with that statement and, as a result, support for Alevin-Fry h5ad upload seems totally beside the point.

ogibson commented 9 months ago

I agree with your comments, Alex. The decision to support Alevin-fry was made because some Cellenics users have that requirement, but my initial hunch was that CellXGene is the way to go, and I think it makes sense.

Additionally, if we implement CellXGene support, moving on to Alevin-fry down the line would be almost trivial.

I will modify the document to reflect this conversation.

alexvpickering commented 9 months ago

Sounds good -- It looks like CellXGene expects a structure similar to what you would find for scanpy. I would just add that there are a couple of other things that we will need in order to provide full support for downstream Cellenics features:

- a column within the obs DataFrame that indicates sample identity (needed for DE analysis, to detect other columns that hold sample-level metadata, and to distinguish sample-level metadata from cluster metadata)
- a raw count matrix (needed for pseudobulk DE between groups -- logcounts or similar are not appropriate)
- likely a PCA reduction (worth investigating if we can do away with it or if it is appropriate to just calculate it if needed -- I think it might be necessary for trajectory analysis)
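
To make the three requirements above concrete, here is a rough sketch of a pre-upload check. Everything named here is an assumption for illustration: the function name, the candidate column names, and the `counts` / `raw_counts` layer names are hypothetical and do not come from the Cellenics codebase. `X_pca` is the key scanpy uses by default when it stores a PCA in `obsm`. The function takes plain names rather than an AnnData object so it stays independent of how the file is read.

```python
# Hypothetical names throughout; this is a sketch, not the Cellenics implementation.
SAMPLE_COLUMN_CANDIDATES = ("sample", "sample_id", "donor", "patient", "patient_id")
RAW_COUNT_LAYER_CANDIDATES = ("counts", "raw_counts")  # assumed layer names
PCA_KEY = "X_pca"  # scanpy's default obsm key for a PCA reduction


def check_upload_requirements(obs_columns, layer_names, obsm_keys, has_raw=False):
    """Return a list of problems; an empty list means all three checks pass."""
    problems = []

    # 1. A column in obs that identifies the sample each cell came from.
    lowered = {str(col).lower() for col in obs_columns}
    if not lowered.intersection(SAMPLE_COLUMN_CANDIDATES):
        problems.append("no obs column identifying the sample")

    # 2. Raw counts, either in a .raw slot or in a recognised layer.
    has_count_layer = bool(set(layer_names).intersection(RAW_COUNT_LAYER_CANDIDATES))
    if not has_raw and not has_count_layer:
        problems.append("no raw count matrix")

    # 3. A PCA reduction (possibly optional, pending the investigation above).
    if PCA_KEY not in obsm_keys:
        problems.append("no PCA reduction")

    return problems


if __name__ == "__main__":
    print(check_upload_requirements(["sample_id", "cell_type"], ["counts"], ["X_pca"]))
    print(check_upload_requirements(["cell_type"], [], ["X_umap"]))
```

With the `anndata` package, the inputs could come from `adata.obs.columns`, `adata.layers.keys()`, `adata.obsm.keys()` and `adata.raw is not None`. Note that a populated `.raw` slot or a layer called `counts` does not guarantee unnormalized integer counts, so a real check would probably also need to inspect the values.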

If we have the above, everything else (sample and cell level metadata) should be straightforward to auto-detect using the existing functions for Seurat object upload. It could really be almost identical to Seurat object upload except it would be extracting from an H5 file instead of an rds file. Any security concerns would also be similar.

I personally wouldn't specify in the UI that it should be in CellxGene format but rather just indicate what is required. The above requirements for Cellenics are not specified by CellxGene (they don't require raw counts or a standard column name that indicates sample identity). There are parallel things we could do to improve CellxGene support for both Seurat .rds as well as .h5ad uploads (CellxGene datasets can be downloaded as either). In particular, sample identity is specified using a variety of column names depending on the investigator (sample, donor, sample_id, patient, patient_id, etc.). We could trawl through the datasets and identify all the column names to come up with a good regex.
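
As a starting point for that regex, something like the following could work. This is only a sketch built from the five example column names above; the pattern, the constant name, and the function name are hypothetical, and a survey of real CellxGene datasets would almost certainly change the pattern.

```python
import re

# Hypothetical pattern covering sample, donor, patient and their *_id variants.
# Allows "_", "-", whitespace, or nothing between the stem and "id".
SAMPLE_COLUMN_RE = re.compile(r"^(sample|donor|patient)([_\s-]?id)?$", re.IGNORECASE)


def find_sample_columns(obs_columns):
    """Return the obs column names that look like sample identifiers, in order."""
    return [col for col in obs_columns if SAMPLE_COLUMN_RE.match(col.strip())]


if __name__ == "__main__":
    columns = ["sample_id", "cell_type", "Donor", "n_genes", "patient-id"]
    print(find_sample_columns(columns))
```

If more than one column matches, we would still need a rule (or a prompt to the user) to pick the one that actually identifies samples, since a dataset can carry both a donor and a sample column.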

[H5AD Upload] Plan H5AD upload epic #35