openpipelines-bio / openpipeline

https://openpipelines.bio
MIT License

Data format specs after mapping #102

Open rcannood opened 1 year ago

rcannood commented 1 year ago

This is a first attempt at deriving a data format specification.

Once we have settled on some of the APIs, we could include them in our config.vsh.yaml definitions (similar to https://github.com/openproblems-bio/openproblems-v2/tree/main/src/label_projection/api).

After Cell Ranger or BD Rhapsody mapping

obs:
  index # cell id
  sample
  cell_type # human-readable name
  organism # ?
  tissue # ?

mod:
  # gene expression
  rna:
    layers:
      counts
      velocity_spliced # ?
      velocity_unspliced # ? 
    var:
      index # feature_id, preferably an ensembl id
      feature_name

  # Antibody Capture
  prot: 
    layers:
      counts
    var:
      index # feature_id
      feature_name # Associated protein names

  # IR receptor data
  vdj: 
    obsm:
      vdj_t
      vdj_b

  # Custom Capture
  custom:
    X: # raw counts

uns:
  sample_info: # dictionary of data frames, every data frame has a 'sample_id' column
    cellranger: h5attributes(h5)
    qc: # Data frame with columns:
      sample_id # corresponds to .obs["sample_id"]
      component_id # the component that generated these qc values, e.g. mapping/cellranger_count
      category # 10x example [Cells, Library], BD example [Sequencing Quality, Library Quality, ...]
      group_name # example 'ABC_1'
      metric_name
      metric_value # numerical values, example 1000, 0.1 -- strip % signs
  param_log: # list of dicts
    - pipeline_id
      component_id
      component_version
      id
      params: { input: ..., output: ..., arg1: ..., arg2: ... } # store only the base names of files, not their full paths

After single sample RNA

mod:
  rna:
    obs:
      doublet_prob
      doublet_score
      doublet_bool
      <standard names for scanpy calculate qc metrics>
    var:
      <standard names for scanpy calculate qc metrics>
    layers:
      ambient_corrected_counts

After multi sample RNA

New fields:

mod:
  rna:
    var:
      highly_variable # boolean
    layers:
      normalized

After integration RNA

New fields:

mod:
  rna:
    obs:
      cluster
    obsm:
      X_pca
      X_integrated
      X_umap
    obsp:
      connectivities
      distances
    uns:
      neighbors: # for compatibility with umap
        connectivities_key
        distances_key
        params: { ... }
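
The neighbors entry above can be sanity-checked with a small helper. This is only a sketch: it uses a plain nested dict as a stand-in for the rna modality, and it assumes that `connectivities_key` and `distances_key` hold the names of the matching `obsp` entries. The function name `check_neighbors` is made up for illustration.

```python
from typing import Any, Dict, List


def check_neighbors(rna: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the layout is consistent.

    `rna` is a plain dict stand-in for the rna modality (assumption),
    with "obsp" and "uns" keys mirroring the spec above.
    """
    neighbors = rna.get("uns", {}).get("neighbors")
    if neighbors is None:
        return ["uns['neighbors'] is missing"]

    problems = []
    obsp = rna.get("obsp", {})
    for field in ("connectivities_key", "distances_key"):
        key = neighbors.get(field)
        if key is None:
            problems.append(f"uns['neighbors']['{field}'] is missing")
        elif key not in obsp:
            # The key is set but does not point at an existing obsp entry.
            problems.append(f"obsp['{key}'] is missing")
    return problems
```

An empty result means both keys are present and resolve to entries in `obsp`; anything else lists what is missing.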

After annotation

Since annotation could be used across modalities, it should be possible to write its output to the root of the MuData object.

obsm:
  annotation_scvi: # data frame with the predictions and scores?
  annotation_bbknn: # data frame
  # all in one: with just the predictions?
  annotation:
    prediction_scvi
    prediction_bbknn
    ...
uns:
  ...?
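
To make the "all in one" option more concrete, here is a rough sketch of how per-method predictions could be combined into one table keyed by cell id. It uses plain dicts and lists rather than a real data frame, and the helper name `combine_predictions` is hypothetical. The `prediction_<method>` column naming follows the sketch above.

```python
from typing import Dict, List


def combine_predictions(
    cell_ids: List[str], predictions: Dict[str, List[str]]
) -> Dict[str, Dict[str, str]]:
    """Build one row per cell with a prediction_<method> column per method.

    `predictions` maps a method name (e.g. "scvi") to labels that are
    in the same order as `cell_ids` (assumption).
    """
    for method, labels in predictions.items():
        if len(labels) != len(cell_ids):
            raise ValueError(f"{method}: expected {len(cell_ids)} labels, got {len(labels)}")
    return {
        cell_id: {
            f"prediction_{method}": labels[i]
            for method, labels in predictions.items()
        }
        for i, cell_id in enumerate(cell_ids)
    }
```

Scores are left out on purpose here; they would live in the per-method entries such as annotation_scvi.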

WIP!

Logging QC metrics

uns:
  sample_info: # dictionary of data frames, every data frame has a 'sample_id' column
    cellranger: h5attributes(h5)
    qc: # Data frame with columns:
      sample_id # corresponds to .obs["sample_id"]
      component_id # the component that generated these qc values, e.g. mapping/cellranger_count
      category # 10x example [Cells, Library], BD example [Sequencing Quality, Library Quality, ...]
      group_name # example 'ABC_1'
      metric_name
      metric_value # numerical values, example 1000, 0.1 -- strip % signs
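
As a rough illustration of the qc layout, the sketch below turns one component's metrics into long-format rows with the columns listed above. The helper names are made up, and two choices are assumptions rather than part of the spec: a value like "10%" is stored as 10.0 (the sign is dropped, the number is not rescaled), and thousands separators are removed.

```python
from typing import Dict, List, Union

RawValue = Union[str, int, float]


def parse_metric_value(raw: RawValue) -> float:
    """Convert a raw metric such as '1,000' or '10%' to a float.

    Assumption: the % sign is stripped without rescaling, and commas
    are treated as thousands separators.
    """
    if isinstance(raw, (int, float)):
        return float(raw)
    return float(raw.strip().rstrip("%").replace(",", ""))


def qc_rows(
    sample_id: str,
    component_id: str,
    category: str,
    group_name: str,
    metrics: Dict[str, RawValue],
) -> List[Dict[str, Union[str, float]]]:
    """Return one row per metric, using the column names from the spec."""
    return [
        {
            "sample_id": sample_id,
            "component_id": component_id,
            "category": category,
            "group_name": group_name,
            "metric_name": name,
            "metric_value": parse_metric_value(value),
        }
        for name, value in metrics.items()
    ]
```

Rows from several components can then be concatenated into a single data frame, since they all share the same columns.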

Logging execution

uns:
  param_log: # list of dicts
    - pipeline_id
      component_id
      component_version
      id
      params: { input: ..., output: ..., arg1: ..., arg2: ... } # store only the base names of files, not their full paths
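
A possible way to build one param_log entry is sketched below. It assumes that only `input` and `output` hold file paths (other components may need a different list) and the function name `make_param_log_entry` is hypothetical. Paths are reduced to their base names, as noted in the comment above.

```python
from pathlib import PurePosixPath
from typing import Any, Dict, Tuple

# Assumption: these are the parameters that contain file paths.
FILE_PARAMS: Tuple[str, ...] = ("input", "output")


def make_param_log_entry(
    pipeline_id: str,
    component_id: str,
    component_version: str,
    run_id: str,
    params: Dict[str, Any],
    file_params: Tuple[str, ...] = FILE_PARAMS,
) -> Dict[str, Any]:
    """Return one param_log dict, keeping only base names for file parameters."""
    cleaned = {
        key: PurePosixPath(value).name
        if key in file_params and isinstance(value, str)
        else value
        for key, value in params.items()
    }
    return {
        "pipeline_id": pipeline_id,
        "component_id": component_id,
        "component_version": component_version,
        "id": run_id,
        "params": cleaned,
    }
```

Each component would append its entry to the existing list, so the log keeps the order in which components ran.
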

rcannood commented 1 year ago

An (incomplete) overview is included on the website: https://openpipelines.bio/guide/data_api.html