linnarsson-lab / loom-viewer

Tool for sharing, browsing and visualizing single-cell data stored in the Loom file format
BSD 2-Clause "Simplified" License

Make loom files explicitly typed #47

Closed by slinnarsson 8 years ago

slinnarsson commented 8 years ago

Add type information (i.e. a schema) to the loom file.

Benefits:

slinnarsson commented 8 years ago

Proposed design:

  1. In the loom file, store the schema as the HDF5 attribute schema on the root group /.

    The value of schema is a JSON-formatted string giving the type of each attribute and of the main matrix. Example:

    schema = {
     "matrix": "float32",
     "row_attrs": {
        "GeneName": "string",
        "Chromosome": "string",
        "Position": "int",
        "GC_Percent": "float64"
     },
     "col_attrs": {
        "CellID": "string",
        "Tissue": "string",
        "Total_Molecules": "int",
        "Class": "string"
     }
    }
  2. The schema cannot be edited directly; it changes only when attributes are added (including adding an attribute with the name of an existing attribute, which overwrites it).
  3. The LoomConnection class gets a new property, schema, which returns the schema as a Python object with the properties matrix (str), row_attrs (dict) and col_attrs (dict).
  4. When serialized to JSON, the schema is included separately (camelCased for JavaScript):

    fileinfo = {
       "project": "Published datasets",
       "dataset": "filename.loom",
       "filename": "filename.loom",
       "shape": [1200,4500],
       "zoomRange": [0,8,16],
       "fullZoomHeight": 54745,
       "fullZoomWidth": 43657,
       "rowAttrs": ...,
       "colAttrs": ...,
       "schema": {
                   "matrix": "float32",
                   "rowAttrs": {
                      "GeneName": "string",
                      "Chromosome": "string",
                      "Position": "int",
                      "GC_Percent": "float64"
                   },
                   "colAttrs": {
                      "CellID": "string",
                      "Tissue": "string",
                      "Total_Molecules": "int",
                      "Class": "string"
                   }
                }
    }
  5. The only valid type for matrix is float32. The only valid types for attributes are float64, int and string.
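
A minimal sketch of how points 4 and 5 could fit together, using only the Python standard library. The helper names validate_schema and to_camel_case are hypothetical and not part of loompy; the JSON string stands in for the value that would be read from the schema attribute on group /.

```python
import json

# Allowed type names, taken from point 5 of the proposal.
MATRIX_TYPES = {"float32"}
ATTR_TYPES = {"float64", "int", "string"}


def validate_schema(schema):
    """Hypothetical helper: raise ValueError if the schema breaks point 5."""
    for key in ("matrix", "row_attrs", "col_attrs"):
        if key not in schema:
            raise ValueError(f"schema is missing {key!r}")
    if schema["matrix"] not in MATRIX_TYPES:
        raise ValueError(f"invalid matrix type: {schema['matrix']!r}")
    for axis in ("row_attrs", "col_attrs"):
        for name, type_name in schema[axis].items():
            if type_name not in ATTR_TYPES:
                raise ValueError(
                    f"invalid type {type_name!r} for {axis}[{name!r}]"
                )
    return schema


def to_camel_case(schema):
    """Hypothetical helper: rename the top-level keys for JavaScript (point 4)."""
    return {
        "matrix": schema["matrix"],
        "rowAttrs": dict(schema["row_attrs"]),
        "colAttrs": dict(schema["col_attrs"]),
    }


# Stand-in for the string stored in the HDF5 attribute `schema` on group `/`.
schema_json = """
{
  "matrix": "float32",
  "row_attrs": {"GeneName": "string", "Position": "int", "GC_Percent": "float64"},
  "col_attrs": {"CellID": "string", "Total_Molecules": "int"}
}
"""

schema = validate_schema(json.loads(schema_json))
fileinfo_schema = to_camel_case(schema)
print(json.dumps(fileinfo_schema, indent=2))
```

Only the three top-level keys are renamed here; the attribute names themselves (GeneName, Total_Molecules, ...) are left as they appear in the file, matching the fileinfo example above.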
JobLeonard commented 8 years ago

Nice! The loom spec should probably include this documentation on the schema, no?

Actually, I have a related request regarding that, so I'll open a separate ticket.

slinnarsson commented 8 years ago

I updated the spec and also the loompy (Python package) documentation.