mountetna / magma

Data server with friendly data loaders
GNU General Public License v2.0

Add key-value attribute #75

Closed: graft closed this issue 3 years ago

graft commented 6 years ago

This is motivated by lessons learned from the "gene_exp" table from the magma-ipi project, which is extremely large and unwieldy. Each rna_seq record has ~60,000 gene_exp records, which makes the gene_exp table large, inserts slow, indexing difficult, queries painful, etc.

Instead, we may store this data in a single column on the rna_seq record - a JSON key-value blob.

Magma should contain a new attribute to hold this sort of data. I am not sure what to call it - vector? map? Assuming 'map' just so I can write this:

A map stores a Hash, e.g. { key1: "value1", key2: "value2" }. The data storage is simple enough - Postgres supports JSON types and can perhaps even index keys. The storage should be flat - this is not a way to sneak in arbitrary JSON blobs, the values should all have the same declared type (e.g. map :expression, type: Float) and there should be no nesting.
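A minimal sketch of how such a declaration might look, mirroring Magma's model DSL; the `map` macro and the model shape here are hypothetical:

```ruby
# Hypothetical model declaration; `map` does not exist in Magma yet.
class RnaSeq < Magma::Model
  identifier :tube_name, type: String
  map :expression, type: Float   # flat Hash of String => Float, no nesting
end
```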

The remaining questions are best illustrated by the behavior of the endpoints:

1) /retrieve should simply return the data as a hash: for map :expression, the JSON document for a record contains { "expression": { "EGFR": 10, "BRAF": 20, ... } } (see the sketch after this list). TSV download can either (1) return a compressed string (EGFR:10,BRAF:20,...) or (2) return nothing, which is currently the behavior for table attributes.

2) /query should feature a new 'map' predicate, another type of column predicate. It would have two verbs:

- verb nil - returns the data, probably as a list of key-value pairs which the manifest can turn into a Vector
- verb '::subset', Array - returns a subsetted version of the data with only the keys in Array

3) /upload should accept a JSON hash as input, or perhaps a JSON Vector format. As for loading, we might write a 'MatrixLoader' to upload this sort of data.
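To make these shapes concrete, here is a minimal sketch in plain Ruby; the record identifier, values, and the ::subset semantics are illustrative assumptions, not the actual Magma wire format:

```ruby
# What /retrieve might return for a record with `map :expression`
# (identifier and values are made up).
record = {
  "tube_name"  => "sample_tube_1",
  "expression" => { "EGFR" => 10.0, "BRAF" => 20.0, "KRAS" => 5.0 }
}

# The proposed ::subset verb would keep only the requested keys.
def subset(map, keys)
  map.slice(*keys)
end

subset(record["expression"], ["EGFR", "BRAF"])
# => { "EGFR" => 10.0, "BRAF" => 20.0 }
```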

dabrau commented 6 years ago

I did some experiments on the bottleneck for the queries on my local machine. It seems that all the queries are sorted by id. If the table is large enough, this creates a bottleneck because Postgres resorts to an "EXTERNAL MERGE" sort that requires more disk I/O.

For the case where we are joining a relatively small rna_seq table to a large gene_exp table, the time it takes to join is negligible compared to the time it takes to order. I think storing attributes as a document will provide value where it makes sense to model the data that way, but I don't think it is needed to solve the current problem with the gene_exp table.
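A sketch of reproducing this observation, assuming a Sequel connection (Magma is built on Sequel); the table and column names are only illustrative of the magma-ipi schema:

```ruby
require 'sequel'

db = Sequel.connect(ENV.fetch('DATABASE_URL'))

# Inspect the plan for the kind of ordered join Magma issues; when
# work_mem is too small for the sort, the plan reports
# "Sort Method: external merge" and the sort spills to disk.
db.fetch(<<~SQL).each { |row| puts row.values.first }
  EXPLAIN ANALYZE
  SELECT ge.*
  FROM gene_exp ge
  JOIN rna_seq rs ON ge.rna_seq_id = rs.id
  ORDER BY ge.id
SQL
```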

I am proposing to

I think this will make querying the gene_exp table in the manifest useful in the near future, and we can table the key-value attribute.

graft commented 5 years ago

Revisiting this as a significant near-term issue to solve; it would be nice to have gene_exp be fast and snappy. Some recent insights:

1) There is not that much data - 1000 samples x 6 compartments x 60000 genes is 360 million datapoints. This is a lot, but it can be represented compactly in memory (see the back-of-envelope arithmetic below).

2) A flat representation is simply inefficient. We rarely want the data flat; many people have remarked that this is a bad idea that leads to duplication of data, etc. There is no need, for example, to store the same 60000 gene names or ensembl names over and over again.

A more compact representation stores the data in matrix format: row names are samples, column names are genes. Each dataset (e.g. rnaseq TPMs) is a slice of this matrix by rows or columns.
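A back-of-envelope check of the size claim, using 8-byte floats:

```ruby
datapoints = 1000 * 6 * 60_000   # samples x compartments x genes
bytes = datapoints * 8           # 64-bit floats => 2_880_000_000 bytes
puts bytes / 1024.0 ** 3         # ~2.68 GiB; half that with 4-byte floats
```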

How can this data be stored? So far we have three possibilities:

1) An array-based database like TileDB or SciDB.
2) A Postgres JSON column.
3) A Postgres array column.

graft commented 4 years ago

The easiest solution is a Postgres JSON column. There might be some expense in deserializing that much JSON - my rough test suggests 2000 records with 60000 key-values each would take about 100 seconds to fully expand. That isn't terrible by any means. Furthermore, once this has been inflated we can persist it in memory, so that it only needs to be expanded once and can thereafter be quickly retrieved from memory. There's a small cache-invalidation issue, but we can use the record's updated_at to avoid that problem.
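A sketch of both halves of this idea - timing the inflation cost, and memoizing the parsed hash keyed on updated_at so a changed record naturally misses the cache (all names are hypothetical):

```ruby
require 'json'
require 'benchmark'

# Approximate the inflation cost: a 60000-key JSON blob parsed 2000 times.
blob = (1..60_000).to_h { |i| ["gene_#{i}", rand] }.to_json
puts Benchmark.measure { 2_000.times { JSON.parse(blob) } }

# Memoize per record; a new updated_at yields a new cache key, which
# sidesteps invalidation (a real cache would also evict stale entries).
PARSED = {}
def expression_for(record)
  PARSED[[record[:id], record[:updated_at]]] ||=
    JSON.parse(record[:expression_json])
end
```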

graft commented 4 years ago

If this value is intended to function like a row from a matrix (that is, with fixed, ordered columns), ostensibly the keys, corresponding to the column names of the matrix, are redundant - that is, they do not need to be stored each time, but can be stored only once, and data can be stored as an array without keys.

This means that there needs to be a place to store and update keys; since this is table metadata, it would logically belong in a table record (presently the hard-coded model, but following #104 it could live in a model table). This seems somewhat cumbersome and prone to error. It also means data must be loaded in a particular order, and the user is responsible for ensuring that this corresponds to the key order for the model (e.g., if the keys are gene names A, B, C, the user must load values 1, 2, 3 corresponding to A, B, C).

Alternatively we might eschew the table metadata and store keys in each individual record. This allows the possibility of the map to be used to store heterogeneous sets of values (e.g., { A: 1, B:2 } in one record, { B: 3, C: 4 } in another). It's difficult to understand how you could validate keys in this scenario, or compose these disparate rows into a matrix - how would you determine column names? first entry? consensus?

The answer is clear: there must be metadata, and the column ordering must be fixed. This also means we are free to use an Array data type since we don't need a JSON column (although a JSON array might be simpler and less Postgres-dependent). Ideally we could pull the metadata from another table (e.g. ensembl::gene::gene_id); for the moment we might have to hard-code them.
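A sketch of what fixed ordering could look like in storage terms, assuming a Sequel migration and, for now, a hard-coded key list (table, column, and constant names are illustrative):

```ruby
# Each record stores only the ordered values; the keys live once,
# as metadata.
Sequel.migration do
  change do
    alter_table(:rna_seqs) do
      add_column :expression, 'float8[]'   # Postgres float array
    end
  end
end

# Hard-coded for the moment; ideally pulled from another table,
# e.g. ensembl::gene::gene_id.
EXPRESSION_KEYS = %w[EGFR BRAF KRAS].freeze
```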

The submission can come in several forms:

1) A Matrix loader accepts square TSV format (i.e. not flattened) and can load data directly into each record (if the records exist). Row names (identifiers) and column names (keys) could be validated here.

2) An update to a single record. We might still request that the submission come in the form of a hash, rather than an array, which allows us to (1) order the items correctly and (2) ensure the user is using the appropriate keys (see the sketch below).
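A sketch of the hash-based single-record update from (2): re-order the submitted values to the fixed key order and reject unknown keys (this helper, and EXPRESSION_KEYS from the previous sketch, are hypothetical):

```ruby
def expression_array(submission, keys = EXPRESSION_KEYS)
  submission = submission.transform_keys(&:to_s)
  unknown = submission.keys - keys
  raise ArgumentError, "unknown keys: #{unknown.join(', ')}" if unknown.any?
  keys.map { |k| submission[k] }   # nil where a key was not submitted
end

expression_array({ "BRAF" => 20.0, "EGFR" => 10.0 })
# => [10.0, 20.0, nil]   (ordered EGFR, BRAF, KRAS)
```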

graft commented 3 years ago

Closed by #106.