openworm / tracker-commons

Compilation of information and code bases related to open-source trackers for C. elegans

Add an optional table of contents for multi-chunk WCON files #163

Open MichaelCurrie opened 7 years ago

MichaelCurrie commented 7 years ago

Say there is a WCON spread across 500 files ("chunks"). Going to timestamp X will require opening all these chunks unless there is a table of contents telling the reader which chunk to visit.

This table of contents should be auto-generated by the WCON writer, and should record which times and worms are stored in each of the other file "chunks". A WCON reader can then optionally use it to open only the necessary chunk for random access to particular timestamps.

To retain backward compatibility it could be a top-level object called "@tableofcontents". Or perhaps it could be a change to the top-level "files" object, adding hash and data range info to each file entry.

Implementations using this should probably put the "@tableofcontents", "units", and "metadata" objects, and an empty "data" object, in the first chunk, and the real data objects starting from the second chunk, to ensure efficient loading of only the data required.

@vivekv2 may have an example of a table of contents in action.

Ichoran commented 7 years ago

This sounds like a good idea overall, but I would recommend that the list of contents go at the end rather than the beginning. Either end is equally easy to find once it's all there, but if you want to write it in one sweep while taking data, you won't know what to put at the beginning. You can, however, tell when you finish writing a file whether you're completely done (so you write the ToC) or not (in which case you start writing the next file).

Ichoran commented 7 years ago

Also, it probably ought to go into the files object because it has mostly to do with the file structure, not the metadata.

Ichoran commented 7 years ago

How about letting "files" have a "contents" tag: a map from file names to an array of three-element arrays of the form ["wormID", 1.2, 3.4], where 1.2 and 3.4 are the earliest and latest times the worm is present, respectively? So, for example,

"files": {
  "current": "foo.2.wcon",
  "next":["foo.3.wcon"],
  "prev":["foo.1.wcon"],
  "contents": {
    "foo.1.wcon": [["1", 0, 10], ["2", 1, 7], ["3", 3, 10]],
    "foo.2.wcon": [["1", 11, 20], ["3", 11, 15], ["4", 17, 20], ["5", 13, 18]],
    "foo.3.wcon": [["1", 21, 30], ["4", 21, 28], ["6", 25, 28]]
  }
}
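To make the lookup concrete, here is a minimal Python sketch of how a reader might use such a "contents" map to pick the chunks worth opening. The function name `chunks_for_time` and its `worm_id` parameter are hypothetical and not part of any existing WCON reader; the sketch also assumes the earliest and latest times are inclusive bounds.

```python
# Hypothetical helper: not part of any existing WCON implementation.
def chunks_for_time(contents, t, worm_id=None):
    """Return the chunk file names whose entries cover time t.

    contents maps file name -> list of [worm_id, earliest, latest].
    If worm_id is given, only entries for that worm are considered.
    Bounds are assumed to be inclusive.
    """
    hits = []
    for filename, entries in contents.items():
        for wid, t_min, t_max in entries:
            if (worm_id is None or wid == worm_id) and t_min <= t <= t_max:
                hits.append(filename)
                break  # one matching entry is enough to need this chunk
    return hits


# The example "files" object from this comment.
files = {
    "current": "foo.2.wcon",
    "next": ["foo.3.wcon"],
    "prev": ["foo.1.wcon"],
    "contents": {
        "foo.1.wcon": [["1", 0, 10], ["2", 1, 7], ["3", 3, 10]],
        "foo.2.wcon": [["1", 11, 20], ["3", 11, 15], ["4", 17, 20], ["5", 13, 18]],
        "foo.3.wcon": [["1", 21, 30], ["4", 21, 28], ["6", 25, 28]],
    },
}
contents = files["contents"]

print(chunks_for_time(contents, 12))               # ['foo.2.wcon']
print(chunks_for_time(contents, 21, worm_id="4"))  # ['foo.3.wcon']
```

One thing the sketch makes visible: a query time that falls between two chunks (10.5 in the example above) matches no entry, so a reader would have to decide whether to report nothing or fall back to the nearest chunk.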

The recommendation would be to include this data in the last WCON file listed, but it could go anywhere.
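As a rough illustration of why putting this in the last file suits one-sweep writing, a writer could accumulate the per-chunk ranges while recording and only emit the map when it closes the final chunk. The class `ContentsBuilder` and its methods below are hypothetical names for this sketch, not an existing API.

```python
# Hypothetical accumulator: a sketch, not an existing WCON writer API.
class ContentsBuilder:
    """Tracks, per chunk file, the earliest and latest time each worm appears."""

    def __init__(self):
        self._ranges = {}  # file name -> {worm_id: [earliest, latest]}

    def record(self, filename, worm_id, t):
        """Note that worm_id has a data point at time t in the given chunk."""
        per_file = self._ranges.setdefault(filename, {})
        span = per_file.setdefault(worm_id, [t, t])
        span[0] = min(span[0], t)
        span[1] = max(span[1], t)

    def contents(self):
        """Build the "contents" map in the [worm_id, earliest, latest] form."""
        return {
            filename: [[wid, lo, hi] for wid, (lo, hi) in worms.items()]
            for filename, worms in self._ranges.items()
        }


builder = ContentsBuilder()
for filename, worm_id, t in [
    ("foo.1.wcon", "1", 0), ("foo.1.wcon", "2", 1),
    ("foo.1.wcon", "2", 7), ("foo.1.wcon", "1", 10),
    ("foo.2.wcon", "1", 11), ("foo.2.wcon", "1", 20),
]:
    builder.record(filename, worm_id, t)

# Called once, when the last chunk is about to be closed.
toc = builder.contents()
```

Only the running minimum and maximum per worm are kept, so the memory cost grows with the number of worms and chunks rather than with the number of recorded time points.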