openworm / tracker-commons

Compilation of information and code bases related to open-source trackers for C. elegans

How to stream WCON? #156

Closed MichaelCurrie closed 7 years ago

MichaelCurrie commented 7 years ago

One fundamental issue with the JSON format we've chosen is that because the data is stored as text without any underlying structure, the file can only be streamed as a time series (and only barely, since t is stored as a separate series). To stream it along any other dimension requires first processing the entire file.
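To make the problem concrete, here is a minimal sketch (the record contents are illustrative, not from any real file) showing why extracting a time window from a WCON-style document requires parsing the whole thing: `t` is a parallel array alongside `x` and `y`, so frames must be index-matched after a full parse.

```python
import json

# Hypothetical, minimal WCON-like record: "t" is a separate parallel
# array, so selecting a time window means scanning and index-matching.
record = {
    "id": "worm1",
    "t": [0.0, 0.5, 1.0, 1.5],
    "x": [[1.0], [1.1], [1.2], [1.3]],
    "y": [[2.0], [2.1], [2.2], [2.3]],
}
text = json.dumps({"units": {"t": "s", "x": "mm", "y": "mm"},
                   "data": [record]})

def time_window(wcon_text, t0, t1):
    """Extract frames with t0 <= t < t1.  The whole document must be
    parsed first, because coordinates are not keyed by time."""
    doc = json.loads(wcon_text)
    out = []
    for rec in doc["data"]:
        keep = [i for i, t in enumerate(rec["t"]) if t0 <= t < t1]
        out.append({"id": rec["id"],
                    "t": [rec["t"][i] for i in keep],
                    "x": [rec["x"][i] for i in keep],
                    "y": [rec["y"][i] for i in keep]})
    return out

print(time_window(text, 0.5, 1.5))  # frames at t = 0.5 and 1.0
```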

This is not a problem for 15-minute videos of skeleton and contour data, which might be 100 MB in size.

But if a lab wants to record time-series feature information alongside the skeleton and contour data, for multiple worms at once, the file can become too large to load into memory.

The file size is (very) approximately 6 MB x (# of minutes) x (# of worms) x (# of features), so 10 hours of video with 20 worms and 10 features might be 6 x (10 x 60) x 20 x 10 = 720,000 MB = 720 GB.
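Spelling out that back-of-envelope estimate (the 6 MB/unit constant is the rough figure from the text, not a measured value):

```python
# Rough file-size estimate: size ≈ 6 MB × minutes × worms × features.
mb_per_unit = 6        # approximate MB per worm-minute-feature (assumed)
minutes = 10 * 60      # 10 hours of video
worms = 20
features = 10

size_mb = mb_per_unit * minutes * worms * features
print(size_mb, "MB =", size_mb / 1000, "GB")
```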

The WCON format is convenient if we wish to look at a single worm or data series over the entire recording, but if a use case involves looking at a particular subset of frames, having to load the entire file is rather inconvenient.

Furthermore, even streaming the data is difficult because the timestamps are in a separate series from the coordinate information.

Currently the Python implementation looks at one worm at a time, but it doesn't attempt to break the file down further when processing, so it cannot even stream along the time dimension, only along the worm dimension.

Is there a solution to these problems? HDF5? @ver228, @cheelee and I were discussing this today.

Thanks

Ichoran commented 7 years ago

This issue can be solved by chunking the data in time across separate files. Then you need only load those files which cover the appropriate time windows. (You can embed a table of contents in the metadata under a custom tag if you need to figure out what is going on.)
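A minimal sketch of that chunking idea, assuming the same parallel-array record layout as above. The `"@toc"` tag name is illustrative only; WCON reserves `"@"`-prefixed keys for custom data, but no specific table-of-contents tag is defined by the spec.

```python
def chunk_by_time(records, window):
    """Split WCON-style records (each with parallel "t"/"x"/"y" arrays)
    into fixed-width time windows, and build a table of contents mapping
    each window index to the time span it covers.  Each chunk would be
    written out as its own WCON file; the TOC goes in the metadata under
    a custom tag ("@toc" here is an assumed name, not part of the spec)."""
    chunks = {}
    for rec in records:
        # Group frame indices of this record by time window.
        per_window = {}
        for i, t in enumerate(rec["t"]):
            per_window.setdefault(int(t // window), []).append(i)
        for key, idx in per_window.items():
            piece = {"id": rec["id"],
                     "t": [rec["t"][i] for i in idx],
                     "x": [rec["x"][i] for i in idx],
                     "y": [rec["y"][i] for i in idx]}
            chunks.setdefault(key, []).append(piece)
    toc = {str(k): {"t_min": k * window, "t_max": (k + 1) * window}
           for k in sorted(chunks)}
    return chunks, {"@toc": toc}

records = [{"id": "w1", "t": [0.0, 5.0, 12.0],
            "x": [[0], [1], [2]], "y": [[0], [1], [2]]}]
chunks, meta = chunk_by_time(records, window=10)
print(sorted(chunks), meta)
```

With the TOC in hand, a reader can open only the chunk files whose time spans intersect the window of interest.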

MichaelCurrie commented 7 years ago

OK, thanks @ichoran. I will close this for now, but I may have more questions about this later.

MichaelCurrie commented 7 years ago

In principle what you describe is possible, but the way I've implemented the Python version, all of the data is loaded at once, including things like the origin offsets. Getting it to work in a chunked, streaming fashion would require a complete re-think, probably without pandas.

aexbrown commented 7 years ago

Did we not include the 'next file' and 'previous file' functionality as part of the base description? If people take advantage of that structure to pre-chunk files, the issue with large files shouldn't arise. It would be great to support streaming eventually, but it doesn't seem to be high priority at this point.