labthings / python-labthings

Python implementation of LabThings, based on the Flask microframework
GNU General Public License v3.0

Automatic serialization/deserialization of Numpy arrays #13

Open jtc42 opened 4 years ago

jtc42 commented 4 years ago


We should introduce into the spec common scientific data types, and sensible ways to (de)serialize them.

Start with Numpy arrays (as we've already done this, see link above).

Discussion: What other scientific data types might be useful?

ChasNelson1990 commented 4 years ago

So JSON is definitely a good way to deal with dicts and even pd.DataFrame objects, but is it the right way to deal with np.ndarrays?

Also, many of our arrays might be more suitable as xarray objects? In which case using hdf5 as a base type makes sense. xarray recommends netCDF (based on hdf5).

Pandas has hdf5 support (it might be an additional package, can't remember) and there's for more general use.

Worth considering compression though, which is not covered by most existing python hdf5 packages (I think).

jtc42 commented 4 years ago

So the logic here is that, because of the structure of the API, and especially the websocket-based data event stuff, we should be able to include data within a JSON object.

This is similar to how things like OME XML work, in that they have XML metadata which contains a binary blob.

I'm more than happy to have suggestions on encoding formats, but whatever we choose, it would be nice if it could be sensibly embedded within a JSON object.

np.ndarrays are nice because they're just C-contiguous arrays. The encoding we use in the OFM software gives you type information, array dimensions, and a base64 encoded binary blob of the array, which means it can be used outside of Python if you want.
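A minimal sketch of that kind of encoding, assuming a JSON object that carries a type tag, the dtype, the shape, and a base64 blob. The function names, the use of `dtype.str` for the type information, and the wrapping `"name"` key are illustrative assumptions, not the actual OFM implementation.

```python
import base64
import json

import numpy as np


def ndarray_to_json_dict(arr: np.ndarray) -> dict:
    """Describe an array as a JSON-safe dict (hypothetical helper)."""
    arr = np.asarray(arr)
    return {
        "@type": "ndarray",
        # dtype.str (e.g. "<f4") records byte order as well as element type.
        "dtype": arr.dtype.str,
        "shape": list(arr.shape),
        # tobytes() emits elements in C (row-major) order by default,
        # even when the array itself is not C-contiguous in memory.
        "base64": base64.b64encode(arr.tobytes()).decode("ascii"),
    }


def ndarray_from_json_dict(obj: dict) -> np.ndarray:
    """Rebuild the array from the dict produced above (hypothetical helper)."""
    raw = base64.b64decode(obj["base64"])
    flat = np.frombuffer(raw, dtype=np.dtype(obj["dtype"]))
    # frombuffer returns a read-only view; copy() makes the result writable.
    return flat.reshape(obj["shape"]).copy()


# Round trip through a JSON string, with the array nested inside other data.
original = np.arange(6, dtype=np.float32).reshape(2, 3)
payload = json.dumps({"name": "frame", "data": ndarray_to_json_dict(original)})
restored = ndarray_from_json_dict(json.loads(payload)["data"])
```

Because the blob is just raw bytes plus a dtype and shape, a client in another language only needs a base64 decoder and a typed-array view to read it. This sketch does not handle object or structured dtypes.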

That said, I'm not married to the idea.

We could move to a model of JSON linking to a separate binary file, but especially for small data sets, there's something appealing about the data and its metadata all being contained within a single object.

As usual, open to suggestions.

ChasNelson1990 commented 4 years ago

Hmm... that makes sense I guess... a quick google on why not use hdf5 came up with- This, and other links, do seem to suggest storing big blobs separately with JSON holding the metadata and so, for small things, just using JSON.

For reference, this matches the xarray system too, I believe: the actual array is just a numpy array, and the xarray object basically wraps that with a 'metadata' layer, i.e. like column names in a pd.Series.

jtc42 commented 4 years ago

Oh that's super useful, thanks!

So it might be that we just add in xarray support (it does seem really sensible) which would serialise like:


    {
        "@type": "ndarray",
        "dtype": << data type >>,
        "shape": << array shape >>,
        "base64": << base 64 encoded blob >>
    }


    {
        "@type": "xarray",
        "dtype": << data type >>,
        "shape": << array dims >>,
        "coords": << xarray coords dict >>,
        "attrs": << xarray attrs dict >>,
        "base64": << base 64 encoded blob >>
    }

Then in cases where the data is being stored separately, we just return a link to the object binary (.npz (numpy), netCDF (xarray)) file.
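A rough sketch of how the inline-versus-linked choice might look for plain numpy arrays. The size cut-off, the `"href"` key, the function name, and the use of a local file URI are all assumptions made for illustration; nothing here is part of the LabThings spec.

```python
import base64
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical cut-off: arrays larger than this are stored separately.
INLINE_LIMIT_BYTES = 64 * 1024


def describe_array(arr: np.ndarray, store_dir: Path, name: str) -> dict:
    """Embed small arrays as base64; save large ones to .npz and link to them."""
    arr = np.asarray(arr)
    meta = {"@type": "ndarray", "dtype": arr.dtype.str, "shape": list(arr.shape)}
    if arr.nbytes <= INLINE_LIMIT_BYTES:
        meta["base64"] = base64.b64encode(arr.tobytes()).decode("ascii")
    else:
        path = store_dir / f"{name}.npz"
        # savez_compressed writes a zip-compressed .npz archive.
        np.savez_compressed(path, data=arr)
        # Hypothetical key: a real API would more likely return an HTTP URL.
        meta["href"] = path.as_uri()
    return meta


store = Path(tempfile.mkdtemp())
small = describe_array(np.zeros((4, 4), dtype=np.uint8), store, "small")
large = describe_array(np.zeros((512, 512), dtype=np.float64), store, "large")
```

Here `small` carries its data inline, while `large` only carries metadata plus a link, so the JSON stays compact either way.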

ChasNelson1990 commented 4 years ago

Looks sensible.

glyg commented 2 years ago

Hi! For now, is it OK to copy/paste the openflexure serialization code into one's own code base? For smallish data, it seems sufficient to base64 encode an array directly in the response.

jtc42 commented 2 years ago


Yeah you should be able to use whatever code you like as long as it's within the GPL license (