pygae / clifford

Geometric Algebra for Python
http://clifford.rtfd.io
BSD 3-Clause "New" or "Revised" License

Storage and transmission of large arrays of multivectors, i.e. an mv file format of some sort #61

Open hugohadfield opened 5 years ago

hugohadfield commented 5 years ago

As part of my research I have to work with large arrays of multivectors; often they are blades and thus very sparse. Currently I have no really good solution for the storage and transmission of these arrays.

Ideally we would have some kind of file format that encodes the sparse multivectors and contains header information about what algebra they come from etc.

Before I put too much thought into this in isolation I thought I might consult the wider community and see if we can come up with something/adopt some existing format that others might know better than me!

So @arsenovic @moble @enkimute @tingelst @utensil @chrisjldoran @darkshikamo @penguian anyone got any smart ideas for a good numerical multivector file format that we might aim to make a standard?

hugohadfield commented 5 years ago

@eric-wieser you might be interested in this as well

hugohadfield commented 5 years ago

I'm currently thinking https://www.hdfgroup.org/solutions/hdf5/ might be a good way to go

arsenovic commented 5 years ago

i don't have much experience with IO, but you might look at xarray.
they support dask, which might be interesting to you.

hugohadfield commented 5 years ago

xarray looks interesting!

Here is an example of writing to hdf5: https://gist.github.com/hugohadfield/7dfa44baa7817996514a52a2e6e3fce7

not totally sure what metadata we would like to store with our algebras; I'll have a go at writing the loading of an algebra off disk, and that will probably clarify what needs to be included

moble commented 5 years ago

I've done a lot of work with HDF5 (in fact, I recently contributed to h5py), and I think it's a very good option. It's so widely used and actively supported that you know it will be stable and available for a long time. It's also designed to be very flexible and descriptive, so you can do things like add attributes to describe the data (e.g., the signature of the algebra, or subtypes like bivector or spinor if your dataset is restricted like that).

The compression can be very good. So good, in fact, that I bet you don't even need to think about the sparseness, because all those zeros would be compressed very efficiently. This would also increase compatibility if people with different codes (e.g., Hugo in 2025) want to use these data files.

And if you expect any compressibility, I would just enable compression by default. But I also expect that a compression level of 9 is overkill. It may just be due to the very different types of data I've used, but I've never found much improvement in size over level 1, though the time to compress increases rapidly over level 5 or so. Of course, you should probably do some tests to check these things for realistic datasets you might use.
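To make the attribute idea concrete, here is a minimal sketch of writing and reading such a file with h5py. The attribute names (`signature`, `subtype`) and shapes are illustrative assumptions, not an agreed format:

```python
import numpy as np
import h5py

# Illustrative data: 1000 multivectors in a hypothetical 5D algebra (32
# coefficients each), with only a few nonzero blades per multivector.
coeffs = np.zeros((1000, 32))
coeffs[:, [0, 6, 10]] = np.random.rand(1000, 3)

with h5py.File('multivectors.h5', 'w') as f:
    # gzip level 1: per the discussion above, higher levels rarely help much.
    dset = f.create_dataset('data', data=coeffs,
                            compression='gzip', compression_opts=1)
    dset.attrs['signature'] = [1, 1, 1, 1, 1]   # metric signature (assumed name)
    dset.attrs['subtype'] = 'blades'            # free-form description

with h5py.File('multivectors.h5', 'r') as f:
    loaded = f['data'][:]
    sig = f['data'].attrs['signature']
```

gzip is lossless, so the zeros compress away without any explicit sparse encoding.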

Numpy could be another good alternative, especially if you have a sparse data format — e.g., pairs of int and float representing blade and coefficient. Numpy can represent those as custom dtypes with something like

a = np.empty(29, np.dtype([('blade', np.uint16), ('coefficient', np.float64)]))

Alternatively, you could also rely on compression and just save and load the non-sparse representations. Either way, you can save it with something like

np.savez_compressed('file/name', a=a)

and load it again with

with np.load('file/name.npz') as data:
    a = data['a']

I don't know of any simple or standard way to store attributes, but your approach of including the metric and basis_names could also be done here. And if you do go with a sparse format in a custom dtype, the save/load process preserves that dtype.
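Putting those pieces together, here is a small sketch (field and file names are illustrative) showing that the structured dtype survives a `savez_compressed`/`load` round trip, with metadata such as the metric stored as extra arrays in the same archive:

```python
import numpy as np

# Hypothetical sparse multivector format: (blade index, coefficient) pairs.
mv_dtype = np.dtype([('blade', np.uint16), ('coefficient', np.float64)])
a = np.zeros(4, dtype=mv_dtype)
a['blade'] = [0, 3, 5, 6]
a['coefficient'] = [1.0, -2.5, 0.5, 3.0]

metric = np.diag([1.0, 1.0, 1.0])  # illustrative metadata array

np.savez_compressed('mvs.npz', a=a, metric=metric)

with np.load('mvs.npz') as data:
    b = data['a']
    m = data['metric']

assert b.dtype == mv_dtype  # the custom dtype is preserved
```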

All three of these — h5py, numpy, and xarray — are standard packages that come with a basic anaconda installation, or are easily installed via pip. Personally, I'd lean toward an HDF5 format, because it is so widely available, stable, and flexible, and isn't restricted to Python.

moble commented 5 years ago

I wrote up a quick comparison of the various numpy and h5py compression options here. I constructed an array of 1000 random spinors in 6-d. In general, I don't see any way you could construct a sparse data format that is clever enough to get more than a 50% space savings. And yet, simple compression does even better than that in almost every case. Maybe you could squeeze out a little more space with both a clever sparse representation and compression, but I'm skeptical. And since compression is so much easier to code, I'd say it's a clear winner.
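The flavour of that comparison can be reproduced with a rough sketch like the following (shapes, sparsity, and the use of raw zlib rather than any particular file format are my own assumptions):

```python
import zlib

import numpy as np

# Dense array of sparse multivectors: 8 nonzero blades out of 64.
rng = np.random.default_rng(0)
coeffs = np.zeros((1000, 64))
cols = rng.choice(64, size=8, replace=False)
coeffs[:, cols] = rng.random((1000, 8))

raw = coeffs.tobytes()
for level in (1, 5, 9):
    ratio = len(zlib.compress(raw, level)) / len(raw)
    print(f"level {level}: {ratio:.2%} of original size")
```

With 56 of 64 columns all zero, even level 1 comfortably beats the ~50% that a hand-rolled sparse encoding might achieve here.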

enkimute commented 5 years ago

For high-dimensional algebras (I've just been playing with QCGA, in R9,6, a 32768-dimensional vector space) there may be other considerations besides compressibility. Those algebras are not practically implementable on today's hardware without a sparse representation. Perhaps the choice of file format should not be focused only on the spaces of interest today, but leave some room for our future selves. A forced non-sparse format would mean scanning 32k numbers to find the 10 that are not zero.

If the choice is for hdf5, perhaps we could get the best of both worlds: represent multivectors as multidimensional arrays, and augment each of these datasets with attributes specifying which of the basis elements appear in that particular dataset. That way, sparse items of the same 'type' go together in the same array, and when working in low-dimensional spaces of just 5 or 6 dimensions one could choose to simply have one big dataset with all coefficients included for each multivector.

This keeps the simple low-dimensional case the same (just one block with multivectors, all components present) but leaves room for situations where that is not such a good idea. (It's quite impossible for infinite-dimensional spaces ;))
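One way that grouping-by-type idea might look in h5py (dataset and attribute names here are illustrative, not a proposal):

```python
import numpy as np
import h5py

# 100 objects of the same sparse 'type': only 8 nonzero blades each,
# stored densely per-type with an attribute recording which blades.
rotors = np.random.rand(100, 8)
support = np.array([0, 3, 5, 6, 9, 10, 12, 15], dtype=np.uint64)

with h5py.File('sparse_mvs.h5', 'w') as f:
    d = f.create_dataset('rotors', data=rotors, compression='gzip')
    d.attrs['support'] = support  # which basis blades the columns are

with h5py.File('sparse_mvs.h5', 'r') as f:
    sup = f['rotors'].attrs['support']
    # Expand to dense coefficients only when needed (assumed 32-d algebra):
    dense = np.zeros((100, 32))
    dense[:, sup] = f['rotors'][:]
```

Low-dimensional users could simply store one dataset whose support is every blade.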

hugohadfield commented 5 years ago

Nice write-up @moble and good points from @enkimute. I like the idea of hdf5 a lot, as it is multi-language and very flexible. How about this for a v0.0.1 hdf5 file:

Attributes:

- version: 0.0.1

Datasets:

- metric (dense array of doubles): the inner product table for the algebra
- basis_names (array of strings): the basis names for the algebra
- data (dense array of doubles): the multivector coefficients

data then additionally has attributes:

- sparse (bool): if true then we provide the support; if false then assume the data is dense and ignore support
- support (dense array of uint64): the indices of the non-zero coefficients in the multivector
- transpose (bool): if true the data is stored transposed, to improve compression for columns of zeros

Each file can have multiple data fields, allowing us to store several differently sparse arrays of objects in the same file.

How about that? Any ideas/suggestions? Do we need any other data to reconstruct the algebras etc?
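For concreteness, the proposed layout might be written with h5py roughly like this. The field names follow the proposal above; all values (a small 3D algebra, 10 multivectors) are illustrative:

```python
import numpy as np
import h5py

metric = np.diag([1.0, 1.0, 1.0])               # inner product table
basis_names = np.array([b'e1', b'e2', b'e3'])   # basis names as fixed bytes
data = np.random.rand(10, 3)                    # 10 sparse multivectors
support = np.array([1, 2, 4], dtype=np.uint64)  # nonzero blade indices

with h5py.File('ga_v001.h5', 'w') as f:
    f.attrs['version'] = '0.0.1'
    f.create_dataset('metric', data=metric)
    f.create_dataset('basis_names', data=basis_names)
    d = f.create_dataset('data', data=data, compression='gzip')
    d.attrs['sparse'] = True
    d.attrs['support'] = support
    d.attrs['transpose'] = False

with h5py.File('ga_v001.h5', 'r') as f:
    version = f.attrs['version']
    loaded = f['data'][:]
    is_sparse = f['data'].attrs['sparse']
```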

penguian commented 5 years ago

Hi all, sorry for my late response. I have not really considered serialization, but would have done something simpler, such as using a Python dict and one or more of Pickle, CSV or JSON to serialize it.

This is because the Clifford algebras that GluCat deals with are simply defined as an algebra whose object values are mappings from a set of finite sets of non-zero integers to a field, such as the real numbers, and whose generators anticommute and square to either -1 or 1.

If you want to be fancy, you can re-label the generators using a 1-1 mapping.

Basic layout of such a value dict:

{ {-30,-21,3,5}: 21.3, {-1,1}: -12.222, {3}: 1.0, {5}: 2.0 }

represents the value 21.3{-30,-21,3,5} - 12.222{-1,1} + 1.0{3} + 2.0{5} in R_{-30,-21,-1,1,3,5}, which is isomorphic to R_{3,3}.

Here the sets are subsets of {-30,-21,-1,1,3,5} where the set is written in canonical increasing order. Generators with negative indices have square -1. Generators with positive indices have square +1.
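As a runnable Python version of the basic layout above: built-in sets are unhashable, so `frozenset` has to stand in for the index-set keys (the helper function is hypothetical, not GluCat's API):

```python
# The basic value dict: index sets map to real coefficients.
value = {
    frozenset({-30, -21, 3, 5}): 21.3,
    frozenset({-1, 1}): -12.222,
    frozenset({3}): 1.0,
    frozenset({5}): 2.0,
}

def generator_square(i):
    """Generators with negative indices square to -1, positive ones to +1."""
    return -1.0 if i < 0 else 1.0
```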

The fancy layout would be (e.g.):

{ 'gen': {1: -30, 2: 3, 3: -21, 4: 5, 5: 1, 6: -1},
  'val': { {-30,-21,3,5}: 21.3, {-1,1}: -12.222, {3}: 1.0, {5}: 2.0 } }

Here the value is still the same and means the same thing. The 'gen' dict defines a mapping from an arbitrary generator naming convention to the actual indices used in the index sets. When the value is printed, each index set is translated into the corresponding fancy generator set, which is then ordered, taking sign into account.

For example, 21.3{-30,-21,3,5} - 12.222{-1,1} + 1.0{3} +2.0{5} becomes

21.3 e_1324 - 12.222 e_65 + 1.0 e_2 + 2.0 e_4 which becomes

-21.3 e_1234 + 12.222 e_56 + 1.0 e_2 + 2.0 e_4.
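The sign flips in that reordering come from the parity of the permutation that sorts the generator indices: each swap of adjacent anticommuting generators contributes a factor of -1. A small sketch (a hypothetical helper, not GluCat's implementation):

```python
def sort_sign(indices):
    """Return (sorted indices, sign) for a product of anticommuting
    generators: sign is (-1) to the number of inversions."""
    idx = list(indices)
    sign = 1
    for i in range(len(idx)):
        for j in range(i + 1, len(idx)):
            if idx[i] > idx[j]:
                sign = -sign
    return sorted(idx), sign

print(sort_sign([1, 3, 2, 4]))  # one inversion, so e_1324 -> -e_1234
print(sort_sign([6, 5]))        # one inversion, so e_65 -> -e_56
```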

To save space, if an entire sequence of values uses the same arbitrary generator naming convention, then the fancy layout would use a list:

{ 'gen': {1: -30, 2: 3, 3: -21, 4: 5, 5: 1, 6: -1},
  'val': [val_1, val_2, ...] }

where val_1, val_2 are basic value dicts, as above.

How does HDF5 specify a map from integer sets to floating point numbers? All the best, Paul


-- Paul Leopardi - https://sites.google.com/site/paulleopardi/

hugohadfield commented 5 years ago

I had a go with h5py and this library; I accidentally committed and pushed it to my code generator branch, but oh well: https://github.com/pygae/clifford/commit/ce5e2f5b3bfe0113147e131d6eb725f547ca98ea It seems to work quite nicely and was pretty easy to do!

@penguian GluCat sounds like it has a very different interface, but presumably it would work in a very similar way for reading and writing multivectors from this file format: you would just specify the fancy names of the basis vectors as the basis names, and could add some user data into the hdf5 file to map directly between the fancy names and the original sets? Hope that makes sense, but I may have misunderstood how GluCat works...