mikepound / opencubes

A community improved version of the polycubes project!

Standard cube format. #8

bertie2 opened this issue 1 year ago

bertie2 commented 1 year ago

Given that we now have a second implementation language tentatively waiting as a pull request, it would be good to agree on a standard format for cubes at some point so that multiple implementations can share data. Could any ideas for this format go here, please?

mikepound commented 1 year ago

A few of the discussions in other issues have talked about a more efficient binary representation. If you continued with an xyz00010101010 style encoding, I think this would be unambiguous, and fast to read and write. This would also be quite space efficient, which would help with the other comments about the eventual ballooning size of the npy files. Not sure this is the best approach, but off the top of my head I wondered about

x     y     z     data
[byte][byte][byte][bytes...]

The size of the data bytes (padding required) would I guess be calculated as ceil(x*y*z / 8)? You could shave a byte by encoding x, y and z into the first 15 bits of [byte][byte] as 5 bits each, with each max dimension then being 32, which seems plenty since the biggest dimension size is n.

You could forgo padding, but you'd end up having to extract all of the data from within bytes at strange offsets; in the end I don't know if that's a worthwhile approach when LZ compression would probably do?
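For illustration, a minimal sketch of that encoding in Python (numpy's packbits handles the zero padding; the helper name and the bit order within each byte are assumptions, not a settled format):

```python
import numpy as np

def encode_cube(grid: np.ndarray) -> bytes:
    """Encode a boolean (x, y, z) occupancy grid as [x][y][z][packed bits].

    Sketch only: assumes each dimension fits in one byte; packbits pads the
    flattened bit string to whole bytes, i.e. ceil(x*y*z / 8) data bytes.
    """
    x, y, z = grid.shape
    flat = grid.astype(np.uint8).ravel(order="C")   # row-major flatten
    return bytes([x, y, z]) + np.packbits(flat).tobytes()
```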

bertie2 commented 1 year ago

I think that is pretty much optimal for the raw data structure.

However, it might be useful to store some metadata on which particular version of the cubes is stored (e.g. their orientation), and how many are stored in this file, so that programs don't have to redo rotations when importing because they aren't sure whether they use the same rotations as the creating program.

I'm proposing a file format, and have written some Python code to convert to and from the existing file format. You can find the converter code at https://github.com/bertie2/opencubes/tree/file_format

extension: .pcube
data: binary
header:

[4 bytes] [byte]      [byte]      [bytes...] [bytes...]
magic     orientation compression cube_count body

[4 bytes] magic: for uniquely identifying this format, takes the raw hex value

0xCBECCBEC

[byte] orientation: allows for different future canonical orientations. Current options are:

0 : UNSORTED : these are in arbitrary rotations; rotate them yourself to get whatever your canonical rotations are.
1 : BITWISE_HIGHEST_VALUE : the rotation of the cube which has the highest value when compared byte by byte in the cube format (see the body format for details).

More orientations can be added if better methods for finding identical rotations are found.
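As a sketch of how BITWISE_HIGHEST_VALUE could be computed (using the usual numpy trick for enumerating the 24 axis-aligned rotations; comparing the serialized cube bytes is my reading of the rule, not verified against the converter):

```python
import numpy as np

def rotations24(grid: np.ndarray):
    """Yield all 24 axis-aligned rotations of a 3D occupancy grid."""
    def spins(g, axes):
        for i in range(4):
            yield np.rot90(g, i, axes)
    yield from spins(grid, (1, 2))
    yield from spins(np.rot90(grid, 2, (0, 2)), (1, 2))
    yield from spins(np.rot90(grid, 1, (0, 2)), (0, 1))
    yield from spins(np.rot90(grid, 3, (0, 2)), (0, 1))
    yield from spins(np.rot90(grid, 1, (0, 1)), (0, 2))
    yield from spins(np.rot90(grid, 3, (0, 1)), (0, 2))

def canonical_rotation(grid: np.ndarray) -> np.ndarray:
    """Pick the rotation whose serialized cube bytes compare highest."""
    def serialize(g):
        x, y, z = g.shape
        flat = g.astype(np.uint8).ravel(order="C")
        return bytes([x, y, z]) + np.packbits(flat).tobytes()
    return max(rotations24(grid), key=serialize)
```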

[byte] compression: indicates whether the body data is compressed:

0: uncompressed data stream
1: gzip compressed data stream

[bytes...] cube_count: a LEB128-encoded, arbitrarily large unsigned integer giving the number of cubes in this file / stream (https://en.wikipedia.org/wiki/LEB128). 0 indicates that this is a variable-length file / stream and will simply end abruptly.
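For reference, a hedged sketch of reading this header in Python (stream-based, error handling omitted; the helper names are mine, and the magic is assumed to be written as the raw bytes CB EC CB EC):

```python
MAGIC = bytes.fromhex("CBECCBEC")

def read_uleb128(stream) -> int:
    """Decode an unsigned LEB128 integer from a binary stream."""
    result, shift = 0, 0
    while True:
        byte = stream.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7

def read_header(stream):
    """Return (orientation, compression, cube_count) after checking the magic."""
    if stream.read(4) != MAGIC:
        raise ValueError("not a .pcube file")
    orientation = stream.read(1)[0]
    compression = stream.read(1)[0]
    cube_count = read_uleb128(stream)   # 0 means the stream just ends when it ends
    return orientation, compression, cube_count
```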

body:

x     y     z     data
[byte][byte][byte][bytes...]
[byte] x:
an 8 bit unsigned integer representing the cube's x dimension
[byte] y:
an 8 bit unsigned integer representing the cube's y dimension
[byte] z:
an 8 bit unsigned integer representing the cube's z dimension
[bytes...] data:
the flattened bits of the polycube, where 1 is a filled space and 0 is empty, flattened in row-major (C-style) order, stored little-endian, padded to the nearest byte with zeros.
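A sketch of reading one such record back with numpy (treating "little endian" as little-endian bit order within each byte, which is my interpretation of the wording):

```python
import numpy as np

def read_cube(stream) -> np.ndarray:
    """Read one [x][y][z][data] record and return a boolean (x, y, z) grid."""
    x, y, z = stream.read(3)
    n_bits = x * y * z
    n_bytes = (n_bits + 7) // 8                      # padded to the nearest byte
    raw = np.frombuffer(stream.read(n_bytes), dtype=np.uint8)
    bits = np.unpackbits(raw, count=n_bits, bitorder="little")
    return bits.reshape((x, y, z)).astype(bool)      # row-major (C-style)
```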

As a note, converting the n=12 file to pcube format reduces the size from 2 GB to 180 MB.

datdenkikniet commented 1 year ago

I would like to propose a change so we can add writing blocks of cubes of the same size. This would entail:

  1. A bit flag in the header indicating this format (we don't have any reserved bits at the moment that we could use for this, so we need a flag-byte).
  2. For every block: an x, y, z size, a LEB128 number indicating how many cubes of that size follow, and then that cube data in the already-defined row-major (C-style) layout.

This would reduce the amount of data stored per cube by 3 bytes, and could shrink the file significantly.

This would also allow for far more efficient in-file deduplication because you can plan for the size you need to read much more efficiently.

Edit: uh, right. We have 7 bits left over in the byte that stores the orientation... I propose we add the bitflag to that :P
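To make the proposal concrete, a rough sketch of writing such blocks (grouping by shape; the feature-flag handling is left out and all names are placeholders, not an agreed spec):

```python
from collections import defaultdict
import numpy as np

def write_uleb128(stream, value: int) -> None:
    """Encode an unsigned integer as LEB128."""
    while True:
        byte = value & 0x7F
        value >>= 7
        stream.write(bytes([byte | (0x80 if value else 0)]))
        if not value:
            return

def write_blocks(stream, cubes) -> None:
    """Group cubes by (x, y, z) and write one block per shape:
    [x][y][z][LEB128 count][packed cube data ...]."""
    by_shape = defaultdict(list)
    for grid in cubes:
        by_shape[grid.shape].append(grid)
    for (x, y, z), group in by_shape.items():
        stream.write(bytes([x, y, z]))
        write_uleb128(stream, len(group))
        for grid in group:
            flat = grid.astype(np.uint8).ravel(order="C")
            stream.write(np.packbits(flat, bitorder="little").tobytes())
```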

datdenkikniet commented 1 year ago

As a note on why the "storing by blocks" change may be worth it: performing that optimization on the N = 12 cubes file decreases the size from 175 MiB to 121 MiB without compression, and from 111 MiB to 80 MiB with compression. Constructing the file takes some time, but I think the size-reduction factor and the fact that reading the file in parallel becomes a lot easier make it quite worth it.

bertie2 commented 1 year ago

Agreed, I will open a v2 spec in a PR for the converter, use a different magic to differentiate v2, and add 4 bytes of feature flags plus support for in-file blocks.

bertie2 commented 1 year ago

Tools interacting with the files should preferably maintain backwards compatibility with the v1 format, but that's optional; it's not like this is a commonly re-run project.

datdenkikniet commented 1 year ago

Including a markdown file with the cube format at the top level of the project may also be a good idea :) That way you can't miss it, and it's easy to figure out what the current "real" spec is :D

datdenkikniet commented 1 year ago

I have three more pieces of information that would be excellent to have in the header:

  1. uniqueness => true if all cubes in the file are unique when transformed into some canonical form (no guarantee about the specific canonical form, unless the rotation bit is also set).
  2. n => the size that all polycubes in this file have, or 0 if that is unknown or varies.
  3. (Optional, probably too specific) "at least 1" (n != 0) => the file contains at least one copy of all unique cubes of size N, but may contain multiple copies of individual cubes (possibly in different rotations, if orientation == 0)

This way you can deduce if a cache file contains all unique polycubes of size n, or the file can tell you outright if uniqueness == true && "at least 1" == true.

Edit: this assumes that these properties are not already enforced by the format, of course. If they are, that should probably be in the spec. Adding n would still be nice, that way you can "avoid" being forced to look at the first cube to figure out the number.
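Purely as illustration (none of these fields exist in the format yet), a reader-side check under a hypothetical flag byte plus an n byte:

```python
def cache_is_complete(flags: int, n: int) -> bool:
    """Hypothetical layout: bit 0 = uniqueness, bit 1 = "at least 1".

    A cache file provably contains exactly the unique polycubes of size n
    only when both flags are set and n is known (non-zero).
    """
    uniqueness = bool(flags & 0b01)
    at_least_one = bool(flags & 0b10)
    return uniqueness and at_least_one and n != 0
```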

datdenkikniet commented 11 months ago

One last request: it would be really nice to support other compression mechanisms. gzip is quite slow, especially if you want to compress large sizes (the file I've generated for N=16 is around 900 GB uncompressed). Other algorithms like brotli and/or zstd would be great additions.
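For what it's worth, swapping the compression layer is cheap on the reader side; a sketch using the standard-library gzip module and the third-party zstandard package (the id value 2 for zstd is hypothetical, not in the v1 spec):

```python
import gzip

def open_body(stream, compression: int):
    """Wrap the raw body stream according to the header's compression byte."""
    if compression == 0:
        return stream                                  # uncompressed
    if compression == 1:
        return gzip.GzipFile(fileobj=stream, mode="rb")
    if compression == 2:                               # hypothetical zstd id
        import zstandard
        return zstandard.ZstdDecompressor().stream_reader(stream)
    raise ValueError(f"unknown compression id {compression}")
```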