Writing as pure JSON is impractical for large file sizes. Regarding the "pros" above - human readability is technically possible (e.g. paging through the file with less), but easy access by all users is unrealistic. Regarding the "cons" -
Store JSON metadata and binary as separate files within a standard, zipped folder structure. The header/metadata would look similar to the metadata above, but instead of base64 inline, binary data is specified by paths to files. Something like:
my_file.omf
|- metadata.json
|- pointset_vertices_array.bin
|- surface_data/
| |- vertices.bin
| |- triangles.bin
| |- data.bin
| |- image.png
|- another_folder/
| |- possibly_more_metadata.json
| |- other_binary_file.bin
Standard archive tooling already exists everywhere (zip, 7zip, tar.gz). By changing the file extension from .omf to .zip, standard OS file browsers can open and view files. Binary data may be stored in standard formats (e.g. .png) or as simple binary arrays, with type info still stored in metadata. And it still presents a single .omf file to users who are only concerned with the full file, not internal data.

Zip Archive was decided in GMG workshop as the new standard.
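For illustration, reading such an archive needs no custom tooling; a minimal Python sketch, where the metadata layout, the field names, and the float64 dtype are assumptions for this example rather than anything in the spec:

```python
import json
import zipfile

import numpy as np

# Open the .omf file as an ordinary zip archive and parse the metadata.
with zipfile.ZipFile("my_file.omf") as archive:
    metadata = json.loads(archive.read("metadata.json"))

    # Hypothetical metadata entry: binary referenced by path, not inlined, e.g.
    # {"vertices": {"path": "pointset_vertices_array.bin", "dtype": "<f8"}}
    entry = metadata["vertices"]
    raw = archive.read(entry["path"])
    vertices = np.frombuffer(raw, dtype=entry["dtype"]).reshape(-1, 3)

print(vertices.shape)
```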
Has any thought been put into using something like SQLite instead of a zip archive?
The interesting thing about using an SQLite database and storing BLOBs in fields in tables is that you can also attach metadata to the blobs in adjacent columns in the same rows. Entities like triangular meshes could have two tables, points and triangles and be stored directly.
Disadvantages would be extra dependencies on the sqlite libraries and also changing database schemas as the format evolves.
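To make the comparison concrete, a hypothetical sketch with Python's built-in sqlite3 module; the table and column names here are invented for illustration:

```python
import sqlite3

import numpy as np

conn = sqlite3.connect("my_file.sqlite")

# Hypothetical schema: one row per array, metadata in columns next to the blob.
conn.execute(
    """CREATE TABLE IF NOT EXISTS arrays (
           uid   TEXT PRIMARY KEY,
           name  TEXT,
           dtype TEXT,
           data  BLOB
       )"""
)

points = np.random.rand(100, 3)
conn.execute(
    "INSERT INTO arrays VALUES (?, ?, ?, ?)",
    ("0f52a9b6-d641-ffea-99af-d2e840ca3187", "vertices", "<f8", points.tobytes()),
)
conn.commit()

# Metadata can be queried without ever touching the blob column.
(dtype,) = conn.execute("SELECT dtype FROM arrays WHERE name = 'vertices'").fetchone()
```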
I suggest only allowing one JSON document per archive. An argument could be made for splitting the JSON into multiple files, but I think this may just add complexity for little benefit, as long as project.json is kept fairly small and quick to parse.
Another suggestion for the layout of the zip archive structure, using UUIDs:
my_file.omf
|- project.json
|- point_set/
| |- 0f52a9b6-d641-ffea-99af-d2e840ca3187.bin (Vector3Array)
| |- some_data/
| | |- 0d4b91f1-a487-fa08-8838-047501d3764f.bin (ScalarArray)
| | |- f00a8e45-fbd5-f32f-ba59-6214392dadbf.bin (Int3Array)
|- stuff/
| |- af3403b9-9cf9-f614-a812-022bbba41a3a.bin (Vector3Array)
| |- 27d07ff2-ee67-f2a5-9698-932a3c313985.bin (Int3Array)
|- ...
The idea behind this layout is that only the stem component of the file names is used to identify the objects (UUIDs). This means that the "folders" (remember that zip folders aren't actually real, they're just a '/' in the compressed file name) are only used for organisation purposes and don't actually matter. It would therefore be valid to lay the same file out flat:
my_file.omf
|- project.json
|- 0f52a9b6-d641-ffea-99af-d2e840ca3187.bin (Vector3Array)
|- 0d4b91f1-a487-fa08-8838-047501d3764f.bin (ScalarArray)
|- f00a8e45-fbd5-f32f-ba59-6214392dadbf.bin (Int3Array)
|- af3403b9-9cf9-f614-a812-022bbba41a3a.bin (Vector3Array)
|- 27d07ff2-ee67-f2a5-9698-932a3c313985.bin (Int3Array)
|- ...
The downside here is that it's less evident what things actually are. Should we prioritise manual editability? Maybe a terminal app to read, edit and write OMF files would be more useful. project.json's layout would be similar to what is currently written to the end of an OMF file, except the binary types won't have start and length offsets. We can retrieve information such as compressed and uncompressed size from the Zip archive (though it could be stored in the JSON too, I suppose).
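For what it's worth, resolving a UUID regardless of its folder is only a few lines; a sketch reusing the UUIDs from the example layout above:

```python
import pathlib
import zipfile


def find_member(archive: zipfile.ZipFile, uid: str) -> str:
    """Locate an archive member by its UUID stem, ignoring any folder prefix."""
    for name in archive.namelist():
        if pathlib.PurePosixPath(name).stem == uid:
            return name
    raise KeyError(uid)


with zipfile.ZipFile("my_file.omf") as archive:
    member = find_member(archive, "0f52a9b6-d641-ffea-99af-d2e840ca3187")
    data = archive.read(member)
```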
As for the suggestion for SQLite I believe Zip is better. One of the negatives of using SQLite blobs is that afaik they have a max size of 2GB.
Thanks for this @dbrookes96 - I pretty much agree on all your points:
Regarding your point around storing compressed / uncompressed size - probably worth storing this in the JSON explicitly, for validation of attribute vs. element geometry, independent of deeper reading of the zip archive?
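Something like the following sketch, where the project.json structure and the uncompressed_size field name are invented for illustration:

```python
import json
import zipfile

with zipfile.ZipFile("my_file.omf") as archive:
    project = json.loads(archive.read("project.json"))
    # Hypothetical JSON entry declaring where the array lives and how big it is.
    entry = project["arrays"]["0f52a9b6-d641-ffea-99af-d2e840ca3187"]
    # The zip central directory already records both sizes; no decompression needed.
    info = archive.getinfo(entry["path"])
    if info.file_size != entry["uncompressed_size"]:
        raise ValueError("array size does not match declared element geometry")
```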
I wanted to say I also agree that SQLite is a better option. However, I think we should also move away from specific implementations and consider the requirements.
SQLite with binary BLOBs certainly fits these requirements.
As for the binary data, Base64 works. Have you looked into Protobuffers? We've had good experience with them and found them to be the fastest and easiest way to serialize and deserialize objects while maintaining forward and backward compatibility.
@fwkoch, one more note that might impact the format. We generally recommend reading the block model level by level, where a level is most likely a bench, with the thinking that usually one needs a bench of data plus a limited number of benches above or below. Rarely does one need to read the entire block model into memory.
With this line of thinking, perhaps the format should allow quick access to a level of the block model without the need to read the previous levels.
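One way the zip layout could support this is to store block-model arrays uncompressed (STORED rather than DEFLATE) and seek to the bench. A sketch assuming a C-ordered float64 array of shape (nz, ny, nx) stored as raw bytes; note that seeking within a DEFLATE member still decompresses everything before the offset:

```python
import zipfile

import numpy as np

NX, NY, NZ = 200, 150, 80  # assumed block model dimensions
BENCH_BYTES = NY * NX * np.dtype("<f8").itemsize  # one z-level of a C-ordered array


def read_bench(archive: zipfile.ZipFile, member: str, level: int) -> np.ndarray:
    """Read a single bench (z-level) without loading the whole block model."""
    with archive.open(member) as handle:
        # Cheap for STORED members; for DEFLATE this decompresses up to the offset.
        handle.seek(level * BENCH_BYTES)
        raw = handle.read(BENCH_BYTES)
    return np.frombuffer(raw, dtype="<f8").reshape(NY, NX)
```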
I must agree with @dbrookes96
Performance of SQLite will degrade rapidly if your table rows get large (and are not of a fixed size). For this reason, storing binary blob data in rows of SQLite is strongly discouraged. Mostly, apps that use SQLite put the SQL db inside a container (zip or otherwise) and have binary data sitting next to the SQLite db, with a reference from the db to the data.
One thing I find is being lost on a few people.
This is an "Interchange File Format."
Meaning, moving data from one software system to another, in which case you are usually exporting practically everything from one system and sucking everything into the other.
Random access, in my opinion, is not strictly necessary.
Cherry picking from the file, in my opinion, is not strictly necessary. If you want to cherry pick, just skip over the bits you do not care about. But you read the entire file, hopefully, in one pass.
Interchange files should be very temporary. They should only be used to migrate data and that is it; after that has been done, you delete the file. It does not, in my opinion, need edit capabilities. It does not 'need' to be fast (but no one will complain if it is). It does need to be 'easy' to move data between the systems, and that should be paramount.
I agree with @Zekaric. As an interchange format that is intended for import and export the requirements would be different. The Zip/JSON format makes sense in this case.
Love the ideas here, especially something in line with @dbrookes96's suggestion, zlib à la:
my_file.omf
|- project.json
|- data/
| |- 0f52a9b6-d641-ffea-99af-d2e840ca3187.bin (Vector3Array)
| |- 0d4b91f1-a487-fa08-8838-047501d3764f.bin (ScalarArray)
| |- f00a8e45-fbd5-f32f-ba59-6214392dadbf.bin (Int3Array)
| |- af3403b9-9cf9-f614-a812-022bbba41a3a.bin (Vector3Array)
| |- 27d07ff2-ee67-f2a5-9698-932a3c313985.bin (Int3Array)
| |- ...
Have you considered using HDF5? It is also a standard, open format that can contain all OMF objects and metadata. For large projects, other proposed solutions, such as JSON and SQLite, can be impractical.
If used as an interchange format, HDF5 could work. A slight gripe is that it is a much more complicated file format. You really have to use their library, which is fine considering the solution being worked on here uses libraries for ZIP file format handling.
My biggest gripe with the format, which I will admit is irrelevant for data sharing, is that deleting data from an HDF5 file doesn't shrink it. The file can only grow. Unless they have fixed that; when I looked into the format for another purpose, it lacked that ability, and the only way to shrink a file was a very expensive "copy to a new file and delete the old one" sort of solution.
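For reference, here is roughly what the proposed layout looks like in HDF5 via h5py; the group and dataset names are invented for illustration:

```python
import h5py
import numpy as np

with h5py.File("my_file.h5", "w") as f:
    # Metadata lives in attributes; arrays live in chunked, compressed datasets.
    points = f.create_group("point_set")
    points.attrs["name"] = "my points"
    points.create_dataset(
        "vertices", data=np.random.rand(1000, 3), chunks=True, compression="gzip"
    )

with h5py.File("my_file.h5", "r") as f:
    # Datasets support partial reads out of the box.
    first_hundred = f["point_set/vertices"][:100]
```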
Context
The current format (OMF-v1) starts with a 60 byte header, followed by a JSON dictionary containing all elements of the project keyed by UID strings. This JSON is followed by a series of individually zipped binary blobs. Objects can reference each other by UID, and arrays and images contain pointers to their data in the binary blob. See the documentation here: https://omf.readthedocs.io/en/latest/content/io.html#omf.fileio.OMFWriter
The current format allows you to read the JSON data from disk, without loading all data items.
Problem
The current serialization format is not human readable out of the box, and requires a custom implementation to both read and write; this increases the barrier to entry for adopting OMF. Moving away from a custom binary format would simplify reading the file and make it easier to validate in multiple languages.
Proposed Solution
Move to Base64 encoding for the data-arrays and serve the entire file structure as JSON. This would keep the general structure of the binary data as individual resources.
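A Base64 round-trip is straightforward in any language, at the cost of roughly 33% size overhead versus raw binary; a minimal Python sketch:

```python
import base64
import json

import numpy as np

vertices = np.random.rand(100, 3)

# Encode: raw bytes -> Base64 text embedded directly in the JSON document.
doc = json.dumps({
    "dtype": "<f8",
    "shape": vertices.shape,
    "data": base64.b64encode(vertices.tobytes()).decode("ascii"),
})

# Decode: Base64 text -> bytes -> array, using the stored dtype and shape.
loaded = json.loads(doc)
restored = np.frombuffer(
    base64.b64decode(loaded["data"]), dtype=loaded["dtype"]
).reshape(loaded["shape"])
assert np.array_equal(vertices, restored)
```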
Pros
Cons
Overlap with other issues

- Move color to metadata
- Remove Geometry and flatten this to be on elements
- Change __class__ to type (#42)

Example