fwkoch commented 5 years ago

Context

The current format OMF-v1 starts with a 60 byte header, followed by a JSON dictionary containing all elements of the project keyed by UID strings. This JSON is followed by a series of individually zipped binary blobs. Objects can reference each other by UID, and arrays and images contain pointers to their data in the binary blob.

See the documentation here: https://omf.readthedocs.io/en/latest/content/io.html#omf.fileio.OMFWriter

The current format allows you to read the JSON data from disk, without loading all data items.

Problem

The current serialization format is not human readable out of the box, and requires a custom implementation to both read and write; this increases the barrier to entry for adopting OMF. Moving away from a custom binary format would simplify reading the file and make it easier to validate in multiple languages.

Proposed Solution

Move to Base64 encoding for the data-arrays and serve the entire file structure as JSON. This would keep the general structure of the binary data as individual resources.

Pros

JSON is a standard file format and should not need too much overhead from other libraries to import.
Human readable out of the box
Easier to validate

Cons

It is more difficult to load individual pieces of the project (e.g. just the points) without loading the entire JSON into memory.

Overlap with other issues

Metadata storage #20, e.g. moving color to metadata
Geometry removal #37, e.g. removing the ****Geometry and flattening this to be on elements.
Origin to metadata #21 to be units/geolocation.
Rename __class__ to type #42

Example

{
    "version": "OMF-v2.0.0",
    "project": "a8640832-2fa1-4aac-9618-fd032e970cbc",
    "resources": {
        "a8640832-2fa1-4aac-9618-fd032e970cbc": {
            "type": "Project",
            "author": "",
            "description": "Just some point data.",
            "elements": [
                "9e79411b-1e52-4c75-9091-32e30e249b3b"
            ],
            "name": "Test project"
        },
        "9e79411b-1e52-4c75-9091-32e30e249b3b": {
            "type": "PointSetElement",
            "name": "Random Points",
            "color": [0, 128, 0],
            "data": [
                "cb1f0f12-c218-4944-a4d0-74a942514eca"
            ],
            "description": "Just random points",
            "geometry": "87d08023-b5e1-4db2-b8a6-598bc84cedff"
        },
        "cb1f0f12-c218-4944-a4d0-74a942514eca": {
            "type": "ScalarData",
            "array": "b034fbac-b95d-4816-8d45-dfd61e5a1d57",
            "description": "",
            "location": "vertices",
            "name": "Random Data"
        },
        "b034fbac-b95d-4816-8d45-dfd61e5a1d57": {
            "type": "ScalarArray",
            "dtype": "<f8",
            "size": 100,
            "array": "MKQWT6cC5T+m8n6oa+rvP7Bu/GlLrck/B1s9sBs94j9mbEwv5FXvP/Ze0TDgBtM/SaYMbMSP6z9o\nkt5kW6rvP3iD3CNyj7s/0L96ShoCzT+XA1kYzw7nP4TNOppi+eo/Mjhf18pX2D9oWxrCNFTnP1Sn\nJVeC2tU/sJtsVjVqyD8Eo2UoeArVPylj2XaSous/LAvvrmbTwz99OvkQ6G7lP4BTdnchgpM/YeB7\nBqkS6j9sivqPOU3SPxoE//+9EdI/W/v2DZ2/5j+IaSAqAw7iPyl3/8oXMOQ/sCPPh2Lr5j+qsune\nROTeP5QPOyiYmdw/CFwHithBxD/ga2NNfmuvP6SPpq1ZUMo/pRmyQ41i7D/0VbdFJp3lP05VOUq6\nTdg/Nx/lKgg16j++SDM/vIDSPxLxNmcP8+Y/eG+nU/a/tj/zBkbF6UvhP2TCNL88iNk/G0PicHGA\n6j+3soAr45XpP14eSLcjDu8/IrvYdkgH3z+S6JYgAlbVPwgJAEynVb4/3AcbIMaO2T/IGyPmeqLh\nPyGrD9kpZew//nE9YU2p2T/h7Bcs3x/mP3fVIRSfNec/vJD7hyIm7z+gcsfV01m7P09bh0yKzeQ/\nEm/FUW275j+QsCpswOW1P037ilklMus/vHb4Gx4C0j8YGoyre63OPzz1Qh1mnOY/nEfGZ9LQ0T92\nBxTFJHXmPyCteWGOUc0/WDXdoXBAxD/sXwYdWMbFP1k2rE0oluY/AMb4v2J8Uj8Jv5ybny3vP8De\n2MCokp4/6PdXB+hH4z/MCSlvmb3vPx0ACBUnEuc/u1W26GW+6D+ChJK1uw7XPyCN2mMd3pU/8lpW\nxl+R7j+cA7mN0KDGP4Iccz03ydg/kP6SW8EJuz9EXliqXzbqPwKRAvlgAeQ/MOfJeIr8yD/H2Zql\ntyjrP8BtHxpXlbo/oJvq5aN4yj8/4N5MGT3sP5B0CUFwpeU/nmEgUSu40z8QFcy3bIO4PzhwoCD+\nbcE/0F2upMOwxj++w1z5HffcP80M4Xd0BOw/dCRFzDWS4D/mbnkwt2rvP1f6w/J6iuY/cLlXG1/a\n5D8="
        },
        "87d08023-b5e1-4db2-b8a6-598bc84cedff": {
            "type": "PointSetGeometry",
            "vertices": "424b3ed4-2fa2-44eb-a49e-6b33035c0c09"
        },
        "424b3ed4-2fa2-44eb-a49e-6b33035c0c09": {
            "type": "Vector3Array",
            "size": 300,
            "array": "wujkQgCK7D8XQKZLuT3sP3POE2Y1Eeg/XOv0WTQr1j8xKbL8y3/gP4AbhEil3bI/mP6Pi3S10j88\nWTb+BGHcP90eBkpCoec/C/WE77jp5z8CCKYxBubWP8ANOGLdrtE/rlG8axHY3D9wl5HMuL3BPyBs\nRr8SA+E/IPT9/LzwtT8yMBjQWbzUPznqeaokouk/ZmLWHN5M4T+eij0fogrUP8ZVNQvNVu4/Pvn1\ndPln3j8AZWRDpODoPzxO99Jbb8U/feuEVWcu5T/vxqCncODrP65381ySDOw/cTU5OpwF5j/4LWpp\nGjbcP9lul+fIjOw/aQP/n+X77j/TTDLUvnjvP9xR/9gT4Ok/OGX/MILLsz+mIRzG+/HYP7tPpTxG\n9uE/C8AmRUj07T+GUwq7hMPuP/r79tQiYOg/ghVLMQeF3z8SxMpLXUroPywF7FQJKuw/iMK4Ou3d\nxj//iqOThennP7pvxSIDlNg/gn7fHnGQ7D/e3VwBYi3kP36h4UH1U9M/dPrneI0kzj/GlAPIq2vS\nP9qNwrVsz90/Kp0Zs4q24j+nS+hAAbnlP8hf78m/YMw/YIT2Fc894j+bkKyUcrDoP8DwcJFSP4s/\n+AgcLa5tvz8OFoDLq6vSP2ZSAqXaCO4/rDTgcGEJ6j+0Ta+C6dTJP8D3ChhjfbE/wECt0B5vtz+i\nn9RS9iTaP+yxxKIWH8c/5K1VsRZixD+442LpCWTHP0hqwvjX48k/wviRd8C30z+jq5R1QGThP7aV\nxm2VM+A/4ylDusCD4D9ezTHDywjjPzC6lPJAEMQ/x2Jb1BVX4D/aYltMkV3kP4StAv1fj8s/Ym6P\nklcD1T+Qw+Tz5VTTP/69+ONdTtE/Xp4f3QrG4z/4H3z15OHuP+7OuuhN1u8/fmtZgsDF2D950ZpP\n6y7pPwARxKx3b4s/Aj6NByWp3T/b8ChSIKvsP6ADpMUkPtA/COI/GMt+sj8MulfkUQDJP9X3kGtZ\nOuw/0nnE4P1w6D+Nvdv0KJTkPzhUHomK8+s/ixQWFLYm4j8nwV6VlX3gP0C95o1JSKw/qCw8YVZN\nvz9wsaxqlWrbP/ivRZFtEcs/Qk1p5NFk0T9w3cAfijvkP928cpk7U+0/0qiDrHPE0z8wK4OFaODN\nPxYP4QIMvuM/fBOsqp8CxT+I8QTZbTLIP+WLShOtUuU/u676DvOq5T+YMS6RUmjUP8y0+5IEO80/\nRsiNV5v/5j9fU4JiwOzsPyyJTAtgDsA/8v3jy3eQ2D+UTTwl9g/uPzmOuQx0yuw/bhGkXd2C2j9m\nnjIV40PRP0pUgx4bd98/1MGVqoqt3T+Y8LnSLmfGPyw3UsQq4eM/zgfvGK+j3T9Q0evnDYGwP9t1\nVtxa0uk/2gdc5FRM1D+swZxVdGTDP+bgeyEww+g/ja8Du4i06D/A/KlP+tHtP5DnZ18OXsU/zKKr\nGUz91T9YybQCd8jaPyswbc66fu8/qbOUlHeu6T8wkOIpTojCP5nvftvkj+s/Mo/CJ7c/0T9g9SAp\niJ2SP8LtJtuu9eI/qvR0KC3d3j9gL3SlipqnPyvMoHu2Les/I88AyFQK7j/gcYCAn+eqP5aBApxV\n4OQ/tDTmsrad0D/Ln4sQKQnvPzRRo20CUt0/7OmJ8sr3yj+nJf/hVD/nP+9Fe8oDEew/EJd9npDW\n4T9GFlwPINjRP8BFXjlsGok/W8vIlZr96j8ujIzA7kvRP+ZRXXURftA/PkEX3yPd3z90zOOTVvLH\nPyUUnjYgz+E/qq/3xHs56j8IhiKRNG7GPzieWidPQt4/jJErgnQY2T8A4DisA8wYP99Kyk2MNeM/\nONso5Yoh5D9y1OixZN/sP8kE4r6QEOY/LOpBv68bzT/IuqJpK9HOPwx9VOFyCek/Yinj3mPI3z+b\nWLUs7rrgP5BTc7DzMtg/BOSQmZ7T
wj+vKWbjNNnpP/YUIaX9rtw/wJpGL8vHjT99tbFkQtzsP3Qv\n1f2abss/6GkK1Ye07z90DZYt+YvAPyBMoaEzU8Q/0cAmGmnk4D8AR8BENuGpP/QOwZciqME/RinO\n3ezd3T/k3IrZhfvtP+TcG0TVbeA/a2rDV8ic6T+ge71i+z+4PynzjB93nuY/KD3/2XANzz+ABDqR\n7HSzP+ZtMa3Tje8/8gh4fr/23T8QaX6vv1uvP8y58pzjx8M/7J3KtL2W7z+kYSL5PEHZP98Yiweb\nw+k/RnmnBqEk1D+fU1obRSXuP9lRfpM98+A/FL8PL6M54j+YcZth84fcP4yPuvcpZNA/4gG1pboY\n1j8Y4Lcl5VqyP7RW2YVzX8w/gPQTu0H1pz9kzL7FsqrMP5oGn5FeuN8/uB+GgtAU5j9Anj2680yh\nP3Aqv4vBz80/fSFOEDp05D/M7cznzIPQPziHx8BO68c/UIyGqPak4D8ICtxY/97AP7uCc4s2ZOU/\nk7pbJKkk4j/LD+HlA6DhPz6UwtiO4+I/mKry5ld6xz/Y5Fq+0ji9Pz7judoCo9Q/8whGr8Zu5j98\nbusCyKzSP0WZffCTWeo/NNZZaEos5T9l1R+PjyPrP7hf0lbKtbo/7CAvXv6IxT/Q7h22nAfXP4E7\n+cN0Kek/jnCspLIS6z+Pqb/o0JThP2yXuI9tI+E/4hQ/WMv41T/pvSwCW83vP5yUt+4ISdo/zZoi\nLUur4z/tUGCxz3fkP66qXhmG6eQ/Fqni40BW3j/jNtT3/DrpPwC+IOR7iqg/6wHHQPlP7z9vi6sE\nQOfpP2DVnati8KY/D9Klw05v6D8YgGr7L/jWP4J8wsg31+w/AHnWV8O+6D+elzUAHAzsP6OWEZKP\nZOk/uViV75NA4z+DIp+uaofkP2agSU+o8t4/Es3C/IX74j8YtnVO5bPfP/uIAFAopOE/OHExodCR\n1D8c1tIcYPvcP7wunb9Wmck/mPpsAjjUsz+QeT0klQWvP1jx4SKt68M/Yo5/88Wz7T84JG5/QDfn\nP8PAQ02i8Os/UluTWp7C3T+YkSh0GazfP/t2BUwPLus/BnUeqVa+2z8z4FKtTlTgP6XL9RhOMuo/\nZIfpIijO4D8YWKgsbcrRPyt8Or+2f+Y/Wt6JEWJN4T9aLUooarTRP9BHdQwniKI/ROI5R0wMxj+n\ns9C1sSTiP9SEZ3Xyl8U/vlVyMqti4z/uc3N/bTzfP9R6J0xy3+8/o/aL+0Q/7z8uhLK9bb/uP5Eo\nUcHg8Oo/"
        }
    }
}
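
A minimal sketch of how a reader might round-trip one of the inline arrays in the example above. It assumes that the `dtype` string follows NumPy conventions (`<f8` meaning little-endian 64-bit float) and that `size` counts individual values; neither is confirmed by the proposal. The helper names `encode_scalar_array` and `decode_scalar_array` are hypothetical and are not part of omf.

```python
import base64
import struct


def encode_scalar_array(values):
    """Pack floats as little-endian float64 and wrap them in a resource dict."""
    raw = struct.pack("<%dd" % len(values), *values)
    return {
        "type": "ScalarArray",
        "dtype": "<f8",
        "size": len(values),
        "array": base64.b64encode(raw).decode("ascii"),
    }


def decode_scalar_array(resource):
    """Decode the base64 payload back into a list of floats."""
    if resource["dtype"] != "<f8":
        raise ValueError("this sketch only handles '<f8'")
    # b64decode skips characters outside the base64 alphabet by default,
    # so newlines embedded in the string (as in the example) are harmless.
    raw = base64.b64decode(resource["array"])
    if len(raw) != 8 * resource["size"]:
        raise ValueError("payload length does not match 'size'")
    return list(struct.unpack("<%dd" % resource["size"], raw))


resource = encode_scalar_array([0.25, 0.5, 0.75])
decoded = decode_scalar_array(resource)
```

Base64 emits four characters for every three bytes, so inline arrays take roughly a third more space than the raw binary before any compression.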

fwkoch commented 5 years ago

Alternative Solution: Standard zip format

Response to pros/cons

Writing as pure JSON is impractical for large file sizes. Regarding the "pros" above -

human-readable - many text editors cannot open large (~ GBs) files. Some tools can (like less), but easy access for all users is unrealistic.
easy validation - in-line base64 binary does not help with this. Binary blobs are relatively opaque; only easy access to the metadata is useful for validation.
standard format - This is valid and solved by pure JSON but could also be solved by other standard formats.

Regarding the "cons" -

The requirement to load the entire file into memory is a significant issue. At best, this means all operations are slower, since every time we need to read or write individual objects, we must load the file. This is a regression from v1 of OMF which allows reading only lightweight metadata then stepping directly into the binary data only when required. At worst, large objects (e.g. block models with many attributes) become unsupported simply because of the chosen file format. The implications of this "con" should not be trivialized.

Proposed Solution: Zip archive

Store JSON metadata and binary as separate files within a standard, zipped folder structure. The header/metadata would look similar to the metadata above, but instead of base64 inline, binary is specified by paths to files. Something like

my_file.omf
  |- metadata.json
  |- pointset_vertices_array.bin
  |- surface_data/
  |    |- vertices.bin
  |    |- triangles.bin
  |    |- data.bin
  |    |- image.png
  |- another_folder/
       |- possibly_more_metadata.json
       |- other_binary_file.bin
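
A rough sketch of the layout above using Python's standard `zipfile` module. The metadata keys (`vertices`, `dtype`, `path`) and the helper names are assumptions made for illustration; the proposal does not fix a schema. An in-memory buffer stands in for `my_file.omf`.

```python
import io
import json
import struct
import zipfile


def write_archive(fileobj, metadata, blobs):
    """Write metadata.json plus one archive member per binary blob."""
    with zipfile.ZipFile(fileobj, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("metadata.json", json.dumps(metadata))
        for path, raw in blobs.items():
            zf.writestr(path, raw)


def read_metadata(fileobj):
    """Read only metadata.json; other members are not decompressed."""
    with zipfile.ZipFile(fileobj) as zf:
        return json.loads(zf.read("metadata.json"))


def read_blob(fileobj, path):
    """Decompress a single member on demand."""
    with zipfile.ZipFile(fileobj) as zf:
        return zf.read(path)


# Two 3D vertices stored as little-endian float64 (hypothetical layout).
vertices = struct.pack("<6d", 0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
metadata = {"vertices": {"dtype": "<f8", "path": "surface_data/vertices.bin"}}

buffer = io.BytesIO()  # stands in for my_file.omf on disk
write_archive(buffer, metadata, {"surface_data/vertices.bin": vertices})

loaded = read_metadata(buffer)
raw = read_blob(buffer, loaded["vertices"]["path"])
```

Because the zip central directory records where each member starts, `read_metadata` does not need to decompress the binary members.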

Pros

Standard format, easily explored by all users
- For zipping, there are several options, but all are open and well supported (zip, 7zip, tar.gz). By changing the file extension from .omf to .zip standard OS file browsers can open and view files.
- Then, all the included files are also standard. Metadata is pure JSON (but in much smaller, manageable files with no binary). Other binary data can be either known format (e.g. .png) or simple binary array with type info still stored in metadata.
Easy to use without reading everything into memory. Files can simply be added to the folder structure, then only metadata needs editing.
Still looks like a single, simple .omf file to users who are only concerned with the full file, not internal data.

Cons

Including unrelated, unwanted binary becomes very easy - just unzip, add the binary, re-zip. However, in theory, unwanted binary could be included in the other formats as part of the binary blobs. In fact, by exposing a file system, exactly what is saved in the file becomes more explicit, so possibly this is an advantage?

rowanc1 commented 5 years ago

The Zip archive was decided on as the new standard at the GMG workshop.

TroyWilliams3687 commented 5 years ago

Has any thought been put into using something like SQLite instead of a zip archive:

The interesting thing about using an SQLite database and storing BLOBs in fields in tables is that you can also attach metadata to the blobs in adjacent columns in the same rows. Entities like triangular meshes could have two tables, points and triangles and be stored directly.

Disadvantages would be extra dependencies on the sqlite libraries and also changing database schemas as the format evolves.
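
To make the idea concrete, here is a small sketch using Python's built-in `sqlite3` module. The table name `arrays` and its columns are invented for illustration; no schema has been proposed in this thread.

```python
import sqlite3
import struct

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE arrays ("
    " uid TEXT PRIMARY KEY,"
    " type TEXT NOT NULL,"
    " dtype TEXT NOT NULL,"
    " size INTEGER NOT NULL,"
    " data BLOB NOT NULL)"
)

uid = "b034fbac-b95d-4816-8d45-dfd61e5a1d57"
raw = struct.pack("<3d", 1.0, 2.0, 3.0)
conn.execute(
    "INSERT INTO arrays VALUES (?, ?, ?, ?, ?)",
    (uid, "ScalarArray", "<f8", 3, raw),
)

# Metadata lives in adjacent columns and can be queried
# without selecting the BLOB column.
meta = conn.execute("SELECT type, dtype, size FROM arrays").fetchall()

# The binary payload is fetched only when asked for.
(blob,) = conn.execute(
    "SELECT data FROM arrays WHERE uid = ?", (uid,)
).fetchone()
values = struct.unpack("<3d", blob)
```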

dbrookes96 commented 5 years ago

I suggest only allowing one JSON document per archive. An argument could be made for splitting up the JSON into multiple files, but I think this may just add complexity for little benefit, as long as project.json is kept fairly small and quick to parse.

Another suggestion for the layout of the zip archive structure using UUIDs.

my_file.omf
|- project.json
|- point_set/
|    |- 0f52a9b6-d641-ffea-99af-d2e840ca3187.bin  (Vector3Array)
|    |- some_data/
|    |    |- 0d4b91f1-a487-fa08-8838-047501d3764f.bin  (ScalarArray)
|    |    |- f00a8e45-fbd5-f32f-ba59-6214392dadbf.bin  (Int3Array)
|- stuff/
|    |- af3403b9-9cf9-f614-a812-022bbba41a3a.bin  (Vector3Array)
|    |- 27d07ff2-ee67-f2a5-9698-932a3c313985.bin  (Int3Array)
|- ...

The idea behind this layout is that only the stem component of each file name (the UUID) is used to identify the object. This means that the "folders" (remember that zip folders aren't actually real, they're just a '/' in the compressed file name) are only used for organisation purposes and don't actually matter. It would therefore be equally valid to store the same files in a flat structure.

my_file.omf
|- project.json
|- 0f52a9b6-d641-ffea-99af-d2e840ca3187.bin  (Vector3Array)
|- 0d4b91f1-a487-fa08-8838-047501d3764f.bin  (ScalarArray)
|- f00a8e45-fbd5-f32f-ba59-6214392dadbf.bin  (Int3Array)
|- af3403b9-9cf9-f614-a812-022bbba41a3a.bin  (Vector3Array)
|- 27d07ff2-ee67-f2a5-9698-932a3c313985.bin  (Int3Array)
|- ...

The downside here is that it's less evident what things actually are. Should we prioritise manual editability? Maybe a terminal app to read, edit and write OMF files would be more useful. The layout of project.json would be similar to what is currently written to the end of an OMF file, except that the binary types won't have a start offset and length. We can retrieve information such as compressed and uncompressed size from the Zip archive (it could be stored in the JSON too, I suppose).
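
A sketch of the stem-based lookup described above, assuming every binary member ends in `.bin` and its stem is the UUID. The function `index_by_uuid` and the helper `build` are hypothetical; `build` only creates tiny test archives in memory.

```python
import io
import posixpath
import zipfile


def index_by_uuid(zf):
    """Map each member's file-name stem to its full path inside the archive."""
    index = {}
    for name in zf.namelist():
        if name.endswith("/"):
            continue  # explicit directory entries carry no data
        stem, ext = posixpath.splitext(posixpath.basename(name))
        if ext != ".bin":
            continue
        if stem in index:
            raise ValueError("duplicate UUID in archive: " + stem)
        index[stem] = name
    return index


def build(paths):
    """Create an in-memory archive with project.json and the given members."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("project.json", "{}")
        for path in paths:
            zf.writestr(path, b"\x00")
    return buffer


uid = "0f52a9b6-d641-ffea-99af-d2e840ca3187"
nested = build(["point_set/" + uid + ".bin"])
flat = build([uid + ".bin"])

with zipfile.ZipFile(nested) as zf:
    nested_index = index_by_uuid(zf)
with zipfile.ZipFile(flat) as zf:
    flat_index = index_by_uuid(zf)
```

Both archives resolve the same UUID, which is the sense in which the folders "don't actually matter" to a reader.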

As for the suggestion for SQLite I believe Zip is better. One of the negatives of using SQLite blobs is that afaik they have a max size of 2GB.

fwkoch commented 5 years ago

Thanks for this @dbrookes96 - I pretty much agree on all your points:

uuids ✅
single project json ✅ (Note this would be backwards compatible if we ever allow multiple json documents in the future)
flat folder structure by default ✅ (But still allowing for folders/paths)
zip > SQLite ✅

Regarding your point around storing compressed / uncompressed size - probably worth storing this in the JSON explicitly, for validation of attribute vs. element geometry, independent of deeper reading of the zip archive?
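
For illustration, a sketch of both options: reading the sizes that the zip central directory already records, and checking them against counts declared in the JSON. The `arrays`, `path` and `size` keys are assumptions, as is the fixed 8-byte item size; none of this is a settled schema.

```python
import io
import json
import zipfile


def size_table(zf):
    """Collect the sizes the zip central directory already records."""
    return {
        info.filename: {
            "compressed": info.compress_size,
            "uncompressed": info.file_size,
        }
        for info in zf.infolist()
        if info.filename.endswith(".bin")
    }


def check_declared_sizes(project, zf, itemsize=8):
    """Compare JSON-declared element counts with the stored byte counts."""
    sizes = size_table(zf)
    return all(
        sizes[entry["path"]]["uncompressed"] == entry["size"] * itemsize
        for entry in project["arrays"]
    )


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data.bin", bytes(800))  # 100 float64 values, all zero
    zf.writestr("project.json", json.dumps({}))

project = {"arrays": [{"path": "data.bin", "size": 100}]}
with zipfile.ZipFile(buffer) as zf:
    sizes = size_table(zf)
    ok = check_declared_sizes(project, zf)
```

Keeping `size` in the JSON lets a validator compare, say, an attribute's length with its element's vertex count from the metadata alone, while a check like the one above can confirm that the binary members agree.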

rpakdel commented 5 years ago

I wanted to say I also agree that SQLite is a better option. However, I think we should also move away from specific implementations and consider the requirements.

Must be able to efficiently store large amounts of data, most likely in binary format.
Must have machine- and human-readable metadata.
Must be able to read and write non-binary data partially.

SQLite with binary BLOBs certainly fits these requirements.

As for the binary data, Base64 works. Have you looked into Protobuffers? We've had good experience with them and found them to be the fastest and easiest way to serialize and deserialize objects while maintaining forward and backward compatibility.

rpakdel commented 5 years ago

@fwkoch, one more note that might impact the format. We generally recommend reading the block model level by level where a level is most likely a bench with the thinking that usually one needs a bench of data + a limited number of benches above or below. Rarely does one need to read the entire block model into memory.

With this line of thinking, perhaps the format should allow quick access to a level of the block model without the need to read the previous levels.
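
One way this could look, as a sketch only: if block values were stored as a flat `<f8` array with the vertical index varying slowest (an assumed layout, not something OMF specifies here), a reader could seek straight to a level. The function `read_level` is hypothetical.

```python
import io
import struct

ITEMSIZE = 8  # bytes per '<f8' value


def read_level(fileobj, nx, ny, level):
    """Read one horizontal level of nx * ny values without reading earlier ones."""
    count = nx * ny
    fileobj.seek(level * count * ITEMSIZE)
    raw = fileobj.read(count * ITEMSIZE)
    if len(raw) != count * ITEMSIZE:
        raise ValueError("level %d is outside the stored model" % level)
    return struct.unpack("<%dd" % count, raw)


# A tiny 2 x 3 x 4 model whose values are simply 0.0, 1.0, 2.0, ...
nx, ny, nz = 2, 3, 4
values = [float(i) for i in range(nx * ny * nz)]
blob = io.BytesIO(struct.pack("<%dd" % len(values), *values))

bench = read_level(blob, nx, ny, level=2)
```

Seeking like this is only cheap when the bytes are stored uncompressed. With a deflate-compressed member, a reader would generally have to decompress everything before the requested level, so per-level chunks might be needed instead.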

hishnash commented 5 years ago

I must agree with @dbrookes96

Performance of SQLite will degrade rapidly if your table rows get large (and are not of a fixed size). For this reason, storing binary blob data in SQLite rows is strongly discouraged. Most apps that use SQLite do this with the SQL db inside a container (zip or otherwise) and have the binary data sitting next to the SQLite db, with a reference from the db to the data.

Zekaric commented 5 years ago

One thing I find is being lost on a few people.

This is an "Interchange File Format."

Meaning, moving data from one software system to another. In which case you are usually exporting practically everything from one system and sucking everything into the other.

Random access, in my opinion, is not strictly necessary.

Cherry picking from the file, in my opinion, is not strictly necessary. If you want to cherry pick, just skip over the bits you do not care about. But you read the entire file, hopefully, in one pass.

Interchange files should be very temporary. They should only be used to migrate data and that is it. After that has been done, you delete the file. It does not, in my opinion, need edit capabilities. It does not 'need' to be fast (but no one will complain if it is). It does need to be 'easy' to move data between the systems, and that should be paramount.

rpakdel commented 5 years ago

I agree with @Zekaric. As an interchange format that is intended for import and export the requirements would be different. The Zip/JSON format makes sense in this case.

martinken commented 5 years ago

Love the ideas here, especially something in line with @dbrookes96's suggestion, zlib à la

my_file.omf
|- project.json
|- data
    |- 0f52a9b6-d641-ffea-99af-d2e840ca3187.bin  (Vector3Array)
    |- 0d4b91f1-a487-fa08-8838-047501d3764f.bin  (ScalarArray)
    |- f00a8e45-fbd5-f32f-ba59-6214392dadbf.bin  (Int3Array)
    |- af3403b9-9cf9-f614-a812-022bbba41a3a.bin  (Vector3Array)
    |- 27d07ff2-ee67-f2a5-9698-932a3c313985.bin  (Int3Array)
    |- ...

exepulveda commented 3 years ago

Have you considered using HDF5? It is also a standard, open format that can contain all OMF objects and metadata. For large projects, other proposed solutions, such as JSON and SQLite, can be impractical.

Zekaric commented 3 years ago

If used as an Interchange format HDF5 could work. Slight gripe is that it is a much more complicated file. You really have to use their library which is fine considering the solution being worked on here is using libraries for ZIP file format handling.

My biggest gripe with the format, which I will admit is irrelevant for data sharing, is that deleting data from an HDF5 file doesn't shrink it. The file can only grow. Unless they fixed that but when I looked into the file for some other purpose it lacked that ability and the only way to shrink a file was with a very expensive "copy to a new file and delete the old one" sort of solution.

gmggroup / omf-python

Change file to standard format (e.g. Zip, JSON) #36
