wc-duck / datalibrary

Open Source Data Library for data serialization.

Improve json formatting to improve readability #147

Open Tisten opened 2 years ago

Tisten commented 2 years ago

When I compared dl to cap'n'proto, the most striking thing cap'n'proto did better was its json formatting. Example from cap'n'proto:

    {"ptr": {"graphComponent": {"graph": {
      "nodes": [
        {"type": "tm_tick_event", "label": "", "positionX": -537.10931396484375, "positionY": -14.238007545471191, "settings": {"ptr": {}}},
        {"type": "tm_mixer_play_wav", "label": "", "positionX": -332.38427734375, "positionY": -171.35000610351562, "settings": {"ptr": {}}},
        {"type": "tm_mixer_set_pitch", "label": "", "positionX": 679.589111328125, "positionY": -23.874832153320312, "settings": {"ptr": {}}},
        {"type": "tm_vec3_length", "label": "", "positionX": -124.12257385253906, "positionY": 122.64999389648438, "settings": {"ptr": {}}},

And the same in dl:

      }, {
        "GraphComponent" : "ptr_488"
      }, {
...
        "ptr_488" : {
          "Graph" : "ptr_496"
        },
...
        "ptr_496" : {
          "Nodes" : [
          {
              "Type" : "tm_tick_event",
              "Label" : null,
              "PositionX" : -537.109314,
              "PositionY" : -14.2380075,
              "Width" : 0,
              "Settings" : {
                "AimConstraint" : null
              }
            }, {
              "Type" : "tm_mixer_play_wav",
              "Label" : null,
              "PositionX" : -332.384277,
              "PositionY" : -171.350006,
              "Width" : 0,
              "Settings" : {
                "AimConstraint" : null
              }
            }, {
              "Type" : "tm_mixer_set_pitch",
              "Label" : null,
              "PositionX" : 679.589111,
              "PositionY" : -23.8748322,
              "Width" : 0,
              "Settings" : {
                "AimConstraint" : null
              }
            }, {
              "Type" : "tm_vec3_length",
              "Label" : null,
              "PositionX" : -124.122574,
              "PositionY" : 122.649994,
              "Width" : 0,
              "Settings" : {
                "AimConstraint" : null
              }

Tisten commented 2 years ago

Besides avoiding excess newlines, writing pointer payloads "inline" instead of at the end of the file makes it much easier for a human to read.

wc-duck commented 2 years ago

Personally I like the "excessive" newlines and I find that easier to read. I however see your point on the pointers! How do you represent a pointer placed in line if it is pointed to more than once? Is the "ptr" : {} element some kind of marker and can have an ID? And in the Cap'n proto data I don't see other pointer-references, just a list?

Tisten commented 2 years ago

Cap'n'proto flattens (i.e. removes) all pointers except AnyPointers (unions) when going to json, so circular references don't work at all and all shared data gets duplicated. To keep the structural integrity when data is referenced from multiple places, pointers still need to be identifiable, i.e. have a unique name or tag. The inlined data could then be written either in all of those places or in just one, e.g. where the first reference to the data is. If the data is flattened and written in all places, then https://github.com/wc-duck/datalibrary/issues/14 could solve deduplicating it. And even if you went the "flatten everything" route, you would still need to abort on cyclic references and have an identifier to refer to.
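To make the "write the data inline at the first reference, refer to it by name afterwards" idea concrete, here is a minimal Python sketch. It is not dl's or cap'n'proto's actual output format: the `__id` and `__ref` keys and the `inline_pointers` function are hypothetical names chosen for illustration, and only dicts are treated as pointer targets.

```python
def inline_pointers(root):
    """Return a JSON-ready copy of `root` in which every dict that is
    referenced more than once is written inline at its first reference
    and replaced by a named reference everywhere else."""
    # Pass 1: count references to each dict object (by identity).
    counts = {}

    def count(node):
        if isinstance(node, dict):
            counts[id(node)] = counts.get(id(node), 0) + 1
            if counts[id(node)] > 1:
                return  # already visited; this also stops cycles
            for value in node.values():
                count(value)
        elif isinstance(node, list):
            for value in node:
                count(value)

    count(root)

    # Pass 2: emit. Names are assigned when a shared dict is first written.
    names = {}  # id(obj) -> "ptr_N" (hypothetical naming scheme)

    def emit(node):
        if isinstance(node, dict):
            key = id(node)
            if key in names:
                return {"__ref": names[key]}  # later reference: name only
            out = {}
            if counts[key] > 1:
                # Register the name before recursing so that a cycle back
                # to this object becomes a reference instead of recursing.
                names[key] = "ptr_%d" % (len(names) + 1)
                out["__id"] = names[key]
            for k, v in node.items():
                out[k] = emit(v)
            return out
        if isinstance(node, list):
            return [emit(v) for v in node]
        return node

    return emit(root)
```

Data that is referenced only once gets no tag at all, so the common case stays as readable as the cap'n'proto output, while shared or cyclic data keeps its identity.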

I guess the same thing is true for arrays, but since they are already written without a unique name I guess that dl already flattens them even if they refer to the same pointer?

My two main points about the newlines are:

  1. I can often read the data of a whole game object on one screen, while in dl the graph object I looked at here took 6 screens instead of 2/3 of a screen (43 nodes took 387 lines) and thus required a lot of scrolling and mental load to memorize things. I tried making the font smaller but can still only fit 15 nodes (135 lines) before the text is unreadable. That said, it would help if the data of members were aligned, and I like your 32 bit float representation better.
  2. When inlining pointers (and arrays, which are already inlined) the indentation can become huge, i.e. an indentation tower which Eiffel would be envious of.

It would be awesome if json formatting could be driven by formatting rules similar to "clang-format", so each user can choose their own style. The more I think about it, the more I think that reformatting the json is something that can be done after DL has created the json, i.e. by piping the data to another tool. DL could then avoid writing any whitespace and let the formatting tool add all of it. It would be slower, but if the data could be piped in chunks then formatting could mostly be done in parallel with DL's json generation, so a GB file would not require twice the time.

wc-duck commented 2 years ago

Yes, I wouldn't mind member-data alignment either, if what you mean by that is:

    {
        "member_1" : 1234,
        "short"    : 3456
    }

wc-duck commented 2 years ago

Also, I think vectors of numbers are single-line, right? If they are not, I think they should be.

wc-duck commented 2 years ago

But as you say... formatting is highly, highly personal, so being able to pipe it via some kind of formatter might be the best solution. However, the current API does not support streaming output, and I think it would require quite a bit of new API that would probably "break" the current API structure.

But an "unformatted" json output, would that just be no newlines at all, basically just a big long single line?

Tisten commented 2 years ago

Yes, arrays of primitives and pointers are always single line, even when they are epic in length.

And yes, you understood the data-alignment correctly.
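As a rough illustration of that alignment rule, here is a minimal Python sketch that pads every key to the width of the longest one so the colons line up. The `format_aligned` name is made up for this example, and it only handles one flat object; nested values are simply written on a single line.

```python
import json


def format_aligned(obj, indent=4):
    """Format a flat dict as JSON with all colons in the same column."""
    # Width of the longest key, measured including its quotes.
    width = max((len(json.dumps(key)) for key in obj), default=0)
    pad = " " * indent
    lines = [
        "%s%s : %s" % (pad, json.dumps(key).ljust(width), json.dumps(value))
        for key, value in obj.items()
    ]
    return "{\n" + ",\n".join(lines) + "\n}"


print(format_aligned({"member_1": 1234, "short": 3456}))
```

The output is still valid json, since only whitespace between tokens changes.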

In my mind the unformatted style has no whitespace or newlines at all: it has the smallest memory footprint to start reformatting from, and there is no need to strip whitespace before adding new.

The implementation cap'n'proto uses to keep the formatting simple is a "string tree", where all elements are leaves in the tree, parented by the lists and objects owning them. Each branch can provide the summed length of all its children, making it trivial to know which lists are appropriate to keep on one line, and which elements to insert newlines and indentation between. It makes it easy to insert sub-strings into the tree while building it and can also reduce the memory footprint, since identical strings can be reused instead of duplicated.

Unfortunately it is terribly modern code, very big interfaces, very few lines in implementation and utterly impossible to understand by reading. Source here: https://github.com/capnproto/capnproto/blob/3b2e368cecc4b1419b40c5970d74a7a342224fac/c%2B%2B/src/kj/string-tree.h#L69 https://github.com/capnproto/capnproto/blob/3b2e368cecc4b1419b40c5970d74a7a342224fac/c%2B%2B/src/capnp/stringify.c%2B%2B#L57
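The string-tree idea described above can be sketched in a few lines of Python. This is an illustration under my own assumptions, not cap'n'proto's `kj::StringTree` API: the `Node`, `build`, `flat` and `render` names are hypothetical, the width check ignores a trailing comma, and the sketch does not reuse identical strings.

```python
import json


class Node:
    """A leaf (text only) or a branch (brackets plus children).
    `flat_len` is the length of the subtree when written on one line."""

    def __init__(self, text="", brackets=None, children=()):
        self.text = text            # leaf text, or a '"key": ' prefix
        self.brackets = brackets    # None for leaves, else ("[", "]") or ("{", "}")
        self.children = list(children)
        self.flat_len = len(text)
        if brackets:
            separators = 2 * max(len(self.children) - 1, 0)  # ", " between children
            self.flat_len += 2 + separators + sum(c.flat_len for c in self.children)


def build(value, prefix=""):
    """Turn parsed JSON into a tree; summed lengths are computed bottom-up."""
    if isinstance(value, dict):
        kids = [build(v, json.dumps(k) + ": ") for k, v in value.items()]
        return Node(prefix, ("{", "}"), kids)
    if isinstance(value, list):
        return Node(prefix, ("[", "]"), [build(v) for v in value])
    return Node(prefix + json.dumps(value))


def flat(node):
    """Write a subtree on a single line."""
    if node.brackets is None:
        return node.text
    open_ch, close_ch = node.brackets
    return node.text + open_ch + ", ".join(flat(c) for c in node.children) + close_ch


def render(node, width=80, indent=0):
    """Keep a subtree on one line if it fits in `width`, else split it."""
    pad = "  " * indent
    if node.brackets is None or not node.children or len(pad) + node.flat_len <= width:
        return pad + flat(node)
    open_ch, close_ch = node.brackets
    inner = ",\n".join(render(c, width, indent + 1) for c in node.children)
    return pad + node.text + open_ch + "\n" + inner + "\n" + pad + close_ch


data = {"nodes": [{"type": "a", "x": 1}, {"type": "b", "x": 2}]}
print(render(build(data), width=30))
```

Because every branch already knows its single-line length, the decision to split is a constant-time comparison per node, and the `width` parameter is the kind of knob a "clang-format"-style rule set could expose.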