ibireme / yyjson

The fastest JSON library in C
https://ibireme.github.io/yyjson/doc/doxygen/html/
MIT License

Discussion / Feedback request: Incorporating bytearray type #160

Closed. dg-pb closed this 3 months ago

dg-pb commented 4 months ago

Hi all,

This probably has a wider scope than the yyjson library, but I have chosen it as my starting point because it is where I hit the wall, and I am trying to test whether that wall can be broken through. One thing is clear: the time it would take me to do this alone is much longer than I can spend.

First let me explain what I am trying to achieve. And why.

So I have been working on a new serialisation infrastructure in Python and have put fairly ambitious constraints on it:
a) Speed on par with the fastest serialisers (pickle / protocol buffers)
b) Readable
c) Flexible: a superset of something as simple as JSON, but also able to serialise complex data structures; a substitute for pickle
d) As simple as possible, but not simpler

So let me give an example. Say I have an object:

import numpy as np

d = {
    'a': 1,
    'b': np.array([0, 1])
}

In this case, JSON can have an extra layer that converts the array to a list on serialisation and reconstructs the array on deserialisation. The problem is that this becomes very slow and condition a) is not satisfied. The only way to satisfy it is to convert the array to bytes with d['b'].tobytes() and use np.frombuffer on deserialisation. Note that any conversion of a byte array to a string type takes at least 10x longer than tobytes().
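To illustrate the two paths (a rough sketch only; dtype and shape handling are omitted, and this is not part of the actual framework):

import numpy as np

arr = np.array([0, 1])

# Fast path: raw memory copy, no per-element conversion.
raw = arr.tobytes()
restored = np.frombuffer(raw, dtype=arr.dtype)

# Slow path: per-element conversion to a Python list,
# which the JSON encoder then walks element by element.
as_list = arr.tolist()
restored_from_list = np.array(as_list)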

However, JSON does not support bytes, and reasonably so. So I looked at alternatives such as binary JSON, but obviously none of them are readable, so condition b) is not satisfied.

So I pretty much ran out of options and started looking at new solutions. Now, for speed I am using the orjson library, which is a Python library written in Rust that in turn uses yyjson. So I tried implementing a bytes type. All went well as long as the bytes object was encoded in a JSON-readable format (such as base64). However, if it was an arbitrary byte array without any transformation, it broke at the yyjson level, and that is completely understandable. Now, base64 encoding also takes at least 10x longer than tobytes().
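(For illustration, roughly the same base64 approach can be done purely at the Python level through orjson's default hook, without touching the Rust layer; the __bytes__ tagging convention below is just my own assumption for the sketch:)

import base64
import orjson

def encode_default(obj):
    # orjson calls this hook for types it cannot serialise natively.
    if isinstance(obj, (bytes, bytearray)):
        return {"__bytes__": base64.b64encode(obj).decode("ascii")}
    raise TypeError

def decode_bytes(value):
    # Undo the tagging on the way back in (top-level values only, for brevity).
    if isinstance(value, dict) and "__bytes__" in value:
        return base64.b64decode(value["__bytes__"])
    return value

payload = orjson.dumps({"a": 1, "b": b"\x00\x01"}, default=encode_default)
data = {k: decode_bytes(v) for k, v in orjson.loads(payload).items()}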

So in short, everything becomes a bottleneck.

So now I am left with 2 options:
a) Make a multi-part serialiser which keeps only the metadata in the first part, as nicely readable JSON, and transfers the bytes separately
b) Think of something new.

Now I attempted a) and managed to get an initial version working. However, the sheer complexity it introduced to the framework left me very unsatisfied, e.g. implementing complex data structures with byte components, which in turn hold other structures, which have byte components in them. Essentially, at this point d) (simplicity) started to suffer greatly.

So here comes the idea/proposal/discussion material/request for feedback and ideas - whatever you want to call it.

So then I thought: why isn't it possible to simply have a bytes type in JSON? Well, the issue is pretty clear. How do you know its length? This makes it troublesome for serialisers.

But is it such a big issue? So my proposal is this: have an optional superset of JSON which adds a bytes type. For those who do not need it, the flag is obviously disabled by default. However, for those who know what they are doing, it would look like this:

{
  "a": 1,
  "bytes_key": b"<length_of_bytes>:<byte_stream>"
}

Now, the drawbacks are obvious:
a) Inserting bytes is more involved compared to other types
b) Parsers / text editors will fail to parse if the length is incorrect
c) Editors would need a new rule to display this nicely; otherwise they will just default to displaying a binary file (at least Sublime does)

However, the benefits (at least from my perspective) seem to outweigh the cost: namely, the ability to satisfy the constraints I listed at the beginning.

With this ability, JSON could be used as a base for complex serialisers while achieving speed on par with protocol buffers.

And most importantly, it would provide readable single-part messages. It would be a flexible solution fit for a wide range of applications, and it could potentially simplify messaging for those who prefer a single solution for many use cases: small readable messages as well as large blobs with complex data structures.

Finally, I don't think the implementation is that difficult:
a) Implement bytes serialisation: just prepend b"<byte_array_length>: and append ", putting the raw byte stream in the middle (a sketch follows after the note below)
b) Implement bytes de-serialisation: read the specified number of bytes
c) Come up with a new file extension, e.g. bjson
d) Wait and see if anyone makes use of this (apart from me)

Note, <byte_array_length> could be either a 64-bit integer or ASCII digits. ASCII digits might be preferable, since they would allow manual insertion of byte arrays into text files (as long as the byte array data happens to be valid JSON string (UTF-8) content).
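To make this concrete, here is a rough Python sketch of what encoding and decoding such a value could look like (the function names and the exact grammar are my own assumptions, not an existing yyjson feature):

def encode_bytes_value(data: bytes) -> bytes:
    # Produces b"<ascii_length>:<raw bytes>"
    return b'b"' + str(len(data)).encode("ascii") + b":" + data + b'"'

def decode_bytes_value(buf: bytes, pos: int) -> tuple[bytes, int]:
    # Expects buf[pos:] to start with the b" prefix; returns the payload
    # and the position just past the closing quote.
    assert buf[pos:pos + 2] == b'b"'
    pos += 2
    colon = buf.index(b":", pos)
    length = int(buf[pos:colon])
    start = colon + 1
    end = start + length
    assert buf[end:end + 1] == b'"'
    return buf[start:end], end + 1

value = encode_bytes_value(b"\x00\xff\x10")
payload, _ = decode_bytes_value(value, 0)
assert payload == b"\x00\xff\x10"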

So:

  1. Maybe I am missing something and there already exists something I could use?
  2. Maybe I am missing a reason why this is not a good idea, or obstacles to implementing it?

Any kind of responses are welcome.

Regards, DG

ibireme commented 4 months ago

UTF-8 encoding has some rules: https://en.wikipedia.org/wiki/UTF-8#Encoding, so not all binary data is valid UTF-8 text. If a text editor needs to show or edit binary (like copying and pasting into a document), it must still be handled in Hex or Base64.

I suggest checking out existing data serialization formats here: https://en.wikipedia.org/wiki/Comparison_of_data-serialization_formats. Maybe something fits your needs, or there might be existing editor plugins that can help.

dg-pb commented 4 months ago

I am well aware of that.

And I have searched for data serialisation formats fairly thoroughly, although I may have missed something (time will tell).

And yes, I understand that editors would need to implement a mixed encoded/binary display for such a data format. They would essentially need to detect the beginning of a binary value and display the following specified number of bytes in hex/binary.

I appreciate that this would require work on quite a few different levels, and it might be too much work if it were only useful for this specific case.

However, in the long run such features could prove useful. If editors extended their flexibility to mixed binary, the architecture would be there to provide other conveniences, for example displaying files with mixed encodings or providing custom rules for abbreviating verbose language code.

E.g. a user could define his own rules for the text editor to abbreviate commonly used patterns: Python's os.path.abspath(os.path.expanduser(path)) could be abbreviated to !FullPath(path). Such a pattern would be expanded literally in the file, but truncated to its shorthand representation at the display level.

That said, I will skim through the different formats again. Maybe such a format already exists.

dg-pb commented 4 months ago

Found one which suits the criteria quite well: Amazon Ion.

It encodes binary as base64 and is readable similarly to JSON.

Unfortunately, its Python implementation is fairly slow compared to yyjson.
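(For reference, Ion's text form writes binary as a base64 blob between double braces, so, if I read the spec right, a struct with a bytes field would look roughly like this; the payload here is just a placeholder:)

{
  a: 1,
  b: {{ aGVsbG8= }}
}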

dg-pb commented 4 months ago

Implemented a bytes type for JSON in Rust, in the orjson library (which uses yyjson). Did it via base64 encoding.

Speed comparison with pickle for a 3 KB payload:

In [24]: %timeit pickle.dumps(e)
492 ns
In [25]: %timeit orjson.dumps(d)
465 ns
In [26]: %timeit pickle.loads(ee)
321 ns
In [27]: %timeit orjson.loads(dd)
667 ns

orjson is faster for smaller byte payloads, but pickle overtakes it as the size increases.
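(For reference, the harness is roughly of this shape; the exact payload is not shown above, so the ~3 KB bytes field here is just an assumption, and the orjson build is the patched one with the base64 bytes support described earlier:)

import pickle
import orjson

obj = {"a": 1, "b": b"\x00" * 3000}   # assumed ~3 KB binary payload
blob_pickle = pickle.dumps(obj)
blob_json = orjson.dumps(obj)          # requires the bytes patch or a default hook

# In IPython:
# %timeit pickle.dumps(obj)
# %timeit orjson.dumps(obj)
# %timeit pickle.loads(blob_pickle)
# %timeit orjson.loads(blob_json)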

I think I will keep using multi-part messages for the time being. The only way I see single-object blobs being worth it is if this were implemented in yyjson itself, so that it accepted raw unencoded bytes.

Otherwise efficiency becomes suboptimal, which pretty much kills the whole joy of using the fastest JSON library.

I hope this will be implemented some day.

dg-pb commented 4 months ago

Question: if I implemented this, would there be interest in merging it?

"It" meaning: OPTIONAL functionality, disabled by default. If enabled, it supports the raw unencoded bytes type specified above.

ibireme commented 4 months ago

I understand where you're coming from. You want to embed unencoded binary data within JSON to avoid encoding/decoding overhead. However, once you embed binary data, it becomes a new format, not text anymore, and no editor or other library can handle it right now.

If only yyjson could read and write it, it would turn into a private format, losing the whole point of JSON as a data exchange format.

How about trying out the MessagePack format? It's a widely used binary format that converts seamlessly to and from JSON. VSCode has a plugin for reading and editing it, and there's a Python library that shows excellent performance: https://jcristharif.com/msgspec/benchmarks.html#messagepack-serialization
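As a minimal sketch (assuming a plain dict with a bytes value), the MessagePack path keeps bytes unencoded, since the format has a native binary type:

import msgspec

d = {"a": 1, "b": b"\x00\x01"}

packed = msgspec.msgpack.encode(d)        # bytes are stored as a raw bin field
restored = msgspec.msgpack.decode(packed)
assert restored == d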

dg-pb commented 4 months ago

Yes, it would essentially be a new format, and editors would have to do the same thing VSCode did for MessagePack.

I really like the simplicity of JSON. I use it in many places and I like tools I can use everywhere. If JSON had an option for binary types, then I would only need to learn 1 extra type; with MessagePack I need to get used to a completely new format. Now, learning it might be reasonable if the benefits were worth it, e.g. if it were at least 2 times faster. But when I tested orjson some time ago, it was the fastest option among JSON / BSON / MessagePack and all similar formats across different Python implementations.

So I thought ok. JSON is:

  1. Widely used - essentially everywhere.
  2. Human-readable.
  3. If I need complex functionality around it, I can abstract it. E.g. Dhall configs can do a lot of fancy things, but Dhall is a big dependency; the same can be done in a few hundred lines of Python wrapping the underlying JSON.

But, JSON is not good at raw binary data.

So I can use it everywhere and for everything, except where I need high-speed binary transfer.

So I thought adding such a feature might actually be worth it. Instead of reinventing the wheel and depending on many different formats, I could use JSON everywhere with different tweaks. When I don't have a binary type, it is plain JSON; when I do, the file is unreadable in editors as it is, but editors could implement features for it. And furthermore, python -c "print(open('file.bjson').read())" would still output something largely human-readable.

As of now, I think such a thing could be a fairly optimal long-term solution in this age of information overload. Obviously, different formats have their advantages, but if someone needs a happy-medium format that can be used for anything and does OK in all aspects - this could be it.

I will try the MessagePack library you suggested and see how it looks in my own benchmarks. We'll see if it outperforms orjson...

It might be an option for now, but I am reluctant to let go of this idea just yet. I tried making peace with some compromise, but my mind drags me back to exploring this possibility...

dg-pb commented 4 months ago

Thank you for the link. I think I am switching from orjson to msgspec. 👍

Hm, so the performance difference between JSON and msgpack using this library seems to be fairly small.

%timeit msgspec.json.encode(d)     # 118 ns
%timeit msgspec.msgpack.encode(d)  # 116 ns

For a very basic object.