goodboy / tractor

A distributed, structured concurrent runtime for Python (and friends)
GNU Affero General Public License v3.0

Alternative interchange formats #58

Open goodboy opened 5 years ago

goodboy commented 5 years ago

The list I've been meaning to look through/support:

Maybe more?

We'll need to abstract the channel API so it can accept different stream types. This work will require coordination with the alt transport work in #19.
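As a rough illustration of what "abstracting the channel API" could look like, here is a minimal sketch of a pluggable codec layer sitting under a length-prefixed frame. Everything in it (`Codec`, `JSONCodec`, `PickleCodec`, `pack_frame`, `unpack_frame`) is a hypothetical name made up for this example and is not tractor's actual API; JSON and pickle are just stdlib stand-ins for whichever interchange formats end up being supported.

```python
import json
import pickle
import struct
from typing import Any, Protocol


class Codec(Protocol):
    """Hypothetical interface: turn a Python object into bytes and back."""

    def encode(self, obj: Any) -> bytes: ...
    def decode(self, data: bytes) -> Any: ...


class JSONCodec:
    """Stand-in text codec built on the stdlib json module."""

    def encode(self, obj: Any) -> bytes:
        return json.dumps(obj).encode("utf-8")

    def decode(self, data: bytes) -> Any:
        return json.loads(data.decode("utf-8"))


class PickleCodec:
    """Stand-in binary codec; only safe between mutually trusted processes."""

    def encode(self, obj: Any) -> bytes:
        return pickle.dumps(obj)

    def decode(self, data: bytes) -> Any:
        return pickle.loads(data)


# 4-byte big-endian payload length placed in front of every message.
_HEADER = struct.Struct("!I")


def pack_frame(codec: Codec, obj: Any) -> bytes:
    """Encode obj with the given codec and prepend the length header."""
    payload = codec.encode(obj)
    return _HEADER.pack(len(payload)) + payload


def unpack_frame(codec: Codec, frame: bytes) -> Any:
    """Read the length header, then decode exactly that many payload bytes."""
    (size,) = _HEADER.unpack_from(frame)
    start = _HEADER.size
    return codec.decode(frame[start:start + size])
```

With this shape the framing code never needs to know which format is in use, so swapping formats becomes a matter of passing a different codec object: `unpack_frame(JSONCodec(), pack_frame(JSONCodec(), {"cmd": "ping"}))`.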

goodboy commented 4 years ago

Here is an extremely good write-up on the shortcomings of pandas from the original author, with links to many other great resources.

Apache Arrow very much seems to be a solution to many of the earlier memory-constraint and inter-process ailments of big data with pandas. I haven't dug too much into recent developments, but this article seems like a good entry point.

Anyone wanting to take a look at the IPC section in pyarrow might be able to get something cool going quickly!

salotz commented 4 years ago

You might be interested in this as well: https://github.com/real-logic/aeron

and the binary encoding it uses: https://github.com/real-logic/simple-binary-encoding

Designed for extremely low-latency trading systems. There is a C++ implementation, but no Python interface atm. Not sure exactly what sauce they are using that is better than, say, Cap'n Proto.

All of them are probably useful in different situations. Which complicates things..

Blosc AFAIK is just a compression library. Still useful, and it could be used transparently (that would require intelligence about when data is moving over I/O), but perhaps it should be a user-level thing. My suspicion is that Arrow has compression specifically accounted for, although I don't know.
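To make the "transparent compression" idea concrete, here is a small sketch of a codec wrapper that decides per message whether compressing is worth it. It is only an assumption-laden illustration: `CompressedJSONCodec`, its one-byte flag, and the `min_size` threshold are all hypothetical, stdlib `zlib` stands in for Blosc, and a plain size check stands in for the smarter "is this data actually crossing I/O?" logic mentioned above.

```python
import json
import zlib
from typing import Any


class CompressedJSONCodec:
    """Hypothetical codec that compresses only payloads above a size threshold."""

    RAW = b"\x00"   # flag byte: body is uncompressed JSON
    ZLIB = b"\x01"  # flag byte: body is zlib-compressed JSON

    def __init__(self, min_size: int = 256, level: int = 6) -> None:
        self.min_size = min_size  # payloads smaller than this are sent as-is
        self.level = level        # zlib compression level

    def encode(self, obj: Any) -> bytes:
        payload = json.dumps(obj).encode("utf-8")
        if len(payload) >= self.min_size:
            return self.ZLIB + zlib.compress(payload, self.level)
        return self.RAW + payload

    def decode(self, data: bytes) -> Any:
        # The first byte tells the receiver whether to decompress.
        flag, body = data[:1], data[1:]
        if flag == self.ZLIB:
            body = zlib.decompress(body)
        elif flag != self.RAW:
            raise ValueError(f"unknown compression flag: {flag!r}")
        return json.loads(body.decode("utf-8"))
```

The caller just calls `encode`/`decode` and never sees the flag byte, which is the "transparent" part; exposing `min_size` and `level` is one way to keep the policy a user-level decision.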

salotz commented 4 years ago

For the sake of interest, although it's likely of no use to us: https://kaitai.io/

goodboy commented 4 years ago

Also #8 mentions msgpack-numpy.

While not a new interchange format, it is a system worth comparing against when considering alternatives.

goodboy commented 3 years ago

Interesting historical format: SBE (Simple Binary Encoding), which is (was?) used in financial systems.

The end result of applying these design principles is a codec that has ~16-25 times greater throughput than Google Protocol Buffers (GPB) with very low and predictable latency. This has been observed in micro-benchmarks and real-world application use. A typical market data message can be encoded, or decoded, in ~25ns compared to ~1000ns for the same message with GPB on the same hardware. XML and FIX tag value messages are orders of magnitude slower again.

The sweet spot for SBE is as a codec for structured data that is mostly fixed size fields which are numbers, bitsets, enums, and arrays. While it does work for strings and blobs, many may find some of the restrictions a usability issue. These users would be better off with another codec more suited to string encoding.

Sounds like it would need to be compared with capnproto - haven't dug into any libs yet.
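
To show why "mostly fixed size fields" matters, here is a tiny sketch of a fixed-layout message encoded with the stdlib `struct` module. This is not SBE's wire format or any real SBE library; the `Quote` message, its fields, and the layout string are hypothetical and only illustrate the general idea that when every field has a known width, encoding and decoding reduce to reading and writing at fixed offsets with no per-field parsing.

```python
import struct
from typing import NamedTuple


class Quote(NamedTuple):
    """Hypothetical market-data message made only of fixed-size fields."""

    instrument_id: int  # uint32
    price_ticks: int    # int64, price expressed in integer ticks
    size: int           # uint32
    side: int           # uint8 enum: 0 = bid, 1 = ask


# "<" means little-endian with no padding, so every field sits at a fixed offset.
QUOTE_LAYOUT = struct.Struct("<IqIB")


def encode_quote(q: Quote) -> bytes:
    """Pack the fields back to back into a fixed-size byte string."""
    return QUOTE_LAYOUT.pack(*q)


def decode_quote(buf: bytes, offset: int = 0) -> Quote:
    """Unpack one message starting at offset, without scanning or tokenizing."""
    return Quote(*QUOTE_LAYOUT.unpack_from(buf, offset))
```

Every encoded `Quote` has the same length (`QUOTE_LAYOUT.size` bytes), so a receiver can index into a buffer of many messages by simple multiplication. Strings and blobs break that property, which is presumably the usability restriction the quoted text refers to.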