Serialize to and deserialize from Apache Arrow format

ghost commented 6 years ago

I am using Arrow, which uses FlatBuffers internally, and they are very fast.

I would be interested in extending qframe to work with FlatBuffers.

There is also a schemaless variant of FlatBuffers called FlexBuffers, which does not enforce a schema. I expect this is what you want to use for qframe.

tobgu commented 6 years ago

Cool, I'd be very happy to take contributions in this area! I'll be happy to discuss this further with you, answer any questions about the current implementation and/or review PRs.

ghost commented 6 years ago

Thanks! Well, there is a great FlatBuffers library called gotables. This is worth considering, and arrow is much latter I feel.

Check this out and have a play and think how it relates to qframe.

https://github.com/urban-wombat

I plan to work up more stuff with gotables in the urban-wombat repos.

Just totally out of time right now. My reasons for FlatBuffers are speed, speed, speed. Also, FlatBuffers work as both a fast database and a fast network transport, the two core things every architecture needs. By using them for both the db and network serialisation you have way less code and higher speed again.

Anyway, I am very curious how it can mate with QFrame, as immutability is really important.

tobgu commented 6 years ago

I took some time to check out gotables, FlatBuffers and how they relate to Arrow. As you mention, Arrow uses FlatBuffers for the metadata, which seems nice. I don't really understand what you mean when you say that "arrow is much latter". Even if you use FlatBuffers for the actual data serialization, wouldn't you have to come up with the schema/format of the data you want to store? Do you mean that a custom data format (based on gotables, for example) should be used initially?

Wouldn't it make sense to adopt the Arrow schema from the start and use that as the "native" serialization schema for QFrame? While browsing the Arrow data layout docs it seemed to me that a lot of the data should be usable with zero copying when "deserializing", given the current internal data formats in QFrame columns. Where that is currently not the case, adjustments to the internal format may be possible to allow it.

ghost commented 6 years ago

Agree that the Arrow schema makes sense. I feel out of my depth on Arrow here; I have not dug into it enough to even comment.

Also, the InfluxDB startup donated the Go code, by the way. It's up in the air whether it will be maintained; it has not been touched in ages.

tobgu commented 6 years ago

Yes, I also noticed the work on Arrow from Influx when it was first released and was very excited. I've also noticed that not much has happened since then. I hope they will pick it up again!

ghost commented 6 years ago

Ok so let's wait and see first if that repo gets some traction.

You can leave this issue open if you like or close it.

tobgu commented 6 years ago

I think I'll start experimenting with the Arrow format for fast serialization and deserialization of QFrames, without waiting for the official repo, to see how far the current internal representation is from the Arrow format. I'm already in need of an efficient binary format for that, so why not choose Arrow?

If that repo starts moving again it may make sense to align the internal representation with Arrow entirely, since it would give access to some AVX2-optimized aggregations, etc. that they seem to be developing.

I'll change the title of this ticket a bit to narrow the focus to serialization and deserialization for now though.

ghost commented 6 years ago

Sorry about the one-month delay. Sounds like a good approach to use the Arrow format. Has Influx or anyone else touched the Go implementation at all, though?

https://github.com/apache/arrow/tree/master/go/arrow

Nope.. hmm.

It seems that sbinet is the maintainer of the Go Arrow code? https://github.com/apache/arrow/commits?author=sbinet

Might want to chat to him. He works at CERN, I think?

sbinet commented 6 years ago

I've started to work on providing support for List arrays:

https://github.com/apache/arrow/pull/2402

feel free to have a look at that and comment/improve :)

(PS: I work for IN2P3/CNRS, kind of the French equivalent of NSF/DOE, and I do work for some experiments based at CERN, but I am not a CERN employee per se.)

sbinet commented 6 years ago

and now the PR for Struct arrays:

https://github.com/apache/arrow/pull/2411

tobgu commented 6 years ago

Cool @sbinet, great to see the arrow initiative for Go moving again!

ghost commented 6 years ago

Wow guys, this is great. QFrame with Arrow saves a mountain of hoops to jump through.

Much thanks and will play around with this. If anyone has a project using these bits together please add the link ...