jcrist / msgspec

A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
https://jcristharif.com/msgspec/
BSD 3-Clause "New" or "Revised" License

Apache Arrow Support #202

Open michalwols opened 1 year ago

michalwols commented 1 year ago

Would be great to have an efficient way to serialize msgspec Structs to Apache Arrow, which would also open them up to Parquet and other tools in the Arrow ecosystem like DuckDB.

jcrist commented 1 year ago

Thanks for opening this! "arrow support" could mean a lot of things, can you provide a few specific concrete tasks you want to be able to handle? What would you use this feature for?

michalwols commented 1 year ago

I'm trying to hack together a human + model in-the-loop dataset management / annotation tool (for computer vision and NLP). It includes:

  1. a JSON-based REST API (Django, with bounding boxes, masks, and model predictions encoded as JSON using msgspec)
  2. online inference / training / background tasks with Ray actors, which includes fetching large embedding tables for nearest-neighbor search, few-shot learning, and ranking
  3. OLAP queries with DuckDB
  4. structured logs (currently JSON lines, but I want to switch to MessagePack using msgspec) for logging query results, model predictions, and training metrics
  5. storing data snapshots / views in Parquet
  6. training models on top of the Parquet files using PyTorch; right now I end up converting samples to dicts, but it would be nice to use the same msgspec.Struct definitions with extra methods for encoding the annotations in different formats.

Ideally I'd like to define the schema for all of these things in one place using msgspec Structs, so the main thing is mapping from a msgspec schema to an Arrow schema. Having an efficient way to serialize between msgspec Structs and Arrow batches/tables without converting to Python dicts in the middle would be great too; a dream scenario would be a zero-copy view on top of Arrow tables using an immutable version of msgspec.Struct.

TL;DR for now: an efficient msgspec.arrow.encode and msgspec.arrow.decode, which would also make it easy to do the same for Parquet.

michaelbilow commented 11 months ago

Bumping this issue up a bit: I have a pretty narrow use case, where I'd like to dump msgspec.Structs into Parquet for long-term storage.

Transcoding to Arrow through msgspec directly would be great, but it would also be fine if I could just get the schema out via https://github.com/koxudaxi/datamodel-code-generator, where I saw @jcrist's comment on https://github.com/koxudaxi/datamodel-code-generator/issues/1278.

Do you have thoughts on which project it might be more appropriate to work on?

cofin commented 10 months ago

> TLDR for now: an efficient msgspec.arrow.encode and msgspec.arrow.decode, which would also make it easy to do the same for parquet.

Just to echo this: I use quite a bit of DuckDB and Arrow, and this type of functionality would be very useful to me.

fungs commented 6 months ago

I can think of many binary serialization formats that are more efficient and have a richer type system than the ones provided here. In fact, I'm working on one for which I want to add msgspec support at the moment. IMO the preferred way would be to ship these as separate packages and make it easy for msgspec to support that.

BTW: right now it seems like the framework is focused on masses of small data objects rather than big ones, which is typical for web-based applications. I believe there are currently some design limitations for handling larger objects, which I'm exploring right now.

Out of curiosity: I thought Arrow was for tabular data. How would it store arbitrary structured data? Wouldn't HDF5 be a more suitable candidate?