vespa-engine / vespa

AI + Data, online. https://vespa.ai

Add support for binary feed format #15932

Open · jobergum opened this issue 3 years ago

jobergum commented 3 years ago

Feeding documents with large tensor fields (e.g. tensor(p{},dt{},x[128])) using JSON or XML (deprecated) serialization is cumbersome, as the string representation of float/double values costs a lot of network bandwidth, storage and processing (serialization and deserialization).

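For illustration, a single put for such a tensor in the verbose JSON cells form looks roughly like the sketch below (document id, field name, labels and values are made up; see the Vespa document JSON format documentation for the exact syntax). Every cell repeats its full address and spells its value out as a decimal string:

```json
{
  "put": "id:mynamespace:mytype::doc1",
  "fields": {
    "my_tensor": {
      "cells": [
        { "address": { "p": "user_1", "dt": "2020-11-01", "x": "0" }, "value": -0.73463976 },
        { "address": { "p": "user_1", "dt": "2020-11-01", "x": "1" }, "value": 0.28517938 }
      ]
    }
  }
}
```

With x[128], each (p, dt) combination needs 128 such cells, so the address strings and decimal values dominate the payload.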

kkraune commented 3 years ago

Should we have a sample docproc that transforms a binary field into a tensor field?
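If we go that route, a minimal sketch of such a docproc could look like the following, assuming the binary data is fed into a raw field as packed little-endian floats and converted to a dense tensor field during processing. The package, field names and the 128-float layout are hypothetical:

```java
package ai.vespa.example;                             // hypothetical package

import com.yahoo.docproc.DocumentProcessor;
import com.yahoo.docproc.Processing;
import com.yahoo.document.Document;
import com.yahoo.document.DocumentOperation;
import com.yahoo.document.DocumentPut;
import com.yahoo.document.datatypes.Raw;
import com.yahoo.document.datatypes.TensorFieldValue;
import com.yahoo.tensor.Tensor;
import com.yahoo.tensor.TensorType;

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/**
 * Sketch: reads 128 packed little-endian floats from a raw field
 * ("embedding_raw", made up) and writes them into a dense tensor
 * field ("embedding", made up) of type tensor(x[128]).
 */
public class RawToTensorProcessor extends DocumentProcessor {

    private static final int DIMENSION = 128;
    private static final TensorType TYPE = TensorType.fromSpec("tensor(x[" + DIMENSION + "])");

    @Override
    public Progress process(Processing processing) {
        for (DocumentOperation op : processing.getDocumentOperations()) {
            if (!(op instanceof DocumentPut)) continue;
            Document doc = ((DocumentPut) op).getDocument();

            Raw raw = (Raw) doc.getFieldValue("embedding_raw");
            if (raw == null) continue;                // nothing to convert

            ByteBuffer bytes = raw.getByteBuffer().order(ByteOrder.LITTLE_ENDIAN);
            Tensor.Builder builder = Tensor.Builder.of(TYPE);
            for (int i = 0; i < DIMENSION; i++)
                builder.cell(bytes.getFloat(), i);    // cell value at index i of dimension x

            doc.setFieldValue("embedding", new TensorFieldValue(builder.build()));
        }
        return Progress.DONE;
    }
}
```

Note that a raw field is still base64-encoded in the JSON feed, so this would reduce, but not remove, the text-encoding overhead on the wire.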

baldersheim commented 3 years ago

We do have an undocumented tool, vespa-feed-perf, for simple file-based usage. It can take a .json or .xml file and generate serialized binary documents in our undocumented binary format. You can then compress that file, transfer it, and use the same vespa-feed-perf tool to feed it to Vespa. This is what is done in some of the performance tests to reduce the amount of data. If you are using the http client, I guess it can use gzip compression to reduce the network cost.
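For the compress-and-transfer step specifically, a plain-JDK sketch (file names made up) could be as simple as:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

/** Gzip a serialized feed file before shipping it to the feeding host. */
public class CompressFeedFile {
    public static void main(String[] args) throws IOException {
        Path in = Path.of("feed.bin");        // e.g. output of the binary serialization step
        Path out = Path.of("feed.bin.gz");
        try (InputStream src = Files.newInputStream(in);
             OutputStream dst = new GZIPOutputStream(Files.newOutputStream(out))) {
            src.transferTo(dst);              // streaming copy, no need to hold the file in memory
        }
    }
}
```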

jobergum commented 3 years ago

I think the main pain point is storage and the cost of serialization and deserialization, including compression. To feed from the grid I need to convert to JSON, transfer it over the wire through the vespa http client, and then it is deserialized and converted to the Vespa binary protocol.
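As a rough, illustrative back-of-envelope (not a measurement): one dense block of 128 floats is 512 bytes in binary, while the same values printed as decimal strings in a JSON array are typically two to three times that, before even counting per-cell addresses:

```java
import java.util.Random;

/** Rough text-vs-binary size estimate for one dense block of 128 floats. */
public class FeedSizeEstimate {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        float[] block = new float[128];
        for (int i = 0; i < block.length; i++) block[i] = rnd.nextFloat() * 2 - 1;

        StringBuilder json = new StringBuilder("[");
        for (int i = 0; i < block.length; i++) {
            if (i > 0) json.append(',');
            json.append(block[i]);                    // decimal string, e.g. "-0.7346398"
        }
        json.append(']');

        int textBytes = json.length();                // ~1 byte per ASCII character
        int binaryBytes = block.length * Float.BYTES; // 4 bytes per float
        System.out.printf("JSON values array: %d bytes, raw floats: %d bytes (%.1fx)%n",
                textBytes, binaryBytes, (double) textBytes / binaryBytes);
    }
}
```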