jasperzhong / cs-notes

CS Knowledge System

Apache Arrow #28

Closed jasperzhong closed 2 years ago

jasperzhong commented 2 years ago

https://arrow.apache.org/docs/python/plasma.html

jasperzhong commented 2 years ago

https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/

Plasma: A High-Performance Shared-Memory Object Store

Plasma holds immutable objects in shared memory so that they can be accessed efficiently by many clients across process boundaries.

One of the goals of Apache Arrow is to serve as a common data layer enabling zero-copy data exchange between multiple frameworks. A key component of this vision is the use of off-heap memory management (via Plasma) for storing and sharing Arrow-serialized objects between applications.

Perfect. There is a term, "off-heap", which refers to data held in memory that the garbage collector does not manage. https://stackoverflow.com/a/45267441/9601110
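
To make the shared-memory, off-heap idea concrete, here is a minimal sketch that uses Python's standard-library `multiprocessing.shared_memory` as a stand-in. It is not Plasma and does not use Plasma's API; the `payload` contents are made up for illustration, and both the "store" and the "client" run in one process here only to keep the example short.

```python
from multiprocessing import shared_memory

payload = b"immutable object bytes"  # illustrative data, not from the source

# "Store" side: allocate a named segment outside the Python object heap
# and copy the payload into it once.
segment = shared_memory.SharedMemory(create=True, size=len(payload))
segment.buf[:len(payload)] = payload

# "Client" side: attach to the same segment by name. In a real setup this
# would happen in a different process.
reader = shared_memory.SharedMemory(name=segment.name)

# A read-only view over the shared bytes; creating it copies nothing.
view = reader.buf[:len(payload)].toreadonly()
is_readonly = view.readonly
received = bytes(view)  # explicit copy, made only so the result can be checked

# The segment's lifetime is managed by hand, not by the garbage collector.
view.release()
reader.close()
segment.close()
segment.unlink()
```

The slice uses `len(payload)` rather than the whole buffer because some platforms round the segment size up to a page boundary.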

It uses Google FlatBuffers as the serialization library. This is a very shallow form of serialization, with speed roughly on par with a raw struct (benchmark).
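
One way to see why a shallow format can approach raw-struct speed: when fields sit at known offsets in a buffer, a reader can fetch a single field without decoding anything else. The sketch below shows that idea with the standard-library `struct` module. It is not the FlatBuffers wire format (which adds offset tables on top), and the record layout (`uint32` id, `float64` score) is a hypothetical one chosen for illustration.

```python
import struct

# Hypothetical fixed layout: uint32 id followed by float64 score,
# little-endian with no padding, so each record is 12 bytes.
RECORD = struct.Struct("<Id")

# Write three records back to back into one flat buffer.
buf = bytearray(RECORD.size * 3)
for i in range(3):
    RECORD.pack_into(buf, i * RECORD.size, i, i * 1.5)

# Read only the score of record 2, directly at its byte offset.
# The id field and the other records are never touched.
score_offset = 2 * RECORD.size + 4  # skip two records, then the 4-byte id
(score,) = struct.unpack_from("<d", buf, score_offset)
```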

jasperzhong commented 2 years ago

https://arrow.apache.org/docs/python/plasma.html

It looks like it can handle NumPy arrays via pa.Tensor.from_numpy. Perfect.

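Usage steps:

  1. Create an ObjectID (20 bytes), typically from np.random.bytes(20).
  2. Create a buffer with client.create(object_id, size). If object_id already exists, an error is raised; if the store is out of memory, a PlasmaStoreFull error is raised.
  3. Write to the buffer, then seal it with client.seal(object_id) so that other clients can read it.
  4. Read the buffer (later).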
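
The steps above can be sketched roughly as follows. This is a sketch under several assumptions: `put_bytes`, `get_bytes`, and `demo` are hypothetical helper names; the socket path `/tmp/plasma` is a placeholder; and `pyarrow.plasma` exists only in older pyarrow releases (it was deprecated in Arrow 10 and removed in Arrow 12), so it is imported lazily inside `demo`. `os.urandom(20)` is used in place of `np.random.bytes(20)` only to avoid a NumPy dependency.

```python
import os


def put_bytes(client, object_id, payload: bytes) -> None:
    """Steps 2-3: allocate a buffer in the store, fill it, then seal it."""
    # create() raises if object_id already exists or the store is full.
    buf = client.create(object_id, len(payload))
    # cast("B") guards against the buffer exposing a signed-byte format.
    memoryview(buf).cast("B")[:] = payload
    # Sealing makes the object immutable and readable by other clients.
    client.seal(object_id)


def get_bytes(client, object_id) -> bytes:
    """Step 4: fetch the sealed buffer and copy its contents out."""
    [buf] = client.get_buffers([object_id])
    return memoryview(buf).cast("B").tobytes()


def demo(socket_path: str = "/tmp/plasma") -> bytes:
    """Needs pyarrow < 12 and a running store process at socket_path."""
    import pyarrow.plasma as plasma  # lazy: absent from newer pyarrow

    client = plasma.connect(socket_path)
    object_id = plasma.ObjectID(os.urandom(20))  # step 1: random 20-byte ID
    put_bytes(client, object_id, b"hello plasma")
    return get_bytes(client, object_id)
```

Because the helpers take the client as a parameter, they can be exercised with any object that offers `create`, `seal`, and `get_buffers`, which is convenient on machines where Plasma is unavailable.
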
jasperzhong commented 2 years ago

https://arrow.apache.org/docs/python/generated/pyarrow.plasma.PlasmaClient.html

client API page.