Flexible `to_py` conversion of buffers

hoodmane commented 3 years ago

Hi @hoodmane I have another issue related to your PR, basically I am sending bytes nested in a dictionary (from native python on a server) to javascript, then I called to_py() in pyodide. Since all the bytes becomes memoryview, I have to convert them all to bytes in loops. Similar to toJs() which allows configure the dict_converter, it would be really great if to_py() be configured with typedarray_converter so the conversion can be done recursively.

Besides that, right now, the to_py conversion treat ArrayBuffer and TypedArray the same way, I am wondering would it make sense to separate them, e.g. allowing configure arraybuffer_converter and typedarray+converter so one can convert ArrayBuffer and TypedArray differently. I mean all these can be done by additional postprocessing to fit different needs, but for my case I do this in constant interaction with servers, having these options will have huge implication in performance.

Does this make sense to you?

Originally posted by @oeway in https://github.com/pyodide/pyodide/issues/1864#issuecomment-937927746

hoodmane commented 3 years ago

Hi @oeway, thanks again for your design input.

My thoughts on this are as follows: you can currently implement to_bytes in Javascript as follows:

let bytes = pyodide.globals.get("bytes");
function to_bytes(buf){
    let result = bytes(buf.byteLength);
    let result_buf = result.getBuffer("u8");
    if(ArrayBuffer.isview(buf)){
        buf = new Uint8Array(buf.buffer, buf.byteOffset, buf.byteLength);
    } else {
        buf = new Uint8Array(buf);
    }
    result_buf.data.set(buf);
    result_buf.release();
    return result;
}

(Note that doing this is perfectly free of any memory leaks on the development branch because of #1573. This code is leaky in v0.18.1)

From here you can do:

self.jsbuf = new Uint16Array([10, 11, 7, 1]);
pyodide.runPython(`
    from js import to_bytes, jsbuf
    assert to_bytes(jsbuf) == bytes([10, 0, 11, 0, 7, 0, 1, 0])
`);

Because this is possible, I think the right solution is to have a buffer_converter option which would just be a Python callable. If you need it to do special things for ArrayBuffer or for TypedArray or something, you can write your own similar function. I think it's fine to pass both ArrayBuffer and TypedArray to the same function, since we will generally need to branch on ArrayBuffer.isView but I think it's mostly desirable to do similar stuff with both.

OTOH, I could see making both a typedarray_converter and an arraybuffer_converter if you are a strong advocate for that.

How does this sound?

oeway commented 3 years ago

Hi, thanks a lot for looking into this!

First of all, I think it's great to have at least buffer_converter option for to_py(). For your to_bytes implemetation, my only problem with it is the performance wise (due to the copy operations). My current solution is to do tobytes() on the memoryview object in Python directly which in principle a share-memory operation and thus faster, correct?

I am more interested in a solution that requires no memory copy and no manual recursion for nested object to dictionary conversion. So if you can implement the buffer_converter option, we can then pass set buffer_converter=bytes if the default type is memoryview.

Here is what I just tried:

>>> memoryview(b'abc')
<memory at 0xfd1720>
>>> bytes(memoryview(b'abc'))
b'abc'

If you need it to do special things for ArrayBuffer or for TypedArray or something, you can write your own similar function. I think it's fine to pass both ArrayBuffer and TypedArray to the same function, since we will generally need to branch on ArrayBuffer.isView but I think it's mostly desirable to do similar stuff with both.

Agreed.

hoodmane commented 3 years ago

For your to_bytes implemnetation, my only problem with it is the performance wise (due to the copy operations).

This is inevitable if the starting object does not live on the wasm heap. We have to copy the data from the ArrayBuffer into the wasm heap so that wasm can touch it without a Javascript "trampoline". If you are after performance, then the data should already be on the wasm heap. There is a different design issue of how to track memory ownership through this since a raw TypedArray pointing to part of the wasm heap doesn't have anything to track whether the data has been freed or not.

we can then pass set buffer_converter=bytes if the default type is memoryview.

Doing bytes(some_memoryview) is a copy so this design is not great if your goal is to avoid incurring extra copies.

oeway commented 3 years ago

Oh, I see now, make sense, and I didn't know that bytes(memoryview) will create a copy. Is it the same for memoryview.tobytes? In that case, can you provide some utility functions for buffer_converter which won't create additional copies?

hoodmane commented 3 years ago

I didn't know that bytes(memoryview) will create a copy.

Yeah the bytes object wants to be read only so it doesn't want there to be other references to the bytes. Even if the memoryview is read only, it still copies.

Is it the same for memoryview.tobytes?

Yes.

can you provide some utility functions for buffer_converter which won't create additional copies?

I think buffer_converter can just take the buffer itself (as a JsProxy) and you can in Javascript or with create_proxy implement your own converter which makes only one copy. One copy is the minimum possible anyways. But I could be misunderstanding what you are asking for.

pyodide / pyodide

Flexible `to_py` conversion of buffers #1878