pyodide / pyodide

Pyodide is a Python distribution for the browser and Node.js based on WebAssembly
https://pyodide.org/en/stable/
Mozilla Public License 2.0
12.01k stars 812 forks source link

Use javascript ndarray package for Buffer conversions #1168

Closed hoodmane closed 3 years ago

hoodmane commented 3 years ago

The ndarray npm package seems to be the clear favorite for ndarrays in javascript: https://www.npmjs.com/package/ndarray They don't have an analogue for the suboffsets field of the Python Buffer Protocol, but as long as suboffsets is null, it should be possible to do a fairly direct conversion. This should be much better than the current array copying approach, which doesn't really work correctly (see #1134).

rth commented 3 years ago

+1 to some standard container for arrays, however I think we should study this question in more details. By now I was hoping that there would be some standard implementations if N dimensional arrays for WASM, maybe written in Rust, that we could use. But after after a quick look it doesn't seem to be the case (e.g. https://github.com/rust-ndarray/ndarray/issues/845).

https://github.com/scijs/ndarray is indeed a lightweight implementation (350 LoC), but as far as I understand it's based on 1D JS arrays and I wonder how the performance is if you try to do basic operations with it. For instance, say I have a 2D array and I want to sum it along an axis. It looks like this would need to be done manually with for loops. So as a container it works, but not so much as an ndarray library (at least coming from python) since the only implemented feature seems to be indexing.

The is also the question of Appache Arrow which is getting fairly popular for tabular data. There is a JS implementation in https://www.npmjs.com/package/apache-arrow That would likely make sense for tabular data, and possibly 1D array but I'm not sure they support ND arrays. For more details about this topic see #802

Also related there is some traction for the zarr ndarray format e.g. in https://github.com/zarr-developers/community/issues/14 but it's more a storage format.

hoodmane commented 3 years ago

By now I was hoping that there would be some standard implementations if N dimensional arrays for WASM, maybe written in Rust, that we could use.

That would be great, and I agree that it seems like this ought to exist.

it's based on 1D JS arrays

To be fair, "1D JS arrays" are just Javascript's memory buffers. In some sense all efficient multidimensional arrays are abstractions built on top of memory which is a 1D array.

the only implemented feature seems to be indexing

It looks to me like there is a significant amount of functionality supported in small side modules. If you search for ndarray on npmjs, most packages found there add functionality to this ndarray package. It's common in packages intended for the web to minimize download size by splitting into lots of small pieces so you only pay for what you need. Here's the list of modules that work with ndarray: https://github.com/scijs/ndarray/wiki/ndarray-module-list

hoodmane commented 3 years ago

Apache Arrow certainly looks like it is rich enough for this use case, seems like it might take a lot of effort to integrate it though. It seems to me like ndarray is simple enough that we could quickly and easily use it to improve over the current undesirable situation of using nested Javascript Arrays (which are hashmaps) for Javascript multidimensional arrays.

hoodmane commented 3 years ago

I guess the main issue here is we don't really have a clear use case for the javascript memory buffers. If someone wants to do some sort of scientific computing using a javascript library, then that library probably needs some particular data format. Perhaps the best thing to do is to set up a few convenient way to access / manipulate python arrays from javascript and then allow people to write their own converters to whatever format they need?

See related discussions in #267 and #964.

I should also say that I am pretty ignorant about low level memory formats / scientific computing so my opinions about this are not well informed.

daoxian commented 3 years ago

@hoodmane In fact, a automatic/implied conversion from buffer to JS ndarray is just an "icing on the cake icing" in my opinion, yet the key factor is to make JS and Python see and locate each other's variable memory address easily and as fast as possible. The users can even handle the type conversions manually by themselves once they obtain the variable address.

As far as I know, npm module numjs maybe a appropriate candidate to undertake the Buffer conversions. Numjs is based on ndarray, and is also good at image processing.

hoodmane commented 3 years ago

Right, well as I think we see in #1202 the performance problem is caused by the attempt to make the memory convenient to access from Javascript rather than returning a raw memory buffer.

The users can even handle the type conversions manually by themselves once they obtain the variable address.

I can show you how to do that, though of course it's not stable. Pyodide tries to be very clever and do the right thing, but it can be helpful in many cases to directly access internals. We might want to add a separate "unsafe" API with direct memory operations and a warning that misuse will cause undefined behavior.

daoxian commented 3 years ago

Right, well as I think we see in #1202 the performance problem is caused by the attempt to make the memory convenient to access from Javascript rather than returning a raw memory buffer.

@hoodmane Another performance issue is caused by this line: im_patch_bytes = im_patch.tobytes()

(see python codes in #1203 , and that's why I mentioned the issue in #1202 )

In order to get the image content from JS, I have to convert from python's numpy ndarray to bytes in Python, to be converted automatically by pyodide into Uint8ClampedArray. But the tobytes() function call is slow, too. I think maybe there's also a way to obtain python ndarrays directly from JS without extra data duplicating (.tobytes() copies data).

daoxian commented 3 years ago

If pyodide could handle numpy conversion to JS ArrayBuffer without copying data, it would greatly help to improve performance and convinence in many cases.

daoxian commented 3 years ago

In short words, the conversions are necessary from both side, in the combination case of : 1) there's no appropriate functions in python, so 2) python calls function in JS, passing numpy ndarray as args to JS, and 3) JS function return ndarray data back to python.

hoodmane commented 3 years ago

I think what I am going to do is add PyProxyBufferMethods. In the case that PyBuffer_IsContiguous returns true, I can add a method that slices out the entire chunk of memory backing the buffer and returns it as a TypedArray. Otherwise there isn't much that can be done to avoid the copy: it doesn't look to me like there's any support in numjs for strides or suboffsets. This should be a pretty quick patch I think.

hoodmane commented 3 years ago

Should say let depth=1. But you can just use deepCopyToJavascript(1). I think we're going to rename deepCopyToJavascript ==> toJs and delete shallowCopyToJavascript.

hoodmane commented 3 years ago

Closed by #1215: with getBuffer we provide the data that ndarray needs. If they like the user can load ndarray and hook it together.