modularml / mojo

The Mojo Programming Language
https://docs.modular.com/mojo/manual/

[Feature Request] memoryview builtin and support for python buffer protocol #1515

Open guidorice opened 9 months ago

guidorice commented 9 months ago

What is your request?

This enhancement request is to add support for Python's memoryview builtin and for the Python buffer protocol. Here are some ideas about what kinds of tasks and level of effort might be involved:

What is your motivation for this change?

Currently, Mojo 0.6 has poor (nonexistent?) support for zero-copy shared memory buffers with Python.

For example, the Ray Tracing notebook in Mojo's documentation copies raster imagery into a NumPy array using MLIR ops. Not only is this an unnecessary memory copy, it is also too verbose, undocumented, and not Pythonic. See def to_numpy_image(self) -> PythonObject: in the source notebook.
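To make the copy versus zero-copy distinction concrete, here is a minimal sketch in plain Python (standard library only). The `pixels` buffer is a hypothetical stand-in for the notebook's raster image, not code from the notebook itself; it only illustrates what "sharing memory through the buffer protocol" means on the Python side.

```python
import array

# Hypothetical pixel buffer standing in for the notebook's raster image.
pixels = array.array("B", [0, 64, 128, 255])

copied = bytes(pixels)       # allocates a new object and copies every byte
shared = memoryview(pixels)  # exposes the same memory via the buffer protocol

pixels[0] = 7                # mutate the original buffer

copy_sees_change = copied[0] == 7  # the copy is a snapshot of the old data
view_sees_change = shared[0] == 7  # the view aliases the original memory

print(copy_sees_change, view_sees_change)
```

A Mojo-side equivalent would presumably hand Python a view of Mojo-owned memory rather than a snapshot, which is the behavior this request is asking for.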

Mojo should enable and encourage interop with existing scientific computing packages in the most efficient manner, for example through the Apache Arrow format.

"The Arrow C data interface is inspired by the Python buffer protocol, which has proven immensely successful in allowing various Python libraries exchange numerical data with no knowledge of each other and near-zero adaptation cost." (Arrow Spec)

This enhancement would also lay the groundwork for supporting the Python array API standard.

Any other details?

Related Discussions/Issues:

Reference PEPs:

gryznar commented 9 months ago

As a struct, it should be named MemoryView. Please be consistent and avoid Python's mess in naming!

guidorice commented 9 months ago

Good suggestion! The naming is a bit confusing: there is the type Py_buffer at the C level, MemoryView in Python land, and the memoryview() constructor, also in Python land. I definitely would not want to add new names or concepts if that can be avoided.

Also, I thought this Python example with comments might help to illustrate the idea a little more:

# Made-up example (generated by a chatbot)
import array

import numpy as np

arr = array.array('i', [1, 2, 3, 4, 5])

# Wrap the array's buffer without copying it
mem_view = memoryview(arr)

# Access properties of the memoryview
print(mem_view.nbytes)    # total size in bytes
print(mem_view.itemsize)  # size of one element in bytes

# Indexing and slicing, as with a NumPy array
print(mem_view[0])
print(mem_view[-1])
print(mem_view[1:3])  # slicing returns another memoryview, not a copy

# Iterate through the memoryview
for num in mem_view:
    print(num)

# Get a NumPy array that shares the same memory
num_arr = np.frombuffer(mem_view, dtype=np.int32)
print(num_arr)

output

20
4
1
5
<memory at 0x1011590c0>
1
2
3
4
5
[1 2 3 4 5]
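
As a further sketch (standard library only, and not part of the original example), the snippet below shows the metadata a buffer consumer receives and that writes through a view or a slice land in the original array. The variable names are illustrative; the two-step `cast` is needed because Python's memoryview only casts between formats when one side is a byte format.

```python
import array

arr = array.array("i", [1, 2, 3, 4, 5, 6])
view = memoryview(arr)

# Metadata that any buffer-protocol consumer can inspect.
meta = (view.format, view.ndim, view.shape, view.strides, view.readonly)
print(meta)

# Writing through the view writes into arr's memory.
view[0] = 100

# Slicing yields another view over the same memory, not a copy.
tail = view[4:]
tail[0] = 500

# Reinterpret the same memory as a 2x3 grid: cast to bytes, then to 2-D ints.
grid = view.cast("B").cast("i", shape=[2, 3])

print(arr.tolist())
print(grid.tolist())
```

A Mojo MemoryView (or memory_view) would presumably need to carry the same pieces of information (format, shape, strides, and a read-only flag) to interoperate with existing consumers such as NumPy.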

guidorice commented 9 months ago

I think this enhancement would open up numerous use cases like:

gryznar commented 9 months ago

I am aware that I am being quite pedantic, but if Mojo implements this, it would IMHO be better to sacrifice one more character and name this constructor "memory_view". I don't like Python's style of blending words together without any separator. Keeping names strongly synchronized with Python is also not ideal, because it would require following Python's behaviour exactly, which may be painful in some cases.

If Mojo becomes Python++ instead of a compiled copy of Python, it will gain its own identity, and small improvements like these will be very noticeable.

guidorice commented 1 month ago

Linking to a neat related project here: an Arrow implementation in Mojo, https://github.com/kszucs/firebolt. It unlocks the case where Mojo is the consumer of Arrow data structures.