onedata / oneclient

Oneclient is the Onedata command line interface for mounting a distributed virtual filesystem on local machines.
https://onedata.org
MIT License

memory mapping and fuse #9

Open image357 opened 4 years ago

image357 commented 4 years ago

Hey,

I ran into a problem when using oneclient with memory mapping and numpy arrays. To reproduce it, run the following Python script inside a oneclient mount:

import numpy as np

save_array = np.arange(9).reshape(3,3)
np.save("array.npy", save_array, fix_imports=False)

# works
load_array1 = np.load("array.npy", mmap_mode="c", fix_imports=False)

# doesn't work
load_array2 = np.load("array.npy", mmap_mode="r", fix_imports=False)

The error in the last step is:

Traceback (most recent call last):
  File "test.py", line 10, in <module>
    load_array2 = np.load("array.npy", mmap_mode="r", fix_imports=False)
  File "/usr/local/lib/python3.6/dist-packages/numpy/lib/npyio.py", line 450, in load
    return format.open_memmap(file, mode=mmap_mode)
  File "/usr/local/lib/python3.6/dist-packages/numpy/lib/format.py", line 869, in open_memmap
    mode=mode, offset=offset)
  File "/usr/local/lib/python3.6/dist-packages/numpy/core/memmap.py", line 264, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
OSError: [Errno 19] No such device

As far as I could find out, the problem might have something to do with fuse: https://stackoverflow.com/questions/46839807/mmap-no-such-device

Any way to fix this on the oneclient side? E.g. mount options?

bkryza commented 4 years ago

@image357 Hi, the problem here is indeed with fuse. For some reason, mmap on fuse requires the MAP_PRIVATE flag to be passed to the mmap call. Unfortunately, it is not possible to pass this flag through the high-level numpy API, but it is possible to create the mmap manually and pass it to ndarray like this:

import numpy as np
import os
import mmap

save_array = np.arange(9).reshape(3,3)
np.save("array.npy", save_array, fix_imports=False)

# works
load_array1 = np.load("array.npy", mmap_mode="c", fix_imports=False)

# works
size = os.path.getsize("array.npy")
with open("array.npy", "r") as f2:
    mm = mmap.mmap(f2.fileno(), size, offset=0, flags=mmap.MAP_PRIVATE)
    array2 = np.ndarray((3,3), buffer=mm)
    print(array2)

Please let us know whether this approach is acceptable.

image357 commented 4 years ago

Thanks for your answer and sorry for the late reply. This doesn't work, though. I guess the reason is that .npy files have a specific header that saves dtype, shape and other information. Hence, putting the plain file as the ndarray buffer can't work.

The output of your code is

[[1.87585069e-309 1.17119999e+171 5.93271341e-037]
 [8.44740097e+252 2.65141232e+180 9.92152605e+247]
 [2.16209968e+233 1.05161974e-153 6.01399921e-154]]

which is not the original array. I also played around using different dtype and order arguments.
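For illustration, a header-aware variant of the manual mapping could look like the sketch below. It parses the .npy header with the helpers in numpy.lib.format to recover shape, dtype and the offset of the raw data, and only then wraps the mapping in an ndarray. The function name load_npy_private is made up for this example, only format versions 1.0 and 2.0 are handled, and it has not been tested on a oneclient mount, so whether fuse accepts this mapping there is an assumption.

```python
import mmap

import numpy as np
from numpy.lib import format as npy_format


def load_npy_private(path):
    """Map a .npy file with MAP_PRIVATE and return a read-only array view.

    Hypothetical helper for illustration; not part of numpy or oneclient.
    """
    with open(path, "rb") as f:
        # The header stores shape, memory order and dtype ahead of the raw data.
        version = npy_format.read_magic(f)
        if version == (1, 0):
            shape, fortran_order, dtype = npy_format.read_array_header_1_0(f)
        elif version == (2, 0):
            shape, fortran_order, dtype = npy_format.read_array_header_2_0(f)
        else:
            raise ValueError("unsupported .npy format version: %r" % (version,))
        if dtype.hasobject:
            raise ValueError("object arrays cannot be memory-mapped")

        # The raw array data starts right after the header.
        data_offset = f.tell()

        # Length 0 maps the whole file. The mapping stays valid after the
        # file object is closed.
        mm = mmap.mmap(f.fileno(), 0, flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)

    return np.ndarray(
        shape,
        dtype=dtype,
        buffer=mm,
        offset=data_offset,
        order="F" if fortran_order else "C",
    )
```

Calling load_npy_private("array.npy") on the file from the script above should then give back the original 3x3 integer array instead of the header bytes reinterpreted as floats.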

Also, mmap.MAP_PRIVATE effectively creates a copy-on-write mapping, which is equivalent to the 'c' option in np.load(..., mmap_mode='c', ...). I suppose this is something that has to be fixed on the fuse side, or it might not be fixable at all.
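If the goal of mmap_mode="r" is mainly to guard against accidental writes, one possible approximation is to load with the copy-on-write mode that already works and then mark the array as non-writeable. This is only a sketch: the helper name load_npy_readonly_cow is made up, and the mapping underneath is still a private one, not the shared read-only mapping that mode "r" would request.

```python
import numpy as np


def load_npy_readonly_cow(path):
    """Load a .npy file as a copy-on-write memmap and forbid writes to it.

    Hypothetical helper for illustration; not part of numpy or oneclient.
    """
    # mmap_mode="c" is the mode reported as working on the oneclient mount.
    arr = np.load(path, mmap_mode="c", fix_imports=False)
    # Clearing the writeable flag makes any in-place assignment raise ValueError.
    arr.flags.writeable = False
    return arr
```

With this, load_npy_readonly_cow("array.npy")[0, 0] = 1 raises a ValueError instead of silently modifying the private copy.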

bkryza commented 4 years ago

@image357 ok, thanks for the information. Unfortunately, it looks like this will not be possible through oneclient.

However, we also provide a Python library - OnedataFS - which gives direct access to our filesystem without Fuse. I will try to check if it will work with mmap().

OnedataFS is available by default in the oneclient Docker image, or it can be installed from packages. It implements the PyFilesystem API (https://docs.pyfilesystem.org/en/latest/index.html). A basic usage example is as follows:

from fs.onedatafs import OnedataFS
oneprovider_host = "example.com"
oneprovider_token = "ABCD...."
odfs = OnedataFS(oneprovider_host, oneprovider_token)
spaces = odfs.listdir('')
...

Even if mmap doesn't work, please note that each file opened through OnedataFS has an internal memory buffer which prefetches from the storage only the blocks requested by IO operations on the handle. It won't read the entire file into memory unless necessary, so mmap may not be needed in your case.
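As a rough sketch of what that could look like without mmap, the function below reads only a range of rows from a .npy file through an ordinary binary file handle, using seek and read. The name read_npy_rows is made up for this example. It assumes a C-ordered, non-object array in format version 1.0 or 2.0, and a handle whose read() returns all the requested bytes. With OnedataFS such a handle would presumably come from the PyFilesystem openbin call, which has not been verified here.

```python
import numpy as np
from numpy.lib import format as npy_format


def read_npy_rows(handle, start, stop):
    """Read rows [start, stop) of a .npy array from a seekable binary handle.

    Hypothetical helper for illustration; not part of numpy or OnedataFS.
    """
    version = npy_format.read_magic(handle)
    if version == (1, 0):
        shape, fortran_order, dtype = npy_format.read_array_header_1_0(handle)
    elif version == (2, 0):
        shape, fortran_order, dtype = npy_format.read_array_header_2_0(handle)
    else:
        raise ValueError("unsupported .npy format version: %r" % (version,))
    if fortran_order or dtype.hasobject or len(shape) == 0:
        raise ValueError("only C-ordered, non-object arrays with rows are handled")

    # The raw data begins right after the header.
    data_start = handle.tell()

    # Clamp the requested range to the number of rows in the file.
    start, stop, _ = slice(start, stop).indices(shape[0])
    stop = max(stop, start)

    row_bytes = int(np.prod(shape[1:], dtype=np.int64)) * dtype.itemsize

    # Only the bytes of the requested rows are read from the handle.
    handle.seek(data_start + start * row_bytes)
    raw = handle.read((stop - start) * row_bytes)
    return np.frombuffer(raw, dtype=dtype).reshape((stop - start,) + tuple(shape[1:]))
```

The same function works with any seekable binary file object, for example one returned by open("array.npy", "rb"), so it can be tried out locally before pointing it at a OnedataFS handle.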