Open shoyer opened 5 years ago
Currently, mars.tensor.array
and mars.tensor.asarray
which are the counterparts of numpy.array
and numpy.asarray
are used to create mars tensor with array like objet including numpy ndarray.
Sadly, IO interface is truly absent in Mars because we don't come up with IO interfaces to support various file format which are compatible with numpy ones.
The dask.array.from_array
and dask.array.store
look great, and may be a good solution, we will do some study first and try to find out how to add IO interface into Mars.
Discussion is welcomed in this thread.
I want to bring more backgrounds that for the inner projects which use Mars, we write the interface like mars.tensor.read_files(distributed_storage_directory)
, this requires users to write a MapReduce-like program to process the data into different directories in the distributed storage directory, each directory contains files with parquet or some columnar storage format, stored with i, j, k, value
schema.
We were stuck in how to provide a commonly used interface which need to be as compatible as possible.
Indeed, file formats for multi-dimensional arrays are challenging. I think the main virtue of dask's approach here is that it's quite flexible, and allow reusing existing code for working with files. You would not want to write specialized support for reading/writing only a single file-format (users tend to have strong feelings about file formats).
The downside is that now distributed workers need to know how to run arbitrary Python code from users (but this can also be quite useful, as a form of foreign function interface).
This might be a silly question, but how do you load and save data with mars?
With dask.array, the main APIs are
dask.array.from_array
anddask.array.store
. These allow for a great deal of flexibility, users can pass in arbitrary Python objects that support the__getitem__
and__setitem__
protocols.I am intrigued about the idea of testing with mars with xarray, but these APIs would be pretty essential for us. In xarray, we use these to support IO to various file formats (e.g., HDF5, netCDF, zarr).