mars-project / mars

Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and Python functions.
https://mars-project.readthedocs.io
Apache License 2.0

Loading and saving external data #105

Open shoyer opened 5 years ago

shoyer commented 5 years ago

This might be a silly question, but how do you load and save data with mars?

With dask.array, the main APIs are dask.array.from_array and dask.array.store. These allow for a great deal of flexibility: users can pass in arbitrary Python objects that support the __getitem__ and __setitem__ protocols.

I am intrigued by the idea of testing mars with xarray, but these APIs would be pretty essential for us. In xarray, we use them to support IO to various file formats (e.g., HDF5, netCDF, zarr).
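
The protocol described above can be sketched without dask itself: any object exposing shape, ndim, dtype, __getitem__ and __setitem__ can serve as a source for dask.array.from_array or a target for dask.array.store. FileBackedArray below is a hypothetical toy store for illustration, not part of any library:

```python
class FileBackedArray:
    """Toy 1-D array-like store backed by a plain Python list.

    Stand-in for an object that reads/writes a file format such as
    HDF5, netCDF, or zarr. Only the protocol matters here.
    """

    def __init__(self, data):
        self._data = list(data)
        self.shape = (len(self._data),)
        self.ndim = 1
        self.dtype = "float64"

    def __getitem__(self, key):
        # from_array-style loading: the framework pulls one chunk
        # at a time with slice keys.
        return self._data[key]

    def __setitem__(self, key, value):
        # store-style saving: computed chunks are written back
        # through slice assignment.
        self._data[key] = value


store = FileBackedArray([0.0] * 4)
store[1:3] = [1.5, 2.5]   # what a store() call does per chunk
chunk = store[1:3]        # what a from_array() call does per chunk
```

Because only these protocols are required, existing file-format wrappers (h5py datasets, zarr arrays, and so on) plug in without format-specific code in the framework.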

qinxuye commented 5 years ago

Currently, mars.tensor.array and mars.tensor.asarray, the counterparts of numpy.array and numpy.asarray, are used to create a mars tensor from an array-like object, including a numpy ndarray.

Sadly, an IO interface is still absent in Mars, because we haven't come up with IO interfaces that support various file formats while staying compatible with the numpy ones.

dask.array.from_array and dask.array.store look great and may be a good solution; we will study them first and figure out how to add an IO interface to Mars.

Discussion is welcome in this thread.

qinxuye commented 5 years ago

I want to add some background: for the internal projects that use Mars, we write the interface like mars.tensor.read_files(distributed_storage_directory). This requires users to write a MapReduce-like program that processes the data into different directories in the distributed storage; each directory contains files in parquet or some other columnar storage format, stored with an i, j, k, value schema.
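
To make that layout concrete, here is a hedged, pure-Python sketch of how rows in the i, j, k, value schema could be assembled into a dense block; assemble_block is an illustrative name, not an actual Mars function, and a real reader would decode parquet rather than take in-memory tuples:

```python
def assemble_block(records, shape):
    """Build a dense 3-D nested-list tensor from (i, j, k, value) rows.

    records: iterable of (i, j, k, value) coordinate/value tuples,
             as stored in one directory's columnar files.
    shape:   (ni, nj, nk) extent of the block; absent coordinates
             default to 0.0, i.e. the layout is sparse (COO-style).
    """
    ni, nj, nk = shape
    block = [[[0.0] * nk for _ in range(nj)] for _ in range(ni)]
    for i, j, k, value in records:
        block[i][j][k] = value
    return block


# Two rows of the i, j, k, value schema for a 2x2x2 block.
records = [(0, 0, 1, 3.5), (1, 1, 0, -2.0)]
block = assemble_block(records, (2, 2, 2))
```

One directory per chunk makes the mapping from storage layout to tensor chunks trivial, but it is exactly the format-specific preprocessing burden the comment describes.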

We were stuck on how to provide a commonly used interface, which needs to be as compatible as possible.

shoyer commented 5 years ago

Indeed, file formats for multi-dimensional arrays are challenging. I think the main virtue of dask's approach here is that it's quite flexible and allows reusing existing code for working with files. You would not want to write specialized support for reading/writing only a single file format (users tend to have strong feelings about file formats).

The downside is that distributed workers now need to know how to run arbitrary Python code from users (but this can also be quite useful, as a form of foreign function interface).
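
A minimal sketch of that point: shipping a user's loader to a worker amounts to serializing an arbitrary callable and executing it remotely. Real systems (dask uses cloudpickle, for instance) serialize function objects rather than source text; sending source and exec'ing it, as below, is just the simplest way to show the same trust-versus-flexibility tradeoff:

```python
# "Scheduler" side: the user's loader travels over the wire. Here it is
# plain source text; user_loader is a stand-in for "open an HDF5/netCDF/
# zarr file and return one chunk".
user_code = """
def user_loader(path):
    return [ord(c) for c in path]
"""

# "Worker" side: execute the received code in a fresh namespace and call
# it. The worker knows nothing about the function in advance, which is
# the flexibility (and the security concern) being discussed.
namespace = {}
exec(user_code, namespace)
chunk = namespace["user_loader"]("abc")
```

This is why the arbitrary-code requirement doubles as a foreign function interface: once workers can run any user callable for IO, they can run any user callable at all.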