modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0
9.9k stars 653 forks

Multi-dimensional named arrays like xarray? #1506

Open Hoeze opened 4 years ago

Hoeze commented 4 years ago

It would be cool to have Modin for multi-dimensional named arrays, just like xarray. Would it be possible to implement this feature, or is it too far from being feasible?

devin-petersohn commented 4 years ago

Hi @Hoeze, thanks for the question!

There are some things we have in the works for arrays, but we don't have a timeline for them yet. For labeled arrays, we could potentially build an xarray interface on top of the existing implementation. It is something interesting to consider.

In my experience, xarray is very good and scales quite well. Is there a reason you would like to see it in Modin?

Hoeze commented 4 years ago

I ran into some issues with Dask parallelization: my queries were effectively running single-threaded, regardless of the scheduler. I wanted to try Ray because of its shared-memory object support. My theory is that, in my case, Dask has to serialize and deserialize every object too often.

In the end, I'm searching for a Python-based alternative to Spark that supports all kinds of Arrow types and scales well.

RichardScottOZ commented 1 year ago

@RehanSD

RehanSD commented 1 year ago

Hi @RichardScottOZ! Is this something you'd be interested in collaborating on?