Multi-dimensional named arrays like xarray?

Hoeze commented 4 years ago

It would be cool to have Modin for multi-dimensional arrays just like xarray. Would it be possible to implement this feature or is this just too far from being doable?

devin-petersohn commented 4 years ago

Hi @Hoeze, thanks for the question!

There are some things we have in the works for arrays, but we don't have a timeline on it yet. For labeled arrays, we could potentially implement an xarray interface on the existing implementation. It is something interesting to consider.

In my experience xarray is very good and scales quite well, is there a reason you would like to see it in Modin?

Hoeze commented 4 years ago

I experienced some issues with Dask parallelization as my queries were running effectively single-threaded independent from the scheduler. I wanted to try Ray because of the shared-memory object support. My theory is that Dask has to serialize/deserialize every object just too often in my case.

In the end, I'm searching for a python-based alternative to Spark that supports all kinds of Arrow types and scales well.

Dask does not scale that well over multiple cores because of (I guess) no shared memory objects
Ray itself is just too bare metal, it has no abstractions for arrays, bags, dataframes like Dask
Modin performed great but supports only DataFrames

RichardScottOZ commented 1 year ago

@RehanSD

RehanSD commented 1 year ago

Hi @RichardScottOZ! Is this something you'd be interested in collaborating on?

modin-project / modin

Multi-dimensional named arrays like xarray? #1506