holoviz / datashader

Quickly and accurately render even the largest data.
http://datashader.org
BSD 3-Clause "New" or "Revised" License

Support Spark Dataframe #1169

JohnDietrich-Pepper closed this issue 8 months ago

JohnDietrich-Pepper commented 1 year ago

Datashader is a fantastic library for big data visualization, but unfortunately it doesn't support Spark, which has become the standard for big data processing. While Datashader does support Dask, Dask doesn't handle datasets beyond 1 TB as well as Spark does, nor is it a default pool option on Databricks, Synapse, Glue, etc. It appears that adding Spark support would be fairly easy, but I don't know enough to contribute to the repository myself.

ianthomas23 commented 1 year ago

I am not aware of any existing contributors who are working on this or who have particular interest in it. However, we are always very happy to receive contributions to add new functionality or to extend support to cover new libraries.

> It appears that adding Spark support would be fairly easy, but I don't know enough to contribute to the repository myself.

In general, if you don't consider yourself qualified to contribute, then you probably shouldn't consider yourself qualified to assert how easy it would be for someone else to contribute.

jbednar commented 8 months ago

Dask is commonly used for petabyte-scale datasets in the climate-science community, though you do need to be careful to choose a large enough chunk size in that case. Spark is an older de facto standard that I think is out of scope for Datashader, though, as mentioned, if someone wants to work on that, go for it!
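For anyone weighing up the work discussed in this thread, here is a minimal sketch of the general pattern a partitioned backend would likely need: bin each partition into a fixed-size grid of counts, then merge the partial grids by element-wise addition. This is an illustration in plain Python, not Datashader's implementation. The names `count_partition`, `combine`, and `rasterize` are hypothetical and are not part of Datashader, Dask, or Spark.

```python
from functools import reduce


def count_partition(points, x_range, y_range, width, height):
    """Bin one partition of (x, y) points into a height x width grid of counts."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # skip points outside the canvas
        # Map data coordinates to pixel indices; clamp the upper edge into the last bin.
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid


def combine(a, b):
    """Merge two partial count grids by element-wise addition."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def rasterize(partitions, x_range, y_range, width, height):
    """Aggregate every partition independently, then reduce the partial grids."""
    partials = (
        count_partition(p, x_range, y_range, width, height) for p in partitions
    )
    empty = [[0] * width for _ in range(height)]
    return reduce(combine, partials, empty)


# Two hypothetical partitions standing in for chunks of a distributed dataframe.
partitions = [
    [(0.1, 0.1), (0.9, 0.9)],
    [(0.1, 0.2), (0.6, 0.4), (2.0, 2.0)],  # the last point lies outside the canvas
]
grid = rasterize(partitions, (0.0, 1.0), (0.0, 1.0), width=2, height=2)
print(grid)  # [[2, 1], [0, 1]]
```

Because each partial grid has a fixed size that does not depend on the number of input rows, and addition is associative, the combine step can run in any order. The same shape would presumably map onto a per-partition function followed by a reduce in Spark, though that mapping is an assumption and has not been tested here.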