Open brightcoder01 opened 4 years ago
Thanks to @brightcoder01 for raising this question.
I reorganized the comparison in the above comment into a table.
| | Dask | Mars |
|---|---|---|
| Kubernetes | Yes | No |
| MaxCompute | No | Yes |
It seems that the question is: when a user writes `SELECT ... TO RUN a_python_func`, s/he might want the Python function executed in parallel to process large datasets. Should SQLFlow take care of the parallel execution using Mars/Dask/Ray/etc., or should the author of `a_python_func` take care of the execution by calling the Mars/Dask/Kubernetes API to submit a parallel job?

I think the answer is to let the author of `a_python_func` choose among Mars, Dask, Ray, the Kubernetes API, etc.
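To illustrate what "let the author choose" could look like, here is a minimal sketch. It is not SQLFlow code: the `BACKENDS` registry, the `backend` parameter, and the body of `a_python_func` are all hypothetical, and the standard library's `ThreadPoolExecutor` only stands in for a real engine such as Dask, Mars, or Ray.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_serial(fn: Callable, chunks: List[list]) -> list:
    # Baseline backend: process chunks one after another.
    return [fn(chunk) for chunk in chunks]


def run_threaded(fn: Callable, chunks: List[list], max_workers: int = 4) -> list:
    # Stand-in for a distributed engine (assumption: a real function
    # would submit the chunks to Dask/Mars/Ray/Kubernetes here).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, chunks))


# Hypothetical registry: the function author decides which backends exist.
BACKENDS: Dict[str, Callable] = {"serial": run_serial, "threads": run_threaded}


def sum_of_squares(chunk: List[int]) -> int:
    return sum(x * x for x in chunk)


def a_python_func(rows: List[int], backend: str = "threads", chunk_size: int = 3) -> int:
    """Hypothetical TO RUN target: sums the squares of all rows."""
    # Split the input into chunks that can be processed independently.
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    partials = BACKENDS[backend](sum_of_squares, chunks)
    return sum(partials)
```

With this shape, SQLFlow would only need to call `a_python_func`; how the work is parallelized stays an implementation detail of the function.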
Notes from the discussion:

The SQLFlow `TO RUN` clause calls a Python function to do the data processing/computing, for example using TSFresh to extract features from time series data.

There are two options for large-scale data computing with Python:
- Dask
  - MaxCompute support: ×
  - Kubernetes support: √ (link)
  - TSFresh integration: Dask is the officially supported distributed computing backend for TSFresh. (link)
- Mars
  - MaxCompute support: √
  - Kubernetes support: √ (link)
    - Verifying with Minikube: downloading the image is too slow. As a first step, we can build the image locally on our Mac. (link)
  - TSFresh integration: no pre-made solution; it needs development from the Mars team (issue).
A comparison between Dask and Mars: issue
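To make the time-series use case concrete, the sketch below shows the general shape of the work a `TO RUN` function would do: group rows by series id, then compute features for each series independently, which is what makes the job easy to parallelize. It does not use TSFresh, Dask, or Mars; the feature set (mean, min, max), the row layout, and every name here are hypothetical stand-ins.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int, float]  # assumed layout: (series_id, timestamp, value)


def series_features(values: List[float]) -> Dict[str, float]:
    # Toy feature set; a real extractor would compute many more features.
    return {"mean": mean(values), "min": min(values), "max": max(values)}


def extract_features_by_id(rows: Iterable[Row], max_workers: int = 2) -> Dict[str, Dict[str, float]]:
    # Group values by series id, ordered by timestamp within each series.
    grouped: Dict[str, List[float]] = defaultdict(list)
    for series_id, _timestamp, value in sorted(rows, key=lambda r: (r[0], r[1])):
        grouped[series_id].append(value)

    ids = list(grouped)
    # Each series is independent, so the per-series work can be farmed out.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        features = list(pool.map(series_features, (grouped[i] for i in ids)))
    return dict(zip(ids, features))
```

For example, `extract_features_by_id([("a", 1, 1.0), ("a", 2, 3.0), ("b", 1, 5.0)])` yields one feature dictionary per series id. In a real deployment the thread pool would be replaced by whichever engine the function author picks.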