Open stuartong opened 3 years ago
https://medium.com/@prayankkul27/which-one-should-i-use-apache-spark-or-dask-22ad4a20ab77
https://towardsdatascience.com/why-and-how-to-use-dask-with-big-data-746e34dac7c3
Help me clarify whether I understand this correctly: with PySpark we can use SQL queries, but with Dask it will all be in Python. Correct? I can try this, but Dask reads the file in CSV format, yeah?
@sheilavp yes, Dask is just another way of creating a data frame that's supposed to be faster (compared with pd.read_csv). You get a Dask data frame that's pretty similar to a pandas one, but from what I hear (Prof Brooks) it works until it doesn't, and then you switch back to pandas.
So if you try to open it with pandas (like accidentally double-clicking the large CSV) it's going to be a problem, but apparently Dask might be able to handle it. Not sure if we can read with Dask and then convert to pandas?
PySpark will be SQL queries and all the stuff from 516.
Alternatives
Suspect we might have to use a hybrid of approaches, but at least we can figure out what options we have at our disposal early.