stuartong / streetsofnyc

MADS Milestone 1 - Sheila/Moutaz/Stuart

Dealing with large data (pyspark/sql/dask?) #2

Open stuartong opened 3 years ago

stuartong commented 3 years ago

Alternatives

  1. Create DBs and use SQL - already up and running; see Stuart's sample notebook and database (a quick sketch of this approach follows below)
  2. Use pyspark - hard to set up; still working on getting it right (have reached out to Prof Teplovs)
  3. I hear dask is fast - anyone want to give it a go?

I suspect we might have to use a hybrid of approaches - but at least we'll figure out early what options we have at our disposal
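For option 1, roughly this kind of thing - a minimal sketch only, assuming a local SQLite file and a table name (`streets.db` and `street_data` are placeholders; adjust to whatever Stuart's notebook actually uses):

```python
import sqlite3
import pandas as pd

# Connect to the local SQLite database (hypothetical filename)
conn = sqlite3.connect("streets.db")

# Push the heavy filtering/aggregation into SQL so pandas only
# ever sees the (much smaller) result set
query = """
    SELECT borough, COUNT(*) AS n_rows
    FROM street_data
    GROUP BY borough
"""
df = pd.read_sql_query(query, conn)
conn.close()
```

The point being that the full dataset stays on disk and only the aggregated result ever hits memory.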

sheilavp commented 3 years ago

https://medium.com/@prayankkul27/which-one-should-i-use-apache-spark-or-dask-22ad4a20ab77

https://towardsdatascience.com/why-and-how-to-use-dask-with-big-data-746e34dac7c3

Help me clarify whether I understand correctly: with pyspark we can use SQL queries, but with dask it will be in Python, correct? I can try this, but dask reads the file in CSV format, yeah?

stuartong commented 3 years ago

@sheilavp yes, dask is just another way of creating a data frame that's supposed to be faster (compared with pd.read_csv). You get a dask data frame that's pretty similar to pandas, but from what I hear (Prof Brooks), it works until it doesn't, and then you switch back to pandas.

So if you try to open it with pandas (like accidentally double-clicking the large csv) it's going to be a problem, but apparently dask might be able to handle it - not sure if we can read with dask and then convert to pandas?
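For what it's worth, dask does support exactly that: `.compute()` runs the lazy graph and hands back an ordinary pandas DataFrame. A minimal sketch (the filename and column are made up):

```python
import dask.dataframe as dd

# Lazily scan the large CSV in partitions - nothing is loaded yet
ddf = dd.read_csv("large_file.csv")

# Filter with the familiar pandas-style API, still lazy
subset = ddf[ddf["borough"] == "Manhattan"]

# .compute() executes the graph and returns a regular pandas DataFrame,
# so only the reduced subset has to fit in memory
pdf = subset.compute()
```

So the usual pattern is: reduce with dask first, then convert only the small result to pandas.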

stuartong commented 3 years ago

Pyspark will be SQL queries and all the stuff from 516.
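Rough sketch of what that would look like, assuming we get a local Spark session working (file, view, and column names below are placeholders):

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName("streetsofnyc").getOrCreate()

# Read the CSV and register it as a temp view so it can be queried with SQL
df = spark.read.csv("large_file.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("street_data")

# Plain SQL from there, like in 516
result = spark.sql("""
    SELECT borough, COUNT(*) AS n_rows
    FROM street_data
    GROUP BY borough
""")
result.show()
```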