stuartong / streetsofnyc

MADS Milestone 1 - Sheila/Moutaz/Stuart

Dealing with large data (pyspark/sql/dask?) #2

Open stuartong opened 3 years ago

stuartong commented 3 years ago

Alternatives

  1. Create DBs and use SQL - already up and running; see Stuart's sample notebook and database (a quick sketch of this approach follows below)
  2. Use pyspark - hard to set up; still working on getting it right (have reached out to Prof Teplovs)
  3. I hear dask is fast - anyone want to give it a go?

I suspect we might have to use a hybrid of approaches - but at least we'll figure out early what options we have at our disposal
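For option 1, roughly this kind of thing - a minimal sketch only, assuming a local SQLite file and a table name (`streets.db` and `street_data` are placeholders; adjust to whatever Stuart's notebook actually uses):

```python
import sqlite3
import pandas as pd

# Connect to the local SQLite database (hypothetical filename)
conn = sqlite3.connect("streets.db")

# Push the heavy filtering/aggregation into SQL so pandas only
# ever sees the (much smaller) result set
query = """
    SELECT borough, COUNT(*) AS n_rows
    FROM street_data
    GROUP BY borough
"""
df = pd.read_sql_query(query, conn)
conn.close()
```

The point being that the full dataset stays on disk and only the aggregated result ever hits memory.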

sheilavp commented 3 years ago

https://medium.com/@prayankkul27/which-one-should-i-use-apache-spark-or-dask-22ad4a20ab77

https://towardsdatascience.com/why-and-how-to-use-dask-with-big-data-746e34dac7c3

Help me clarify whether I understand correctly: with pyspark we can use SQL queries, but with dask it will be in Python, correct? I can try this, but dask reads the file in CSV format, yeah?

stuartong commented 3 years ago

@sheilavp yes, dask is just another way of creating a data frame that's supposed to be faster (compared with pd.read_csv). You get a dask data frame that's pretty similar to pandas, but from what I hear (Prof Brooks), it works until it doesn't, and then you switch back to pandas.

So if you try to open it with pandas (like accidentally double-clicking the large csv) it's going to be a problem, but apparently dask might be able to handle it - not sure if we can read with dask and then convert to pandas?
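For what it's worth, dask does support exactly that: `.compute()` runs the lazy graph and hands back an ordinary pandas DataFrame. A minimal sketch (the filename and column are made up):

```python
import dask.dataframe as dd

# Lazily scan the large CSV in partitions - nothing is loaded yet
ddf = dd.read_csv("large_file.csv")

# Filter with the familiar pandas-style API, still lazy
subset = ddf[ddf["borough"] == "Manhattan"]

# .compute() executes the graph and returns a regular pandas DataFrame,
# so only the reduced subset has to fit in memory
pdf = subset.compute()
```

So the usual pattern is: reduce with dask first, then convert only the small result to pandas.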

stuartong commented 3 years ago

Pyspark will be SQL queries and all the stuff from 516.
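Rough sketch of what that would look like, assuming we get a local Spark session working (file, view, and column names below are placeholders):

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName("streetsofnyc").getOrCreate()

# Read the CSV and register it as a temp view so it can be queried with SQL
df = spark.read.csv("large_file.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("street_data")

# Plain SQL from there, like in 516
result = spark.sql("""
    SELECT borough, COUNT(*) AS n_rows
    FROM street_data
    GROUP BY borough
""")
result.show()
```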