[WIP] Comparing pyspark and dask for speed, memory/CPU usage, and ease of use

Testing plan/notes #1

ian-whitestone opened this issue 5 years ago

ian-whitestone commented 5 years ago

Testing Plan

Dummy Credit Card Application Dataset

Test 1

Leave the default schedulers
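A minimal sketch of what this test could look like, with both libraries left on their out-of-the-box defaults (the dataset path here is made up for illustration):

```python
import dask.dataframe as dd
from pyspark.sql import SparkSession

PATH = "data/credit_card_applications.parquet"  # hypothetical dataset location

# dask: dataframe collections default to the threaded scheduler,
# so no scheduler= argument is passed anywhere
ddf = dd.read_parquet(PATH)
dask_rows = len(ddf)

# pyspark: an untuned local session
spark = SparkSession.builder.master("local[*]").appName("test1").getOrCreate()
sdf = spark.read.parquet(PATH)
spark_rows = sdf.count()
```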

Modifications

Test 2 - Running Some Calcs
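As a sketch of what the calcs could be, a simple groupby aggregation exercises the shuffle machinery in both libraries (assuming made-up `card_type` and `amount` columns, and reusing `PATH` and `spark` from the Test 1 sketch):

```python
import dask.dataframe as dd
import pyspark.sql.functions as F

# dask: mean application amount per card type
dask_result = dd.read_parquet(PATH).groupby("card_type")["amount"].mean().compute()

# pyspark: the same aggregation, collected back to pandas for comparison
spark_result = (
    spark.read.parquet(PATH)
    .groupBy("card_type")
    .agg(F.mean("amount").alias("mean_amount"))
    .toPandas()
)
```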

Modifications

Test 3 - Running some Python UDFs
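A sketch of a pure-Python UDF applied in both libraries, reusing `sdf` and `ddf` from the sketches above (the `card_number` column and the masking function are made up):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def mask_card_number(number):
    # keep only the last 4 digits
    return "*" * 12 + str(number)[-4:]

# pyspark: rows get serialized out to Python worker processes and back
mask = udf(mask_card_number, StringType())
sdf = sdf.withColumn("card_number_masked", mask("card_number"))

# dask: element-wise apply; meta declares the output column name/dtype up front
ddf["card_number_masked"] = ddf["card_number"].apply(
    mask_card_number, meta=("card_number_masked", "object")
)
```

This is where the biggest gap should show up: plain pyspark UDFs pay per-row serialization costs between the JVM and Python, while dask is running Python end to end.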

Test 4 - Scaling on a Single Machine
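One way to run this: hold the workload fixed and sweep the worker count. A sketch for the dask side using the distributed scheduler's `LocalCluster` (the pyspark analogue is rerunning with `local[1]`, `local[2]`, etc.):

```python
import time

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# same aggregation as Test 2, timed at a few worker counts
for n_workers in (1, 2, 4, 8):
    with LocalCluster(n_workers=n_workers, threads_per_worker=1) as cluster:
        with Client(cluster):
            start = time.time()
            dd.read_parquet(PATH).groupby("card_type")["amount"].mean().compute()
            print(f"{n_workers} workers: {time.time() - start:.1f}s")
```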

NYC Taxi Public Dataset

Coming soon...

Sample ETL Workflows

ian-whitestone commented 5 years ago

Spark Things

Tweaking pyspark config:

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .setAppName("implicit_benchmark")
    .setMaster("local[*]")  # run locally with as many worker threads as logical cores
    .set("spark.driver.memory", "16G")  # must be set before the JVM starts
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
```

Spark repartitioning:
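A quick sketch of the two main knobs, `repartition` (full shuffle to the target count) and `coalesce` (merges existing partitions, no shuffle); the counts and output path here are arbitrary:

```python
sdf = spark.read.parquet(PATH)
print(sdf.rdd.getNumPartitions())  # partition count spark inferred from the files

# repartition() triggers a full shuffle to the requested count;
# useful to increase parallelism before a wide operation
sdf = sdf.repartition(200)

# coalesce() only merges existing partitions, avoiding a shuffle;
# handy for cutting the number of output files before a write
sdf.coalesce(8).write.parquet("output/applications", mode="overwrite")
```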

ian-whitestone commented 5 years ago

Dask Things

resources:

notes from talking to Martin Durant:

ian-whitestone commented 5 years ago

- Test this out: https://github.com/dask/dask-yarn/issues/28
- Make a PR for this: https://github.com/dask/dask/issues/4110

ian-whitestone commented 5 years ago

Raise issues for the following:

1) Raise a new issue in s3fs: when your AWS tokens haven't been refreshed, it silently returns no files instead of raising an error.

2) Invalid paths return confusing errors; they should return no files or an empty bag:

```python
>>> bag = dask.bag.read_avro(urlpath, storage_options = {'profile_name': AWS_PROFILE})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda3/envs/gandalf/lib/python3.6/site-packages/dask/bag/avro.py", line 103, in read_avro
    heads, sizes = zip(*out)
ValueError: not enough values to unpack (expected 2, got 0)
```

3) Converting an empty bag to a dataframe and then to pandas returns an empty tuple, and you can't detect this until you call compute(). There may not be a way to fix this. Check whether you can join empty dask dataframes, since if you convert to pandas first you can do a manual check there (see the sketch below).

https://github.com/dask/dask/issues/4321
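Until that's resolved, one defensive pattern (a sketch reusing `urlpath` and `AWS_PROFILE` from the traceback above): materialize the result and check its length before joining, since `len()` works on both the empty tuple and a real dataframe:

```python
import dask.bag

bag = dask.bag.read_avro(urlpath, storage_options={"profile_name": AWS_PROFILE})
ddf = bag.to_dataframe()

# an empty bag materializes as () instead of an empty dataframe,
# so check the concrete result before attempting any join
result = ddf.compute()
if len(result) == 0:
    print("no records found, skipping downstream joins")
```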