ian-whitestone / pyspark-vs-dask

[WIP] Comparing pyspark and dask for speed, memory/CPU usage, and ease of use

Questions #5

Open ian-whitestone opened 5 years ago

ian-whitestone commented 5 years ago

r5.xlarge: running out of disk space, despite having a 50GB EBS volume and 36GB RAM, when calling cnt = cnt.compute(num_workers=10):

(dask) ubuntu@XXXXX:~/pyspark-vs-dask/scripts/test2$ python dsk_multi_df_filter_cnt.py
2018-10-21 03:08:22,523|INFO|logger: START: Creating dask bag 1
2018-10-21 03:09:26,090|INFO|logger: FINISH: Dask bag1 created
2018-10-21 03:09:26,090|INFO|logger: START: Creating dask dataframe 1
2018-10-21 03:09:51,815|INFO|logger: FINISH: Dask dataframe 1 created
2018-10-21 03:09:51,815|INFO|logger: START: Creating dask bag 2
2018-10-21 03:10:56,966|INFO|logger: FINISH: Dask bag2 created
2018-10-21 03:10:56,966|INFO|logger: START: Creating dask dataframe 2
2018-10-21 03:11:17,154|INFO|logger: FINISH: Dask dataframe 2 created
2018-10-21 03:11:17,154|INFO|logger: START: Joining dataframes
2018-10-21 03:11:29,696|INFO|logger: FINISH: Finished joining dataframes
2018-10-21 03:11:29,696|INFO|logger: START: Starting count
Traceback (most recent call last):
  File "dsk_multi_df_filter_cnt.py", line 89, in <module>
    cnt = cnt.compute(num_workers=10)
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/base.py", line 156, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/base.py", line 395, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/threaded.py", line 75, in get
    pack_exception=pack_exception, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/local.py", line 501, in get_async
    raise_exception(exc, tb)
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/compatibility.py", line 112, in reraise
    raise exc
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/local.py", line 272, in execute_task
    result = _execute_task(task, data)
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/local.py", line 253, in _execute_task
    return func(*args2)
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/dataframe/shuffle.py", line 479, in shuffle_group_3
    p.append(d, fsync=True)
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/partd/encode.py", line 25, in append
    self.partd.append(data, **kwargs)
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/partd/buffer.py", line 45, in append
    self.flush(keys)
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/partd/buffer.py", line 99, in flush
    self.slow.append(dict(zip(keys, self.fast.get(keys))))
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/partd/file.py", line 46, in append
    os.fsync(f)
OSError: [Errno 28] No space left on device

Seems like I still have disk space available... where is it writing files to? (screenshot taken 2018-10-22, 8:52 am)

Maybe it cleans up the files afterwards. Here is another test, running on a c5.9xlarge with a 100GB EBS volume; this disk usage snapshot was taken mid-test:

Filesystem      Size  Used Avail Use% Mounted on
udev             35G     0   35G   0% /dev
tmpfs           6.9G  289M  6.6G   5% /run
/dev/nvme0n1p1   97G   52G   46G  54% /
tmpfs            35G     0   35G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            35G     0   35G   0% /sys/fs/cgroup
none             35G     0   35G   0% /run/shm
/dev/loop0       88M   88M     0 100% /snap/core/5328
tmpfs           6.9G     0  6.9G   0% /run/user/1000

https://github.com/dask/dask/issues/1659

Cleaning up files after a force-killed job: rm -rf /tmp/tmpw4uytluo.partd
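The traceback above ends in partd/file.py, and the leftover directory from the killed job was under /tmp with a .partd suffix. Here is a minimal sketch, standard library only, that reports free space on the temp filesystem and the size of any leftover *.partd directories. The function names dir_size_bytes and report_partd_dirs are hypothetical, and I'm assuming the spill directories live directly under the system temp dir, as the one above did.

```python
import shutil
import tempfile
from pathlib import Path


def dir_size_bytes(path):
    """Sum the sizes of regular files under `path` (symlinks are skipped)."""
    total = 0
    for p in Path(path).rglob("*"):
        try:
            if p.is_file() and not p.is_symlink():
                total += p.stat().st_size
        except OSError:
            # Files can disappear while a job is still running.
            pass
    return total


def report_partd_dirs(tmp_root=None):
    """Return (free_bytes, {partd_dir: size_bytes}) for the temp directory.

    Assumption: spill directories are named '*.partd' and sit directly
    under `tmp_root` (defaults to the system temp dir, e.g. /tmp).
    """
    root = Path(tmp_root or tempfile.gettempdir())
    free = shutil.disk_usage(root).free
    leftovers = {
        str(d): dir_size_bytes(d) for d in root.glob("*.partd") if d.is_dir()
    }
    return free, leftovers


if __name__ == "__main__":
    free, leftovers = report_partd_dirs()
    print(f"free bytes on temp filesystem: {free}")
    for name, size in leftovers.items():
        print(f"{name}: {size} bytes")
```

Running this in a loop while the job is going should show whether the shuffle is what fills the disk mid-run, which a single df snapshot can miss.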

ian-whitestone commented 5 years ago

Dask distributed memory consumption

Coming soon...

ian-whitestone commented 5 years ago

https://stackoverflow.com/questions/268680/how-can-i-monitor-the-thread-count-of-a-process-on-linux

ps huH p 11852 | wc -l

sudo apt install htop
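For checking the thread count from inside a test script rather than from the shell, a rough Python equivalent of the ps command above is to count the entries in /proc/<pid>/task. This is a sketch that assumes Linux with /proc mounted; count_threads is a hypothetical helper name.

```python
import os


def count_threads(pid):
    """Number of threads in process `pid`, read from /proc/<pid>/task.

    Linux only: each thread of the process appears as one entry there.
    """
    return len(os.listdir(f"/proc/{pid}/task"))


if __name__ == "__main__":
    # Example: thread count of the current Python process.
    print(count_threads(os.getpid()))
```

Like ps huH p <pid> | wc -l, this counts one entry per thread, so the two numbers should be comparable for the same process.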

ian-whitestone commented 5 years ago

Questions for Martin