rapidsai-community / notebooks-contrib

RAPIDS Community Notebooks
Apache License 2.0

PyArrow error running the mortgage_e2e notebook #61

Open rohitrawat opened 5 years ago

rohitrawat commented 5 years ago

My workers crash when persisting the gpu_df right before converting to DMatrix:

gpu_dfs = [(gpu_df[0].persist(), gpu_df[1].persist()) for gpu_df in gpu_dfs]
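
For context, `persist()` triggers computation of each lazy Dask collection on the workers and returns a handle to the in-memory result, so the whole list materializes at once. A rough stdlib analogy of that pattern using `concurrent.futures` (the real call is Dask-specific; everything here is illustrative, not the notebook's actual data):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for lazy (features, labels) pairs awaiting computation.
def make_pair(i):
    return (lambda: list(range(i)), lambda: [i] * i)

pairs = [make_pair(i) for i in range(1, 4)]

with ThreadPoolExecutor() as pool:
    # Mirrors: gpu_dfs = [(gpu_df[0].persist(), gpu_df[1].persist()) for gpu_df in gpu_dfs]
    # Each half of the pair is submitted for execution and its result kept in memory.
    persisted = [(pool.submit(x), pool.submit(y)) for x, y in pairs]
    results = [(x.result(), y.result()) for x, y in persisted]

print(results[0])  # ([0], [1])
```

The point of the analogy: persisting everything at once pins all the materialized results in memory simultaneously, which is why this step is where memory pressure shows up.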

A sample error from one worker is given below. I read it as an Arrow deserialization error. The "got" byte count is identical across all workers at 67097200, which is about 64 MiB. The workers eventually stop restarting after failing repeatedly. My setup is a two-node cluster with 4 V100 32GB GPUs each.

distributed.worker - ERROR - Expected to be able to read 1605658336 bytes for message body, got 67097200
Traceback (most recent call last):
  File "/conda/envs/rapids/lib/python3.6/site-packages/distributed/worker.py", line 2290, in execute
    data[k] = self.data[k]
  File "/conda/envs/rapids/lib/python3.6/site-packages/zict/buffer.py", line 70, in __getitem__
    return self.slow_to_fast(key)
  File "/conda/envs/rapids/lib/python3.6/site-packages/zict/buffer.py", line 57, in slow_to_fast
    value = self.slow[key]
  File "/conda/envs/rapids/lib/python3.6/site-packages/zict/func.py", line 39, in __getitem__
    return self.load(self.d[key])
  File "/conda/envs/rapids/lib/python3.6/site-packages/distributed/protocol/serialize.py", line 392, in deserialize_bytes
    return deserialize(header, frames)
  File "/conda/envs/rapids/lib/python3.6/site-packages/distributed/protocol/serialize.py", line 190, in deserialize
    return loads(header, frames)
  File "/conda/envs/rapids/lib/python3.6/site-packages/distributed/protocol/serialize.py", line 56, in dask_loads
    return loads(header, frames)
  File "/conda/envs/rapids/lib/python3.6/site-packages/distributed/protocol/arrow.py", line 49, in deserialize_table
    return reader.read_all()
  File "pyarrow/ipc.pxi", line 290, in pyarrow.lib._RecordBatchReader.read_all
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Expected to be able to read 1605658336 bytes for message body, got 67097200
distributed.core - INFO - Event loop was unresponsive in Worker for 7.40s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f258473e080>>, <Future finished exception=ArrowIOError('Expected to be able to read 1605658336 bytes for message body, got 67097200',)>)
Traceback (most recent call last):
  File "/conda/envs/rapids/lib/python3.6/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/conda/envs/rapids/lib/python3.6/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/conda/envs/rapids/lib/python3.6/site-packages/tornado/gen.py", line 209, in wrapper
    yielded = next(result)
  File "/conda/envs/rapids/lib/python3.6/site-packages/distributed/worker.py", line 2290, in execute
    data[k] = self.data[k]
  File "/conda/envs/rapids/lib/python3.6/site-packages/zict/buffer.py", line 70, in __getitem__
    return self.slow_to_fast(key)
  File "/conda/envs/rapids/lib/python3.6/site-packages/zict/buffer.py", line 57, in slow_to_fast
    value = self.slow[key]
  File "/conda/envs/rapids/lib/python3.6/site-packages/zict/func.py", line 39, in __getitem__
    return self.load(self.d[key])
  File "/conda/envs/rapids/lib/python3.6/site-packages/distributed/protocol/serialize.py", line 190, in deserialize
    return loads(header, frames)
  File "/conda/envs/rapids/lib/python3.6/site-packages/distributed/protocol/serialize.py", line 56, in dask_loads
    return loads(header, frames)
  File "/conda/envs/rapids/lib/python3.6/site-packages/distributed/protocol/arrow.py", line 49, in deserialize_table
    return reader.read_all()
  File "pyarrow/ipc.pxi", line 290, in pyarrow.lib._RecordBatchReader.read_all
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Expected to be able to read 1605658336 bytes for message body, got 67097200
ayushdg commented 5 years ago

The reason for this error is insufficient host memory (CPU RAM). An intermediate step of the notebook writes the ETL results out to Arrow in host memory and then reads a subset back onto the GPU for XGBoost training.
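
The error message itself is characteristic of length-prefixed framing: a header announces the body size, and the reader then finds fewer bytes than promised because the buffer was cut short (here, by the host running out of memory mid-write). A stdlib sketch of that failure mode, analogous in shape to Arrow's IPC message format but not Arrow's actual wire layout:

```python
import io
import struct

def write_message(buf, payload: bytes):
    # Length-prefixed framing: an 8-byte little-endian size, then the body.
    buf.write(struct.pack("<q", len(payload)))
    buf.write(payload)

def read_message(buf):
    (expected,) = struct.unpack("<q", buf.read(8))
    body = buf.read(expected)
    if len(body) != expected:
        # Same shape as the ArrowIOError seen in the traceback above.
        raise IOError(
            f"Expected to be able to read {expected} bytes for message body, "
            f"got {len(body)}"
        )
    return body

stream = io.BytesIO()
write_message(stream, b"x" * 100)

# An intact buffer round-trips cleanly.
assert read_message(io.BytesIO(stream.getvalue())) == b"x" * 100

# A buffer truncated mid-body (as if the write was cut off) fails the same way.
err = ""
try:
    read_message(io.BytesIO(stream.getvalue()[:50]))
except IOError as e:
    err = str(e)
print(err)  # → Expected to be able to read 100 bytes for message body, got 42
```

This is why the "got" count is the same on every worker in the original report: each worker sees the same truncated intermediate buffer.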

vaibhavsingh007 commented 4 years ago

Hi @ayushdg, I'm facing a similar issue.

OSError: Expected to be able to read 161256 bytes for message body, got 144086

Do we need to increase driver or executor memory? The cluster has 1.6 TB.

ayushdg commented 4 years ago

@vaibhavsingh007 A couple of questions:

lpyhdzx commented 4 years ago

I used to hit the same issue when using the `datasets` library, with which one can download datasets. I fixed the problem by deleting the ".lock" file generated by the previous run. The exception occurred because the previous download never actually completed. Hope this helps.
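
A minimal stdlib sketch of that cleanup, for anyone hitting this: it walks a cache directory and deletes leftover `.lock` files from an interrupted download. The directory layout below is illustrative only, not the exact `datasets` cache structure.

```python
import tempfile
from pathlib import Path

def remove_stale_locks(cache_dir):
    """Delete leftover .lock files (from an interrupted download) under cache_dir."""
    removed = []
    for lock in Path(cache_dir).rglob("*.lock"):
        lock.unlink()
        removed.append(lock.name)
    return sorted(removed)

# Demo against a throwaway directory standing in for the datasets cache.
demo = Path(tempfile.mkdtemp())
(demo / "metrics").mkdir()
(demo / "metrics" / "download.lock").touch()  # stale lock from a failed run
(demo / "metrics" / "data.arrow").touch()     # real data, must survive

removed = remove_stale_locks(demo)
print(removed)  # → ['download.lock']
```

Only do this when no download process is still running, since a live process may legitimately hold the lock.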

dss010101 commented 1 year ago

I used to hit the same issue when using the `datasets` library, with which one can download datasets. I fixed the problem by deleting the ".lock" file generated by the previous run. The exception occurred because the previous download never actually completed. Hope this helps.

What lock file are you referring to? I'm seeing the same issue.

ashokrajab commented 1 year ago

What lock file are you referring to? I'm seeing the same issue.

@dss010101 check under the path $HOME/cache/huggingface/metrics/dummy_metric/default