Open · rohitrawat opened this issue 5 years ago
The reason for this error is insufficient host memory (CPU RAM). An intermediate step of the notebook writes the ETL results out to Arrow on host memory and then reads a subset back onto the GPU for XGBoost training.
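For reference, the pattern is roughly the following (a simplified sketch with made-up column names and sizes, not the notebook's exact code):

```python
import numpy as np
import pandas as pd
import pyarrow as pa
import cudf
import xgboost as xgb

# Pretend this is the ETL output, materialized in host (CPU) memory.
etl_result = pd.DataFrame({
    "feature_a": np.random.rand(1_000_000),
    "feature_b": np.random.rand(1_000_000),
    "label": np.random.randint(0, 2, 1_000_000),
})
host_table = pa.Table.from_pandas(etl_result)  # Arrow table lives in CPU RAM

# Only a slice is moved onto the GPU; the full table stays on the host,
# so the host needs enough free RAM to hold the entire Arrow table.
subset = host_table.slice(0, 100_000)
gpu_df = cudf.DataFrame.from_arrow(subset)

dtrain = xgb.DMatrix(gpu_df[["feature_a", "feature_b"]], label=gpu_df["label"])
booster = xgb.train({"tree_method": "gpu_hist", "objective": "binary:logistic"},
                    dtrain, num_boost_round=10)
```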
Hi @ayushdg, I'm facing a similar issue:
OSError: Expected to be able to read 161256 bytes for message body, got 144086
Do we need to increase driver or executor memory? The cluster has 1.6 TB.
@vaibhavsingh007 Couple of questions:
Are you using the latest version of the notebook? (there have been a few updates recently)
The current script is set up for a single-node cluster only; you might have to change the configuration when starting a cluster across multiple nodes (see the rough sketch below).
If a single node has 1.6T of host memory, you should not be seeing memory issues. Could you share more details about the environment?
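Roughly, the difference looks like this (a sketch assuming dask-cuda; the scheduler address and memory limits are placeholders):

```python
# Single node: LocalCUDACluster starts one worker per local GPU.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(device_memory_limit="28GB")  # keep per-GPU usage under 32 GB
client = Client(cluster)

# Multiple nodes: run a scheduler on one host, a dask-cuda-worker on every node,
# and connect to the scheduler instead of creating a local cluster, e.g.
#   host0$ dask-scheduler
#   hostN$ dask-cuda-worker tcp://host0:8786 --device-memory-limit 28GB
# client = Client("tcp://host0:8786")
```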
I used to run into the same issue when using the datasets library (which lets you download datasets). I fixed it by deleting the ".lock" file generated the previous time; the exception occurred because the download hadn't actually completed. Hope this helps.
What lock file are you referring to? Seeing the same issue.
@dss010101 check under the path $HOME/cache/huggingface/metrics/dummy_metric/default
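If it helps, a quick way to clear any stale lock files (this assumes the usual ~/.cache/huggingface cache root; adjust the path to your setup and make sure no download is currently running):

```python
from pathlib import Path

# Remove stale .lock files left behind by an interrupted download.
cache_root = Path.home() / ".cache" / "huggingface"
for lock_file in cache_root.rglob("*.lock"):
    print(f"removing {lock_file}")
    lock_file.unlink()
```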
My workers crash when persisting the gpu_df right before converting it to a DMatrix.
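The step in question looks roughly like this (simplified sketch, not my exact code; the scheduler address, paths, and column names are placeholders):

```python
import dask_cudf
import xgboost as xgb
from dask.distributed import Client, wait

client = Client("tcp://scheduler:8786")  # scheduler of the 2-node, 8-GPU cluster

gpu_df = dask_cudf.read_parquet("etl_output/*.parquet")
gpu_df = gpu_df.persist()  # workers start crashing around here
wait(gpu_df)

dtrain = xgb.dask.DaskDMatrix(client,
                              gpu_df.drop(columns=["label"]),
                              label=gpu_df["label"])
output = xgb.dask.train(client, {"tree_method": "gpu_hist"}, dtrain,
                        num_boost_round=100)
```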
A sample error from a worker is given below; I read it as an Arrow deserialization error. The "got" byte count is identical across all workers at 67097200, which is around 64 MB. The workers eventually stop restarting after failing repeatedly. My setup is a two-node cluster with four V100 32GB GPUs per node.