vilmara opened this issue 5 years ago
It looks like dask_xgboost is not handling boolean columns. I dropped the boolean column 'day_of_week' (which wasn't needed) and it worked, but why can't dask_xgboost handle boolean columns?
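For reference, a minimal sketch of the work-around, assuming the dask_cudf dataframe in the notebook is named df:
# drop the unneeded boolean column before handing the frame to dask_xgboost
df = df.drop('day_of_week')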
Hey @vilmara!
I'm working with Ty on the Boolean issue and have raised it with our xgboost and dask teams, who are looking into it. I'll let you know the resolution and when to expect it. For now, I'll drop the "day_of_week" boolean column in a PR (or you could PR it yourself and it would count as an awesome community contribution!).
Are you still having the out of memory issue?
Hi @taureandyernv, thanks for your prompt reply. Here are some comments and questions regarding the NYCTaxi notebook:
1- Sent PR #215 for the error Boolean is not supported
2- Error RMM_ERROR_OUT_OF_MEMORY. I fixed it by passing the flags --device-memory-limit 12GB --memory-limit 12GB on each node (see the cluster sketch after this list).
3- What does the warning below mean and how can I eliminate it:
WARNING: /conda/condabld/xgboost_1571337679414/work/include/xgboost/generic_parameters.h:28: n_gpus: Deprecated. Single process multi-GPU training is no longer supported. Please switch to distributed training with one process per GPU. This can be done using Dask or Spark. See documentation for details.
4- How can I implement the RAPIDS Memory Manager (RMM) functionality on RAPIDS_v0.10? I used the previous technique and I am getting the error ModuleNotFoundError: No module named 'librmm_cffi'
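Rough sketch of how those memory limits look when building the cluster in Python with dask_cuda's LocalCUDACluster (single-node illustration only; on the multi-node setup the same limits are passed as flags to each dask-cuda-worker):
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# device_memory_limit: GPU memory per worker before spilling to host
# memory_limit: host memory per worker before spilling further
cluster = LocalCUDACluster(device_memory_limit='12GB', memory_limit='12GB')
client = Client(cluster)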
3- What does the warning below mean and how can I eliminate it:
WARNING: /conda/condabld/xgboost_1571337679414/work/include/xgboost/generic_parameters.h:28: n_gpus: Deprecated. Single process multi-GPU training is no longer supported. Please switch to distributed training with one process per GPU. This can be done using Dask or Spark. See documentation for details.
The updated XGBoost should be able to determine the number of GPUs automatically based on the client and does not require the n_gpus param. Removing it altogether should work. Feel free to inform us if things don't work as expected in that case.
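For example, a rough sketch of the training call with n_gpus removed (the param values and the client/X_train/Y_train names here are illustrative assumptions, not necessarily the notebook's exact ones):
import dask_xgboost as dxgb_gpu

params = {
    'objective': 'reg:squarederror',
    'max_depth': 8,
    'tree_method': 'gpu_hist',   # GPU training; no 'n_gpus' entry needed anymore
}
# one Dask worker per GPU handles the distribution
bst = dxgb_gpu.train(client, params, X_train, Y_train, num_boost_round=100)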
4- How can I implement the RAPIDS Memory Manager (RMM) functionality on RAPIDS_v0.10? I used the previous technique and I am getting the error ModuleNotFoundError: No module named 'librmm_cffi'
The RMM imports have changed. Updating the imports to import rmm and from rmm import rmm_config should do the trick.
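In other words, a small sketch of the import change (the exact pre-0.10 import shown in the comment is an assumption based on the librmm_cffi error above):
# old style, which now fails with ModuleNotFoundError: No module named 'librmm_cffi'
#   from librmm_cffi import librmm, librmm_config
# new style on v0.10:
import rmm
from rmm import rmm_config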
Hey @vilmara, I have been running the notebook on a 2x GPU system, so it's taking me a bit longer per iteration than I think it takes you or Ty :). Just a quick reply...
A few things changed in v0.10 and I'm working with the community (like you!) and devs to iron out any wrinkles.
Hi @taureandyernv / @ayushdg,
The updated XGBoost should be able to determine the number of GPUs automatically based on the client and does not require the n_gpus param. Removing it altogether should work. Feel free to inform us if things don't work as expected in that case.
Thanks, it eliminated the WARNING: Deprecated. Single process multi-GPU training is no longer supported
The RMM imports have changed. Updating the imports to import rmm and from rmm import rmm_config should do the trick.
I have updated the imports, and now I am getting a different error: AttributeError: module 'cudf' has no attribute 'rmm'
Hey @vilmara, I have been running the notebook on a 2x GPU system, so it's taking me a bit longer per iteration than I think it takes you or Ty :). Just a quick reply...
I am using 2 nodes with 4x V100-16GB each; the total ETL cycle is very quick on my system.
2. Dask should be taking care of the memory management. Are you using a shared system? I am still chugging along...
Do you mean it isn't required to explicitly handle the RMM functionality with the helper functions initialize_rmm_pool(), initialize_rmm_no_pool(), and finalize_rmm()?
The RMM imports have changed. Updating the imports to import rmm and from rmm import rmm_config should do the trick.
I have updated the imports, and now I am getting a different error: AttributeError: module 'cudf' has no attribute 'rmm'
Could you share the exact import command you used? The error message implies looking for rmm in cudf, though rmm is a separate module.
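If the old calls still go through cudf, that would explain it; a minimal sketch of the change, assuming the previous notebook code used cudf.rmm:
# old style, which now raises: AttributeError: module 'cudf' has no attribute 'rmm'
#   cudf.rmm.initialize()
#   cudf.rmm.finalize()
# rmm is a standalone module now, so call it directly:
import rmm
rmm.initialize()
# ... run the workload ...
rmm.finalize()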
Dask should be taking care of the memory management. Are you using a shared system? I am still chugging along...
Do you mean it isn't required to explicitly handle the RMM functionality with the helper functions initialize_rmm_pool(), initialize_rmm_no_pool(), and finalize_rmm()?
Dask handles memory in the sense of partitioning the DataFrame, etc. Running ETL by default will do the operations without using pool mode for the underlying CUDA memory management. The RMM pool step enables pool mode for the underlying GPU memory and might improve ETL performance.
I have modified the NYCTaxi-E2E notebook to use RMM with these changes:
import rmm
from rmm import rmm_config as rmm_cfg

def initialize_rmm_pool():
    # enable RMM's pooled (pre-allocated) GPU memory mode
    rmm_cfg.use_pool_allocator = True
    return rmm.initialize()

def initialize_rmm_no_pool():
    # use the default, non-pooled allocator
    rmm_cfg.use_pool_allocator = False
    return rmm.initialize()

def finalize_rmm():
    return rmm.finalize()
Notice that cudf is dropped: the calls change from cudf.rmm.initialize() and cudf.rmm.finalize() to rmm.initialize() and rmm.finalize().
Just before the remap section of code, add:
client.run(initialize_rmm_pool)
# list of column names that need to be re-mapped
remap = {}
At the end, add:
# compute the actual RMSE over the full test set
print(math.sqrt(Y_test.squared_error.mean().compute()))
client.run(finalize_rmm)
I tested the df = df.drop('day_of_week') work-around (thanks @vilmara!), removed n_gpus from params before the train call, and the example works now with RAPIDS 0.10.
Thanks so much!!! I actually have a small edit to the PR though, since when it goes through the next iteration it sends me the RMM error.
Hi @taureandyernv, after implementing the recommendations mentioned in this issue (thanks @tym1062 for the update), I got the code working for the first iteration, but the second iteration sends me the same RMM memory error you got; see below:
RuntimeError: RMM error encountered at: /conda/conda-bld/libcudf_1571332820798/work/cpp/src/io/utilities/wrapper_utils.hpp:75: 4 RMM_ERROR_OUT_OF_MEMORY
Awesome! Thanks @tym1062 for sharing the snippet. Could you check if RMM pool mode really gets initialized after calling initialize_rmm_pool? (You can visibly see memory allocated on the GPU with nvidia-smi or the GPU dashboard.)
The newer version of RMM might require you to call finalize_rmm first, followed by the initialize_rmm_pool step, to actually initialize the GPU memory pool.
Thanks @ayushdg, you are correct: you need to finalize RMM before initializing the RMM pool (I checked via nvidia-smi). Here is the correct way:
Just before the remap section of code, add:
client.run(finalize_rmm)
client.run(initialize_rmm_pool)
# list of column names that need to be re-mapped
remap = {}
For the Out-of-Memory issue, @vilmara, have you tried using less csv data from the taxi datasets? I can generate OOM by using too much data (years Jan-2014 through Jun-2016) on my system with 4x 32GB GV100.
Thanks @tym1062, I have fixed the OOM error after the second iteration by increasing the device memory limit as shown below (my new system has 4x V100-32GB):
cluster = LocalCUDACluster(ip=sched_ip, n_workers=num_gpu, device_memory_limit='30000 MiB')
@taureandyernv / @ayushdg, thanks for your support. Now that the NYCTaxi-E2E notebook is working without issues and with RMM functionality on RAPIDS_v0.10, will NVIDIA update the notebook, or do we need to create a PR?
@JohnZed can we merge under Vilmara's PR? (I can also update it.)
@vilmara Congrats on your new system!! I'm running 1 node with 2x GV100s, 32GB each, and using a local Dask CUDA cluster :) It took my system nearly an hour from the start to get back to the Dask XGBoost training with all the data downloads. Can you add the fix for the RMM issue after we merge the RAPIDS solution into your PR? I'll merge after that. You rock!
@ayushdg thanks for sharing the great solutions!
Hi @taureandyernv,
For the Out-of-Memory issue, @vilmara, have you tried using less csv data from the taxi datasets? I can generate OOM by using too much data (years Jan-2014 through Jun-2016) on my system with 4x 32GB GV100.
What was the largest data size you were able to handle with your system 4x 32GB GV100 before it generated OOM?
Describe the bug
The NYCTaxi-E2E notebook is throwing boolean and RMM errors.
Steps/Code to reproduce bug
Run the notebook via the docker image rapidsai/rapidsai:0.10-cuda10.1-runtime-ubuntu18.04
Environment details (please complete the following information):
docker run --gpus all --rm -it --net=host -p 8888:8888 -p 8787:8787 -p 8786:8786 -v /home/rapids/notebooks-contrib/:/rapids/notebooks/contrib/ -v /home/rapids/data/:/data/ rapidsai/rapidsai:0.10-cuda10.1-runtime-ubuntu18.04