vilmara opened this issue 5 years ago
It looks like dask_xgboost is not handling boolean columns. I dropped the boolean column 'day_of_week' (which wasn't needed) and it worked, but why can't dask_xgboost handle boolean columns?
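For reference, a minimal sketch of the work-around, assuming the dask_cudf dataframe in the notebook is named df:
# drop the unneeded boolean column before handing the frame to dask_xgboost
df = df.drop('day_of_week')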
Hey @vilmara!
I'm working with Ty on the Boolean issue and have raised it with our xgboost and dask teams, who are looking into it. I'll let you know the resolution and when to expect it. For now, I'll drop the "day_of_week" boolean column in a PR (or you could PR it yourself and it would count as an awesome community contribution!).
Are you still having the out of memory issue?
Hi @taureandyernv, thanks for your prompt reply. Here are some comments and questions regarding the NYCTaxi notebook:
1- Sent PR #215 for the error Boolean is not supported
2- Error RMM_ERROR_OUT_OF_MEMORY. I fixed it by passing the flags --device-memory-limit 12GB --memory-limit 12GB on each node (see the cluster sketch after this list).
3- What does the warning below mean and how can I eliminate it:
WARNING: /conda/condabld/xgboost_1571337679414/work/include/xgboost/generic_parameters.h:28: n_gpus: Deprecated. Single process multi-GPU training is no longer supported. Please switch to distributed training with one process per GPU. This can be done using Dask or Spark. See documentation for details.
4- How can I implement the RAPIDS Memory Manager (RMM) functionality on RAPIDS_v0.10? I used the previous technique and I am getting the error ModuleNotFoundError: No module named 'librmm_cffi'
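Rough sketch of how those memory limits look when building the cluster in Python with dask_cuda's LocalCUDACluster (single-node illustration only; on the multi-node setup the same limits are passed as flags to each dask-cuda-worker):
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# device_memory_limit: GPU memory per worker before spilling to host
# memory_limit: host memory per worker before spilling further
cluster = LocalCUDACluster(device_memory_limit='12GB', memory_limit='12GB')
client = Client(cluster)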
3- What does the warning below mean and how can I eliminate it:
WARNING: /conda/condabld/xgboost_1571337679414/work/include/xgboost/generic_parameters.h:28: n_gpus: Deprecated. Single process multi-GPU training is no longer supported. Please switch to distributed training with one process per GPU. This can be done using Dask or Spark. See documentation for details.
The updated XGBoost should be able to determine the number of GPUs automatically based on the client and does not require the n_gpus param. Removing it altogether should work. Feel free to inform us if things don't work as expected in that case.
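For example, a rough sketch of the training call with n_gpus removed (the param values and the client/X_train/Y_train names here are illustrative assumptions, not necessarily the notebook's exact ones):
import dask_xgboost as dxgb_gpu

params = {
    'objective': 'reg:squarederror',
    'max_depth': 8,
    'tree_method': 'gpu_hist',   # GPU training; no 'n_gpus' entry needed anymore
}
# one Dask worker per GPU handles the distribution
bst = dxgb_gpu.train(client, params, X_train, Y_train, num_boost_round=100)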
4- How can I implement the RAPIDS Memory Manager (RMM) functionality on RAPIDS_v0.10? I used the previous technique and I am getting the error ModuleNotFoundError: No module named 'librmm_cffi'
The RMM imports have changed. Updating the imports to import rmm and from rmm import rmm_config should do the trick.
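In other words, a small sketch of the import change (the exact pre-0.10 import shown in the comment is an assumption based on the librmm_cffi error above):
# old style, which now fails with ModuleNotFoundError: No module named 'librmm_cffi'
#   from librmm_cffi import librmm, librmm_config
# new style on v0.10:
import rmm
from rmm import rmm_config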
Hey @vilmara, I have been running the notebook on a 2x GPU system, so it's taking me a bit longer per iteration than I think it takes you or Ty :). Just a quick reply...
A few things changed in v0.10 and I'm working with the community (like you!) and devs to iron out any wrinkles.
Hi @taureandyernv / @ayushdg,
The updated XGBoost should be able to determine the number of GPUs automatically based on the client and does not require the n_gpus param. Removing it altogether should work. Feel free to inform us if things don't work as expected in that case.
Thanks, it eliminated the WARNING: Deprecated. Single process multi-GPU training is no longer supported
The RMM imports have changed. Updating the imports to import rmm and from rmm import rmm_config should do the trick.
I have updated the imports, and now I am getting a different error: AttributeError: module 'cudf' has no attribute 'rmm'
Hey @vilmara, I have been running the notebook on a 2x GPU system, so it's taking me a bit longer per iteration than I think it takes you or Ty :). Just a quick reply...
I am using 2 nodes with 4x V100-16GB each; the total ETL cycle is very quick on my system.
2. Dask should be taking care of the memory management. Are you using a shared system? I am still chugging along...
Do you mean it isn't required to explicitly handle the RMM functionality with the helper functions initialize_rmm_pool(), initialize_rmm_no_pool(), and finalize_rmm()?
The RMM imports have changed. Updating the imports to import rmm and from rmm import rmm_config should do the trick.
I have updated the imports, and now I am getting a different error: AttributeError: module 'cudf' has no attribute 'rmm'
Could you share the exact import command you used? The error message implies looking for rmm in cudf, though rmm is a separate module.
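If the old calls still go through cudf, that would explain it; a minimal sketch of the change, assuming the previous notebook code used cudf.rmm:
# old style, which now raises: AttributeError: module 'cudf' has no attribute 'rmm'
#   cudf.rmm.initialize()
#   cudf.rmm.finalize()
# rmm is a standalone module now, so call it directly:
import rmm
rmm.initialize()
# ... run the workload ...
rmm.finalize()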
Dask should be taking care of the memory management. Are you using a shared system? I am still chugging along...
Do you mean it isn't required to explicitly handle the RMM functionality with the helper functions initialize_rmm_pool(), initialize_rmm_no_pool(), and finalize_rmm()?
Dask handles memory in the sense of partitioning the DataFrame, etc. Running ETL by default will do the operations without using pool mode for the underlying CUDA memory management. The RMM pool step enables pool mode for the underlying GPU memory and might improve ETL performance.
I have modified the NYCTaxi-E2E notebook to use RMM with these changes:
import rmm
from rmm import rmm_config as rmm_cfg

def initialize_rmm_pool():
    # enable RMM's pooled (pre-allocated) GPU memory mode
    rmm_cfg.use_pool_allocator = True
    return rmm.initialize()

def initialize_rmm_no_pool():
    # use the default, non-pooled allocator
    rmm_cfg.use_pool_allocator = False
    return rmm.initialize()

def finalize_rmm():
    return rmm.finalize()
Notice that cudf is dropped: the calls change from cudf.rmm.initialize() and cudf.rmm.finalize() to rmm.initialize() and rmm.finalize().
Just before the remap section of code, add:
client.run(initialize_rmm_pool)
# list of column names that need to be re-mapped
remap = {}
At the end, add:
# compute the actual RMSE over the full test set
print(math.sqrt(Y_test.squared_error.mean().compute()))
client.run(finalize_rmm)
I tested the df = df.drop('day_of_week') work-around (thanks @vilmara!), removed n_gpus from params before the train call, and the example works now with RAPIDS 0.10.
Thanks so much!!! I actually have a small edit to the PR though, since when it goes through the next iteration it sends me the RMM error.
Hi @taureandyernv, after implementing the recommendations mentioned in this issue (thanks @tym1062 for the update), I got the code working for the first iteration, but the second iteration sends me the same RMM memory error you got; see below:
RuntimeError: RMM error encountered at: /conda/conda-bld/libcudf_1571332820798/work/cpp/src/io/utilities/wrapper_utils.hpp:75: 4 RMM_ERROR_OUT_OF_MEMORY
Awesome! Thanks @tym1062 for sharing the snippet. Could you check if RMM pool mode really gets initialized after calling initialize_rmm_pool? (You can visibly see memory allocated on the GPU with nvidia-smi or the GPU dashboard.)
The newer version of RMM might require you to call finalize_rmm first, followed by the initialize_rmm_pool step, to actually initialize the GPU memory pool.
Thanks @ayushdg, you are correct: you need to finalize RMM before initializing the RMM pool (I checked via nvidia-smi). Here is the correct way:
Just before the remap section of code, add:
client.run(finalize_rmm)
client.run(initialize_rmm_pool)
# list of column names that need to be re-mapped
remap = {}
For the Out-of-Memory issue, @vilmara, have you tried using less csv data from the taxi datasets? I can generate OOM by using too much data (years Jan-2014 through Jun-2016) on my system with 4x 32GB GV100.
Thanks @tym1062, I have fixed the OOM error after the second iteration by increasing the device memory limit as shown below (my new system has 4x V100-32GB):
cluster = LocalCUDACluster(ip=sched_ip, n_workers=num_gpu, device_memory_limit='30000 MiB')
@taureandyernv / @ayushdg, thanks for your support. Now that the NYCTaxi-E2E notebook is working without issues and with RMM functionality on RAPIDS_v0.10, will NVIDIA update the notebook, or do we need to create a PR?
@JohnZed can we merge under Vilmara's PR? (I can also update it.)
@vilmara Congrats on your new system!! I'm running 1 node with 2x GV100s, 32GB each, and using a local Dask CUDA cluster :) It took my system nearly an hour from the start to get back to the Dask XGBoost training with all the data downloads. Can you add the fix for the RMM issue after we merge the RAPIDS solution into your PR? I'll merge after that. You rock!
@ayushdg thanks for sharing the great solutions!
Hi @taureandyernv,
For the Out-of-Memory issue, @vilmara, have you tried using less csv data from the taxi datasets? I can generate OOM by using too much data (years Jan-2014 through Jun-2016) on my system with 4x 32GB GV100.
What was the largest data size you were able to handle with your system 4x 32GB GV100 before it generated OOM?
Describe the bug
The NYCTaxi-E2E notebook is throwing boolean and RMM errors.
Steps/Code to reproduce bug
Run the notebook via the docker image rapidsai/rapidsai:0.10-cuda10.1-runtime-ubuntu18.04
Environment details (please complete the following information):
docker run --gpus all --rm -it --net=host -p 8888:8888 -p 8787:8787 -p 8786:8786 -v /home/rapids/notebooks-contrib/:/rapids/notebooks/contrib/ -v /home/rapids/data/:/data/ rapidsai/rapidsai:0.10-cuda10.1-runtime-ubuntu18.04