rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[QST] Dask scheduler with MNMG RFR codebase #3066

Closed: mikacashman closed this issue 3 years ago

mikacashman commented 3 years ago

Hi there, I have been working with NVIDIA RAPIDS support at ORNL over Slack and was advised to post my issues here.

I have been observing flaky behavior when running an MNMG version of the RF regression code I am working with (using Dask). I have run multiple back-to-back runs on two different data sets (at least 5 runs on each) and recorded the behavior. Below are reports on three such flaky errors I frequently encounter (some fatal to functionality, some not). The non-fatal ones do not always kill the job, which makes them difficult to manage.

Error#1 - 0 workers (sometimes I end up with 0 workers when I should have 6)

Client information:  <Client: 'tcp://10.41.0.41:5749' processes=0 threads=0, memory=0 B> 
workers: dict_keys([]) 
n_workers: 0
[...]
number of paritions = number of workers = 0 
distributed.scheduler - INFO - Remove client Client-17a28da4-17a6-11eb-b90d-70e284144aab 
distributed.scheduler - INFO - Remove client Client-17a28da4-17a6-11eb-b90d-70e284144aab 
distributed.scheduler - INFO - Close client connection: Client-17a28da4-17a6-11eb-b90d-70e284144aab 
Traceback (most recent call last): 
  File "rfr_mnmg_V2.py", line 169, in <module> 
    main() 
  File "rfr_mnmg_V2.py", line 104, in main 
    X_train_dask = dask_cudf.from_cudf(X_train, npartitions=n_partitions) 
  File "/gpfs/alpine/world-shared/stf011/nvrapids_0.14_gcc_7.4.0/lib/python3.7/site-packages/dask_cudf/core.py", line 643, in from_cudf 
    name=name, 
  File "/gpfs/alpine/world-shared/stf011/nvrapids_0.14_gcc_7.4.0/lib/python3.7/site-packages/dask/dataframe/io/io.py", line 206, in from_pandas 
    chunksize = int(ceil(nrows / npartitions)) 
ZeroDivisionError: division by zero 
distributed.scheduler - INFO - End scheduler at 'tcp://10.41.0.41:5749' 
[...]
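
A pattern that may help with Error#1 (my sketch, not from the original code): block until the expected workers have registered before computing a partition count, instead of reading the count once right after startup. Recent dask.distributed versions expose `client.wait_for_workers(...)` for this; the same idea as a self-contained helper, where `get_count` would be something like `lambda: len(client.scheduler_info()["workers"])`:

```python
import time

def wait_for_workers(get_count, expected, timeout=120.0, poll=2.0):
    """Poll until get_count() reports at least `expected` workers.

    Raises TimeoutError instead of letting a zero worker count flow
    into npartitions and trigger the ZeroDivisionError above.
    """
    deadline = time.monotonic() + timeout
    while True:
        n = get_count()
        if n >= expected:
            return n
        if time.monotonic() >= deadline:
            raise TimeoutError(f"only {n} of {expected} workers registered")
        time.sleep(poll)
```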

Error#2 - Address in use (similar trace repeated for 5/6 workers in this example)

OSError: [Errno 98] Address already in use 
distributed.scheduler - INFO - Clear task state 
distributed.scheduler - INFO - Clear task state 
tornado.application - ERROR - Exception in callback functools.partial(<function TCPServer._handle_connection.<locals>.<lambda> at 0x200213d997a0>, <Task finished coro=<BaseTCPListener._handle_stream() done, defined at /gpfs/alpine/world-shared/stf011/nvrapids_0.14_gcc_7.4.0/lib/python3.7/site-packages/distributed/comm/tcp.py:437> exception=OSError(98, 'Address already in use')>) 
Traceback (most recent call last): 
  File "/gpfs/alpine/world-shared/stf011/nvrapids_0.14_gcc_7.4.0/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback 
    ret = callback() 
  File "/gpfs/alpine/world-shared/stf011/nvrapids_0.14_gcc_7.4.0/lib/python3.7/site-packages/tornado/tcpserver.py", line 327, in <lambda> 
    gen.convert_yielded(future), lambda f: f.result() 
  File "/gpfs/alpine/world-shared/stf011/nvrapids_0.14_gcc_7.4.0/lib/python3.7/site-packages/distributed/comm/tcp.py", line 447, in _handle_stream 
    await self.comm_handler(comm) 
  File "/gpfs/alpine/world-shared/stf011/nvrapids_0.14_gcc_7.4.0/lib/python3.7/site-packages/distributed/core.py", line 443, in handle_comm 
    await self 
  File "/gpfs/alpine/world-shared/stf011/nvrapids_0.14_gcc_7.4.0/lib/python3.7/site-packages/distributed/core.py", line 290, in _ 
    await self.start() 
  File "/gpfs/alpine/world-shared/stf011/nvrapids_0.14_gcc_7.4.0/lib/python3.7/site-packages/distributed/scheduler.py", line 1476, in start 
    addr, allow_offload=False, **self.security.get_listen_args("scheduler") 
  File "/gpfs/alpine/world-shared/stf011/nvrapids_0.14_gcc_7.4.0/lib/python3.7/site-packages/distributed/core.py", line 420, in listen 
    **kwargs, 
  File "/gpfs/alpine/world-shared/stf011/nvrapids_0.14_gcc_7.4.0/lib/python3.7/site-packages/distributed/comm/core.py", line 172, in _ 
    await self.start() 
  File "/gpfs/alpine/world-shared/stf011/nvrapids_0.14_gcc_7.4.0/lib/python3.7/site-packages/distributed/comm/tcp.py", line 413, in start 
    self.port, address=self.ip, backlog=backlog 
  File "/gpfs/alpine/world-shared/stf011/nvrapids_0.14_gcc_7.4.0/lib/python3.7/site-packages/tornado/netutil.py", line 174, in bind_sockets 
    sock.bind(sockaddr)
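
For Error#2: guessing ports with `shuf` (as the job script below does) can still collide when several schedulers share a batch node. A sketch of an alternative (my suggestion, not from the thread) that asks the OS for a port it knows is currently free:

```python
import socket

def find_free_port(host="127.0.0.1"):
    """Bind to port 0 so the kernel assigns an unused TCP port; return it.

    There is still a small race (the port could be taken between this call
    and dask-scheduler binding it), but the window is far narrower than a
    blind shuf guess from a 2000-port range.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]
```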

Error#3 - CommClosedError (this one doesn't appear fatal)

[...]
tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: 'tcp://10.41.0.45:4048' processes=6 threads=6, memory=510.00 GB>> 
Traceback (most recent call last): 
  File "/gpfs/alpine/world-shared/stf011/nvrapids_0.14_gcc_7.4.0/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run 
    return self.callback() 
  File "/gpfs/alpine/world-shared/stf011/nvrapids_0.14_gcc_7.4.0/lib/python3.7/site-packages/distributed/client.py", line 1165, in _heartbeat 
    self.scheduler_comm.send({"op": "heartbeat-client"}) 
  File "/gpfs/alpine/world-shared/stf011/nvrapids_0.14_gcc_7.4.0/lib/python3.7/site-packages/distributed/batched.py", line 117, in send 
    raise CommClosedError 
distributed.comm.core.CommClosedError 
distributed.scheduler - INFO - End scheduler at 'tcp://10.41.0.45:4048'

Some run information, in addition to the environment details at the end: this is run on the Summit supercomputer at ORNL (PowerPC, V100 GPUs). My run script uses a Dask scheduler with workers on 6 GPU nodes. I sleep for 60 seconds after starting dask-scheduler, and for another 60 seconds after the jsrun command, before running the RAPIDS Python script. I can provide further information or code if needed.

jsrun -c 1 -g 1 -n 6 -r 6 -a 1 --bind rs --smpiargs="off" dask-cuda-worker --scheduler-file ${dask_dir}/my-scheduler.json --local-directory ${dask_dir} --nthreads 1 --memory-limit 85GB --device-memory-limit 30GB  --death-timeout 180 --interface ib0 --enable-nvlink
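
On the script side, deriving the partition count from the workers that actually registered (rather than a hard-coded worker count) turns the silent zero in Error#1 into an immediate, descriptive failure. An illustrative sketch; the actual `rfr_mnmg_V2.py` is not shown in this thread:

```python
def choose_npartitions(worker_addresses):
    """Return a partition count based on the registered Dask workers.

    `worker_addresses` would come from client.scheduler_info()["workers"];
    an empty mapping means the jsrun workers never connected, which is the
    condition behind the ZeroDivisionError in dask_cudf.from_cudf.
    """
    n = len(worker_addresses)
    if n == 0:
        raise RuntimeError("no dask workers registered; check the worker launch step")
    return n
```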

I have two further questions (feel free to direct me to open separate issues if desired).

Thanks for any guidance in advance.

Environment details:

     ***git***
     Not inside a git repository

     ***OS Information***
     NAME="Red Hat Enterprise Linux Server"
     VERSION="7.6 (Maipo)"
     ID="rhel"
     ID_LIKE="fedora"
     VARIANT="Server"
     VARIANT_ID="server"
     VERSION_ID="7.6"
     PRETTY_NAME="Red Hat Enterprise Linux Server 7.6 (Maipo)"
     ANSI_COLOR="0;31"
     CPE_NAME="cpe:/o:redhat:enterprise_linux:7.6:GA:server"
     HOME_URL="https://www.redhat.com/"
     BUG_REPORT_URL="https://bugzilla.redhat.com/"

     REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
     REDHAT_BUGZILLA_PRODUCT_VERSION=7.6
     REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
     REDHAT_SUPPORT_PRODUCT_VERSION="7.6"
     Red Hat Enterprise Linux Server release 7.6 (Maipo)
     Red Hat Enterprise Linux Server release 7.6 (Maipo)
     Linux login1 4.14.0-115.21.2.el7a.ppc64le #1 SMP Thu May 7 22:22:31 UTC 2020 ppc64le ppc64le ppc64le GNU/Linux

     ***GPU Information***
     Tue Oct 27 15:52:38 2020
     +-----------------------------------------------------------------------------+
     | NVIDIA-SMI 418.116.00   Driver Version: 418.116.00   CUDA Version: 10.1     |
     |-------------------------------+----------------------+----------------------+
     | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
     | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
     |===============================+======================+======================|
     |   0  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    2 |
     | N/A   36C    P0    37W / 300W |      0MiB / 16130MiB |      0%   E. Process |
     +-------------------------------+----------------------+----------------------+
     |   1  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
     | N/A   42C    P0    38W / 300W |      0MiB / 16130MiB |      0%   E. Process |
     +-------------------------------+----------------------+----------------------+

     +-----------------------------------------------------------------------------+
     | Processes:                                                       GPU Memory |
     |  GPU       PID   Type   Process name                             Usage      |
     |=============================================================================|
     |  No running processes found                                                 |
     +-----------------------------------------------------------------------------+

     ***CPU***
     Architecture:          ppc64le
     Byte Order:            Little Endian
     CPU(s):                128
     On-line CPU(s) list:   0-127
     Thread(s) per core:    4
     Core(s) per socket:    16
     Socket(s):             2
     NUMA node(s):          6
     Model:                 2.1 (pvr 004e 1201)
     Model name:            POWER9, altivec supported
     CPU max MHz:           3800.0000
     CPU min MHz:           2300.0000
     L1d cache:             32K
     L1i cache:             32K
     L2 cache:              512K
     L3 cache:              10240K
     NUMA node0 CPU(s):     0-63
     NUMA node8 CPU(s):     64-127
     NUMA node252 CPU(s):
     NUMA node253 CPU(s):
     NUMA node254 CPU(s):
     NUMA node255 CPU(s):

     ***CMake***
which: no cmake in (/gpfs/alpine/syb105/proj-shared/Personal/mcashman/Apps/summit/bin:/gpfs/alpine/syb105/proj-shared/Personal/mcashman/Apps/summit/condabin:/sw/sources/lsf-tools/2.0/summit/bin:/sw/summit/xalt/1.2.0/bin:/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/darshan-runtime-3.1.7-cnvxicgf5j4ap64qi6v5gxp67hmrjz43/bin:/sw/sources/hpss/bin:/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/xl-16.1.1-5/spectrum-mpi-10.3.1.2-20200121-p6nrnt6vtvkn356wqg6f74n6jspnpjd2/bin:/sw/summit/xl/16.1.1-5/xlC/16.1.1/bin:/sw/summit/xl/16.1.1-5/xlf/16.1.1/bin:/opt/ibm/spectrumcomputing/lsf/10.1.0.9/linux3.10-glibc2.17-ppc64le-csm/etc:/opt/ibm/spectrumcomputing/lsf/10.1.0.9/linux3.10-glibc2.17-ppc64le-csm/bin:/opt/ibm/csm/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibm/flightlog/bin:/opt/ibutils/bin:/opt/ibm/spectrum_mpi/jsm_pmix/bin:/opt/puppetlabs/bin:/usr/lpp/mmfs/bin)

     ***g++***
     /usr/bin/g++
     g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-37)
     Copyright (C) 2015 Free Software Foundation, Inc.
     This is free software; see the source for copying conditions.  There is NO
     warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

     ***nvcc***
which: no nvcc in (/gpfs/alpine/syb105/proj-shared/Personal/mcashman/Apps/summit/bin:/gpfs/alpine/syb105/proj-shared/Personal/mcashman/Apps/summit/condabin:/sw/sources/lsf-tools/2.0/summit/bin:/sw/summit/xalt/1.2.0/bin:/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/darshan-runtime-3.1.7-cnvxicgf5j4ap64qi6v5gxp67hmrjz43/bin:/sw/sources/hpss/bin:/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/xl-16.1.1-5/spectrum-mpi-10.3.1.2-20200121-p6nrnt6vtvkn356wqg6f74n6jspnpjd2/bin:/sw/summit/xl/16.1.1-5/xlC/16.1.1/bin:/sw/summit/xl/16.1.1-5/xlf/16.1.1/bin:/opt/ibm/spectrumcomputing/lsf/10.1.0.9/linux3.10-glibc2.17-ppc64le-csm/etc:/opt/ibm/spectrumcomputing/lsf/10.1.0.9/linux3.10-glibc2.17-ppc64le-csm/bin:/opt/ibm/csm/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibm/flightlog/bin:/opt/ibutils/bin:/opt/ibm/spectrum_mpi/jsm_pmix/bin:/opt/puppetlabs/bin:/usr/lpp/mmfs/bin)

     ***Python***
     /gpfs/alpine/syb105/proj-shared/Personal/mcashman/Apps/summit/bin/python
     Python 3.8.3

     ***Environment Variables***
     PATH                            : /gpfs/alpine/syb105/proj-shared/Personal/mcashman/Apps/summit/bin:/gpfs/alpine/syb105/proj-shared/Personal/mcashman/Apps/summit/condabin:/sw/sources/lsf-tools/2.0/summit/bin:/sw/summit/xalt/1.2.0/bin:/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/darshan-runtime-3.1.7-cnvxicgf5j4ap64qi6v5gxp67hmrjz43/bin:/sw/sources/hpss/bin:/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/xl-16.1.1-5/spectrum-mpi-10.3.1.2-20200121-p6nrnt6vtvkn356wqg6f74n6jspnpjd2/bin:/sw/summit/xl/16.1.1-5/xlC/16.1.1/bin:/sw/summit/xl/16.1.1-5/xlf/16.1.1/bin:/opt/ibm/spectrumcomputing/lsf/10.1.0.9/linux3.10-glibc2.17-ppc64le-csm/etc:/opt/ibm/spectrumcomputing/lsf/10.1.0.9/linux3.10-glibc2.17-ppc64le-csm/bin:/opt/ibm/csm/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibm/flightlog/bin:/opt/ibutils/bin:/opt/ibm/spectrum_mpi/jsm_pmix/bin:/opt/puppetlabs/bin:/usr/lpp/mmfs/bin
     LD_LIBRARY_PATH                 : /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/darshan-runtime-3.1.7-cnvxicgf5j4ap64qi6v5gxp67hmrjz43/lib:/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/xl-16.1.1-5/spectrum-mpi-10.3.1.2-20200121-p6nrnt6vtvkn356wqg6f74n6jspnpjd2/lib:/sw/summit/xl/16.1.1-5/xlsmp/5.1.1/lib:/sw/summit/xl/16.1.1-5/xlmass/9.1.1/lib:/sw/summit/xl/16.1.1-5/xlC/16.1.1/lib:/sw/summit/xl/16.1.1-5/xlf/16.1.1/lib:/sw/summit/xl/16.1.1-5/lib:/opt/ibm/spectrumcomputing/lsf/10.1.0.9/linux3.10-glibc2.17-ppc64le-csm/lib
     NUMBAPRO_NVVM                   :
     NUMBAPRO_LIBDEVICE              :
     CONDA_PREFIX                    : /gpfs/alpine/syb105/proj-shared/Personal/mcashman/Apps/summit
     PYTHON_PATH                     :

     ***conda packages***
     /gpfs/alpine/syb105/proj-shared/Personal/mcashman/Apps/summit/bin/conda
     # packages in environment at /gpfs/alpine/syb105/proj-shared/Personal/mcashman/Apps/summit:
     #
     # Name                    Version                   Build  Channel
     _ipyw_jlab_nb_ext_conf    0.1.0                    py38_0
     _libgcc_mutex             0.1                        main
     alabaster                 0.7.12                     py_0
     anaconda                  2020.07                  py38_0
     anaconda-client           1.7.2                    py38_0
     anaconda-project          0.8.4                      py_0
     asn1crypto                1.3.0                    py38_0
     astroid                   2.4.2                    py38_0
     astropy                   4.0.1.post1      py38h7b6447c_1
     attrs                     19.3.0                     py_0
     babel                     2.8.0                      py_0
     backcall                  0.2.0                      py_0
     backports                 1.0                        py_2
     backports.functools_lru_cache 1.6.1                      py_0
     backports.shutil_get_terminal_size 1.0.0                    py38_2
     backports.tempfile        1.0                        py_1
     backports.weakref         1.0.post1                  py_1
     beautifulsoup4            4.9.1                    py38_0
     bitarray                  1.4.0            py38h7b6447c_0
     bkcharts                  0.2                      py38_0
     blas                      1.0                    openblas
     bleach                    3.1.5                      py_0
     blosc                     1.19.0               hd408876_0
     bokeh                     2.1.1                    py38_0
     boto                      2.49.0                   py38_0
     bottleneck                1.3.2            py38heb32a55_0
     brotlipy                  0.7.0           py38h7b6447c_1000
     bzip2                     1.0.8                h7b6447c_0
     ca-certificates           2020.6.24                     0
     cairo                     1.14.12              h8948797_3
     certifi                   2020.6.20                py38_0
     cffi                      1.14.0           py38he30daa8_1
     chardet                   3.0.4                 py38_1003
     click                     7.1.2                      py_0
     cloudpickle               1.5.0                      py_0
     clyent                    1.2.2                    py38_1
     colorama                  0.4.3                      py_0
     conda                     4.9.1            py38h6ffa863_0
     conda-build               3.18.11                  py38_0
     conda-env                 2.6.0                         1
     conda-package-handling    1.6.1            py38h7b6447c_0
     conda-verify              3.4.2                      py_1
     contextlib2               0.6.0.post1                py_0
     cryptography              2.9.2            py38h1ba5d50_0
     curl                      7.71.1               hbc83047_1
     cycler                    0.10.0                   py38_0
     cython                    0.29.21          py38he6710b0_0
     cytoolz                   0.10.1           py38h7b6447c_0
     dask                      2.20.0                     py_0
     dask-core                 2.20.0                     py_0
     decorator                 4.4.2                      py_0
     defusedxml                0.6.0                      py_0
     distributed               2.20.0                   py38_0
     docutils                  0.16                     py38_1
     entrypoints               0.3                      py38_0
     et_xmlfile                1.0.1                   py_1001
     expat                     2.2.9                he6710b0_2
     fastcache                 1.1.0            py38h7b6447c_0
     filelock                  3.0.12                     py_0
     flask                     1.1.2                      py_0
     fontconfig                2.13.0               h9420a91_0
     freetype                  2.10.2               h5ab3b9f_0
     fsspec                    0.7.4                      py_0
     future                    0.18.2                   py38_1
     get_terminal_size         1.0.0                         0
     gevent                    20.6.2           py38h7b6447c_0
     glib                      2.65.0               h3eb4bd4_0
     glob2                     0.7                        py_0
     gmp                       6.1.2                h7f7056e_2
     gmpy2                     2.0.8            py38hd5f6e3b_3
     greenlet                  0.4.16           py38h7b6447c_0
     h5py                      2.10.0           py38h7918eee_0
     hdf5                      1.10.4               hb1b8bf9_0
     heapdict                  1.0.1                      py_0
     html5lib                  1.1                        py_0
     icu                       58.2                 he6710b0_3
     idna                      2.10                       py_0
     imageio                   2.9.0                      py_0
     imagesize                 1.2.0                      py_0
     importlib-metadata        1.7.0                    py38_0
     importlib_metadata        1.7.0                         0
     ipykernel                 5.3.2            py38h5ca1d4c_0
     ipython                   7.16.1           py38h5ca1d4c_0
     ipython_genutils          0.2.0                    py38_0
     ipywidgets                7.5.1                      py_0
     isort                     4.3.21                   py38_0
     itsdangerous              1.1.0                      py_0
     jbig                      2.1                  h14c3975_0
     jdcal                     1.4.1                      py_0
     jedi                      0.17.1                   py38_0
     jinja2                    2.11.2                     py_0
     joblib                    0.16.0                     py_0
     jpeg                      9b                   hcb7ba68_2
     json5                     0.9.5                      py_0
     jsonschema                3.2.0                    py38_0
     jupyter                   1.0.0                    py38_7
     jupyter_client            6.1.6                      py_0
     jupyter_console           6.1.0                      py_0
     jupyter_core              4.6.1                    py38_0
     jupyterlab                2.1.5                      py_0
     jupyterlab_server         1.2.0                      py_0
     kiwisolver                1.2.0            py38hfd86e86_0
     krb5                      1.18.2               h597af5e_0
     lazy-object-proxy         1.4.3            py38h7b6447c_0
     lcms2                     2.11                 h396b838_0
     ld_impl_linux-ppc64le     2.33.1               h0f24833_7
     libarchive                3.4.2                h62408e4_0
     libcurl                   7.71.1               h20c2e04_1
     libedit                   3.1.20191231         h14c3975_1
     libffi                    3.3                  he6710b0_2
     libgcc-ng                 8.2.0                h822a55f_1
     libgfortran-ng            7.3.0                h822a55f_1
     liblief                   0.10.1               he6710b0_0
     libopenblas               0.3.10               h5a2b251_0
     libpng                    1.6.37               hbc83047_0
     libsodium                 1.0.18               h7b6447c_0
     libssh2                   1.9.0                h1ba5d50_1
     libstdcxx-ng              8.2.0                h822a55f_1
     libtiff                   4.1.0                h2733197_1
     libuuid                   1.0.3                h1bed415_2
     libxcb                    1.14                 h7b6447c_0
     libxml2                   2.9.10               he19cac6_1
     libxslt                   1.1.34               hc22bd24_0
     locket                    0.2.0                    py38_1
     lxml                      4.5.2            py38hefd8a0e_0
     lz4-c                     1.9.2                he6710b0_0
     lzo                       2.10                 h7b6447c_2
     markupsafe                1.1.1            py38h7b6447c_0
     matplotlib                3.2.2                         0
     matplotlib-base           3.2.2            py38h4fdacc2_0
     mccabe                    0.6.1                    py38_1
     mistune                   0.8.4           py38h7b6447c_1000
     mock                      4.0.2                      py_0
     more-itertools            8.4.0                      py_0
     mpc                       1.1.0                h10f8cd9_1
     mpfr                      4.0.2                hb69a4c5_1
     mpmath                    1.1.0                    py38_0
     msgpack-python            1.0.0            py38hfd86e86_1
     multipledispatch          0.6.0                    py38_0
     nbconvert                 5.6.1                    py38_0
     nbformat                  5.0.7                      py_0
     ncurses                   6.2                  he6710b0_1
     networkx                  2.4                        py_1
     nltk                      3.5                        py_0
     nomkl                     3.0                           0
     nose                      1.3.7                    py38_2
     notebook                  6.0.3                    py38_0
     numexpr                   2.7.1            py38h7ea95a0_0
     numpy                     1.18.5           py38h7130bb8_0
     numpy-base                1.18.5           py38h2f8d375_0
     numpydoc                  1.1.0                      py_0
     olefile                   0.46                       py_0
     openblas                  0.3.10                        0
     openblas-devel            0.3.10                        0
     openpyxl                  3.0.4                      py_0
     openssl                   1.1.1g               h7b6447c_0
     packaging                 20.4                       py_0
     pandas                    1.0.5            py38h0573a6f_0
     pandoc                    2.2.1                         0
     pandocfilters             1.4.2                    py38_1
     parso                     0.7.0                      py_0
     partd                     1.1.0                      py_0
     patchelf                  0.11                 he6710b0_0
     path                      13.1.0                   py38_0
     path.py                   12.4.0                        0
     pathlib2                  2.3.5                    py38_0
     patsy                     0.5.1                    py38_0
     pcre                      8.44                 he6710b0_0
     pep8                      1.7.1                    py38_0
     pexpect                   4.8.0                    py38_0
     pickleshare               0.7.5                    py38_0
     pillow                    7.2.0            py38haac5956_0
     pip                       20.1.1                   py38_1
     pixman                    0.40.0               h7b6447c_0
     pkginfo                   1.5.0.1                  py38_0
     pluggy                    0.13.1                   py38_0
     ply                       3.11                     py38_0
     prometheus_client         0.8.0                      py_0
     prompt-toolkit            3.0.5                      py_0
     prompt_toolkit            3.0.5                         0
     psutil                    5.7.0            py38h7b6447c_0
     ptyprocess                0.6.0                    py38_0
     py                        1.9.0                      py_0
     py-lief                   0.10.1           py38h403a769_0
     pycodestyle               2.6.0                      py_0
     pycosat                   0.6.3            py38h7b6447c_1
     pycparser                 2.20                       py_2
     pycurl                    7.43.0.5         py38h1ba5d50_0
     pyflakes                  2.2.0                      py_0
     pygments                  2.6.1                      py_0
     pylint                    2.5.3                    py38_0
     pyodbc                    4.0.30           py38he6710b0_0
     pyopenssl                 19.1.0                     py_1
     pyparsing                 2.4.7                      py_0
     pyrsistent                0.16.0           py38h7b6447c_0
     pysocks                   1.7.1                    py38_0
     pytables                  3.6.1            py38h9fd0a39_0
     pytest                    5.4.3                    py38_0
     python                    3.8.3                ha7b6439_2
     python-dateutil           2.8.1                      py_0
     python-libarchive-c       2.9                        py_0
     pytz                      2020.1                     py_0
     pywavelets                1.1.1            py38h7b6447c_0
     pyyaml                    5.3.1            py38h7b6447c_1
     pyzmq                     19.0.1           py38he6710b0_1
     readline                  8.0                  h7b6447c_0
     regex                     2020.6.8         py38h7b6447c_0
     requests                  2.24.0                     py_0
     ruamel_yaml               0.15.87          py38h7b6447c_1
     scikit-image              0.16.2           py38h0573a6f_0
     scikit-learn              0.23.1           py38h7ea95a0_0
     scipy                     1.5.0            py38habc2bb6_0
     seaborn                   0.10.1                     py_0
     send2trash                1.5.0                    py38_0
     setuptools                49.2.0                   py38_0
     simplegeneric             0.8.1                    py38_2
     singledispatch            3.4.0.3                  py38_0
     six                       1.15.0                     py_0
     snappy                    1.1.8                he6710b0_0
     snowballstemmer           2.0.0                      py_0
     sortedcollections         1.2.1                      py_0
     sortedcontainers          2.2.2                      py_0
     soupsieve                 2.0.1                      py_0
     sphinx                    3.1.2                      py_0
     sphinxcontrib             1.0                      py38_1
     sphinxcontrib-applehelp   1.0.2                      py_0
     sphinxcontrib-devhelp     1.0.2                      py_0
     sphinxcontrib-htmlhelp    1.0.3                      py_0
     sphinxcontrib-jsmath      1.0.1                      py_0
     sphinxcontrib-qthelp      1.0.3                      py_0
     sphinxcontrib-serializinghtml 1.1.4                      py_0
     sphinxcontrib-websupport  1.2.3                      py_0
     sqlalchemy                1.3.18           py38h7b6447c_0
     sqlite                    3.32.3               hbc83047_0
     statsmodels               0.11.1           py38h7b6447c_0
     sympy                     1.6.1                    py38_0
     tblib                     1.6.0                      py_0
     terminado                 0.8.3                    py38_0
     testpath                  0.4.4                      py_0
     threadpoolctl             2.1.0              pyh5ca1d4c_0
     tk                        8.6.10               hbc83047_0
     toml                      0.10.1                     py_0
     toolz                     0.10.0                     py_0
     tornado                   6.0.4            py38h7b6447c_1
     tqdm                      4.47.0                     py_0
     traitlets                 4.3.3                    py38_0
     typing_extensions         3.7.4.2                    py_0
     unicodecsv                0.14.1                   py38_0
     unixodbc                  2.3.7                h2c717c6_0
     urllib3                   1.25.9                     py_0
     wcwidth                   0.2.5                      py_0
     webencodings              0.5.1                    py38_1
     werkzeug                  1.0.1                      py_0
     wheel                     0.34.2                   py38_0
     widgetsnbextension        3.5.1                    py38_0
     wrapt                     1.11.2           py38h7b6447c_0
     xlrd                      1.2.0                      py_0
     xlsxwriter                1.2.9                      py_0
     xlwt                      1.3.0                    py38_0
     xz                        5.2.5                h7b6447c_0
     yaml                      0.2.5                h7b6447c_0
     zeromq                    4.3.2                he6710b0_2
     zict                      2.0.0                      py_0
     zipp                      3.1.0                      py_0
     zlib                      1.2.11               h7b6447c_3
     zope                      1.0                      py38_1
     zope.event                4.4                      py38_0
     zope.interface            4.7.1            py38h7b6447c_0
     zstd                      1.4.5                h0b5b093_0

jakirkham commented 3 years ago

Thanks for writing this up! Could you please share the code used as well? 🙂

mikacashman commented 3 years ago

Adding code:

The code is specific to my HPC setup. It consists of (1) a bsub run script, which sets up the dask-scheduler before calling (2) the Python code. It uses a conda setup personal to my system, guarded behind a switch (ISMIKA). I can add the input files used for building this issue report if needed/desired. The job launches from a batch node and the Dask-based code runs on six 32GB V100 GPU nodes.

The Python code is designed to run cuML and/or scikit-learn models, selected via command-line flags.
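
A hypothetical sketch of that flag handling (the actual `rfr_mnmg_V2.py` is not included in the thread), matching the `-in`, `--cuml`, and `--skilearn` switches the job script passes:

```python
import argparse

def parse_args(argv=None):
    """Parse the switches the job script passes to rfr_mnmg_V2.py."""
    p = argparse.ArgumentParser(description="MNMG random forest regression")
    p.add_argument("-in", dest="infile", required=True,
                   help="input TSV file")
    p.add_argument("--cuml", action="store_true",
                   help="run the cuML (GPU) model")
    p.add_argument("--skilearn", action="store_true",
                   help="run the scikit-learn (CPU) model")
    return p.parse_args(argv)
```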

(1) bsub job script

#!/usr/bin/env bash

#BSUB -P SYB106
#BSUB -W 1:30
#BSUB -alloc_flags "gpumps smt4"
#BSUB -nnodes 1
#BSUB -J rfr-mnmg-long-hm-run1
#BSUB -o rfr-mnmg-long-hm-run1.%J.out
#BSUB -q batch-hm

## FILES
inpath="/gpfs/alpine/syb105/world-shared/mcashman/RAPIDS-MNMG/input_data"
infile="long.tsv"
#Options: small.tsv, long.tsv

## Required conda setup for Mikaela
ISMIKA=true
if [ "$ISMIKA" = true ] ; then  # compare the string; a bare [ $ISMIKA ] is true for any non-empty value, even "false"
    # Use for conda env issues, reload conda
    source /gpfs/alpine/syb105/proj-shared/Personal/mcashman/scripts/conda_summit.sh
    ## Clean env
    conda deactivate
    module purge
fi

## Setup
module load gcc/7.4.0
module load python/3.7.0-anaconda3-5.3.0
module load cuda/10.1.243
source activate /gpfs/alpine/world-shared/stf011/nvrapids_0.14_gcc_7.4.0

## Dask setup
export PATH=/gpfs/alpine/world-shared/stf011/nvrapids_0.14_gcc_7.4.0/bin:$PATH

#dask workers
WORKERS_PER_NODE=6 #6
#should be equal to -nnodes
NODES=1
WORKERS=$(($WORKERS_PER_NODE*$NODES))
echo WORKERS=$WORKERS

#set your project id
PROJ_ID=syb106
dask_dir=$MEMBERWORK/$PROJ_ID/dask

if [ ! -d "$dask_dir" ]
then
    mkdir $dask_dir
fi

export CUPY_CACHE_DIR=$dask_dir
export OMP_PROC_BIND=FALSE

# clean previous contents
rm -fr ${dask_dir}/*

# Several dask schedulers could run in the same batch node by different users,
# create a random port to reduce port collisions
PORT_SCHED=$(shuf -i 4000-6000 -n 1)
PORT_DASH=$(shuf -i 7000-8999 -n 1)

# save the ports for reuse when launching jupyter lab
echo $PORT_SCHED >> ${dask_dir}/port_sched
echo $PORT_DASH  >> ${dask_dir}/port_dash

HOSTNAME=$(hostname)
IP_ADDRESS=$(hostname -I | awk '{print $2}')
echo 'Running scheduler in'
echo $IP_ADDRESS:$PORT_SCHED
echo
echo 'Running dashboard in'
echo $IP_ADDRESS:$PORT_DASH

dask-scheduler --port ${PORT_SCHED}  --dashboard-address ${PORT_DASH} --interface ib0  --scheduler-file ${dask_dir}/my-scheduler.json &

echo 'BENCHMARK (min sleep)'
sleep 60
echo '...awake'

echo
echo 'Running worker(s) in: '
jsrun -n 1 -c 1 hostname

##=HM (30GB device mem lim)
jsrun -c 1 -g 1 -n ${WORKERS} -r 6 -a 1 --bind rs --smpiargs="off" dask-cuda-worker --scheduler-file ${dask_dir}/my-scheduler.json --local-directory ${dask_dir} --nthreads 1 --memory-limit 85GB --device-memory-limit 30GB  --death-timeout 180 --interface ib0 --enable-nvlink &

#echo $hostname
echo 'BENCHMARK (min sleep)'
sleep 60
echo '...awake'

cd /gpfs/alpine/syb105/proj-shared/Personal/mcashman/Projects/RAPIDS
echo 'BENCHMARK'
#jsrun -c 1 -g 1 -n ${WORKERS} -r 6 -a 1 --smpiargs="none"
python rfr_mnmg_V2.py -in $inpath/$infile --cuml #--skilearn
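As an aside, the fixed `sleep 60` between launching the scheduler and starting the workers could be made more robust by polling for the scheduler file instead. A minimal sketch (the helper name, timeout, and invocation are illustrative, not part of the original script):

```python
import os
import sys
import time

def wait_for_file(path, timeout=120.0, poll=1.0):
    """Block until `path` exists and is non-empty, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path) and os.path.getsize(path) > 0:
            return path
        time.sleep(poll)
    raise TimeoutError(f"timed out waiting for {path}")

if __name__ == "__main__":
    # e.g. from the job script: python wait_for_scheduler.py ${dask_dir}/my-scheduler.json
    wait_for_file(sys.argv[1])
```

Called from the job script in place of the first `sleep 60`, this returns as soon as `dask-scheduler` has written its scheduler file, and fails loudly if the scheduler never came up.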

(2) python code

import os
import numpy as np
import time
import sklearn

import pandas as pd
import cudf
import cuml
import cupy

from sklearn.metrics import accuracy_score
from sklearn import model_selection, datasets

from cuml.dask.common import utils as dask_utils
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import dask_cudf

from cuml.dask.ensemble import RandomForestClassifier as cumlDaskRFC
from cuml.dask.ensemble import RandomForestRegressor as cumlDaskRFR
from sklearn.ensemble import RandomForestClassifier as sklRFC
from sklearn.ensemble import RandomForestRegressor as sklRFR

def main():
    ## Setup arguments
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('-in', action="store", dest="inpath",
        type=str, required=True, help="path of input file")
    parser.add_argument('--skilearn', action='store_true', dest="runSkl",
        required=False, help="Flag for running the scikit-learn model")
    parser.add_argument('--cuml', action='store_true', dest="runCuml",
        required=False, help="Flag for running the cuML model")
    args=parser.parse_args()
    if not (args.runSkl or args.runCuml):
        print("Invalid argument, please select a model to run\n--skilearn and/or --cuml\n")
        exit(1)

    ## Load data
    print("Starting input read...",flush=True)
    stime=time.time()
    data_type = np.float32
    try:
        with open(args.inpath, 'r') as f:
            test_data = cudf.read_csv(f,sep='\t')
            #test_data = np.loadtxt(f,dtype=float,skiprows=1)
    except EnvironmentError: # parent of IOError, OSError *and* WindowsError where available
        print(f'ERROR: can not open input file\n\t{args.inpath}')
        exit(1)
    print(f'Time to load data: {time.time() - stime}',flush=True)

    ## Sample and check input data
    print(f'Type of test_data: {type(test_data)}')
    print(f'----Sample of input data:\n{test_data.iloc[0:4,0:4]}\n',flush=True)

    ## Split data into X and y
    test_data_f32=test_data.astype('float32')
    X=test_data_f32.iloc[0:,0:-1]
    y=test_data_f32.pheno0

    ## Test prints (optional)
    print(f'X: {type(X)}  shape: {X.shape}')
    print(f'y: {type(y)}  shape: {y.shape}')
    print(f'dtypes\nX: {X.dtypes}\ny: {y.dtype}\n',flush=True)

    # Random Forest building parameters
    max_depth = 20
    n_bins = 8 
    n_trees = 1000 

    ## Split train-test
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X,
                                                            y, test_size=0.2)

    ## IF SKLEARN
    if(args.runSkl):
        #Convert data to pandas
        stime=time.time()
        X_train_pd =X_train.to_pandas()
        X_test_pd  =X_test.to_pandas()
        y_train_pd =y_train.to_pandas()
        y_test_pd  =y_test.to_pandas()
        print(f'sklearn time to convert: {time.time()-stime}',flush=True)

        # Use all available CPU cores
        stime=time.time()
        skl_model = sklRFR(max_depth=max_depth, n_estimators=n_trees, n_jobs=-1)
        skl_model.fit(X_train_pd, y_train_pd)
        print(f'sklearn fit time: {time.time()-stime}',flush=True)

        # Predict
        stime=time.time()
        skl_y_pred = skl_model.predict(X_test_pd)
        print(f'sklearn predict time: {time.time()-stime}',flush=True)

    ## IF CUML
    if(args.runCuml):
        # Partition with Dask
        stime=time.time()
        n_partitions = n_workers
        print(f'number of paritions = number of workers = {n_partitions}',flush=True)
        # In this case, each worker will train on 1/n_partitions fraction of the data
        X_train_dask = dask_cudf.from_cudf(X_train, npartitions=n_partitions)
        y_train_dask = dask_cudf.from_cudf(y_train, npartitions=n_partitions)

        # Attempt to fix chunks error - Mika
        X_test_dask  = dask_cudf.from_cudf(X_test, npartitions=n_partitions)

        # Persist to cache the data in active memory
        X_train_dask, y_train_dask = dask_utils.persist_across_workers(client,
                                        [X_train_dask, y_train_dask], workers=workers)

        print(f'cuml setup time {time.time()-stime}',flush=True)

        # Build model
        stime = time.time()
        cuml_model = cumlDaskRFR(max_depth=max_depth, n_estimators=n_trees,
                              n_streams=n_streams,n_bins=n_bins)
        cuml_model.fit(X_train_dask, y_train_dask)
        wait(cuml_model.rfs) # Allow asynchronous training tasks to finish
        print(f'cuml fit time {time.time()-stime}',flush=True)

        # Predict
        stime=time.time()
        cuml_y_pred = cuml_model.predict(X_test_dask)
        print(f'cuml predict time {time.time()-stime}',flush=True)

    print("===== Accuracy Metrics =====",flush=True)
    # Due to randomness in the algorithm, you may see slight variation in accuracies
    if(args.runSkl):
        print("-----SKLearn")
        print(f'y_test_pd: {type(y_test_pd)}  OF  {y_test_pd.dtype}')
        print(f'skl_y_pred: {type(skl_y_pred)}  OF  {skl_y_pred.dtype}')
        print("SKLearn MSE:  ", sklearn.metrics.mean_squared_error(y_test_pd, skl_y_pred))
        print("SKLearn r2:  ", sklearn.metrics.r2_score(y_test_pd, skl_y_pred))

    if(args.runCuml):
        print("-----CuML")
        print(f'y_test: {type(y_test)}  OF  {y_test.dtype}')
        print(f'cuml_y_pred: {type(cuml_y_pred)}  OF  {cuml_y_pred.dtype}')
        print("CuML MSE:     ", cuml.metrics.regression.mean_squared_error(y_test, cuml_y_pred.compute()))
        print("CuML r2:   ", cuml.metrics.regression.r2_score(y_test, cuml_y_pred.compute()))
        #print(f'Type of cuml_y_pred.compute: {type(cuml_y_pred.compute())}')
        #,convert_dtype=True))

    print("DONE",flush=True)

if __name__ == '__main__':
    print ("Internal benchmark", flush=True)
    file = os.getenv('MEMBERWORK') + '/syb106/dask/my-scheduler.json'
    client = Client(scheduler_file=file)
    print ("Client information: ", client, flush=True)

    # Query the client for all connected workers
    workers = client.has_what().keys()
    print(f'workers: {workers}')
    n_workers = len(workers)
    print(f'n_workers: {n_workers}')
    n_streams = 8 # Performance optimization

    main()

    client.shutdown()
miroenev commented 3 years ago

Hey @mikacashman, given the amount of ORNL-specific setup, it's a bit tricky to reproduce your workload.

My guess is that your first error (no Dask workers) occurs intermittently because of collisions on the scheduler port -- i.e., whenever the shuffle lands on a port in the 4k-6k range that is already in use (possibly by other Dask jobs). One way to confirm would be to run a series of tests with hard-coded ports that are known to be unused.

Error #2 may have a similar cause, though again it's hard to say definitively without trying your exact setup.

Lastly, #3 is a missed heartbeat due to a closed connection, which is fairly benign.
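One standard way to avoid the port-collision problem entirely (an illustrative sketch, not something suggested in the thread) is to bind to port 0 and let the kernel pick a free ephemeral port:

```python
import socket

def find_free_port() -> int:
    """Ask the kernel for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))          # port 0: the OS assigns a free ephemeral port
        return s.getsockname()[1]

if __name__ == "__main__":
    # The result could replace shuf -i 4000-6000 and be passed to dask-scheduler --port
    print(find_free_port())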

mikacashman commented 3 years ago

I am going to close this. We have been able to mostly eliminate two of the errors, and the still-prevalent error (3) remains benign. The key seems to be avoiding running more than one job at a time on the same batch node (or alongside any other Dask jobs, since the batch nodes are shared). There is also a notable lag between job death and all of the Dask scheduler processes being cleaned up, so a longer wait between jobs has proven beneficial. Thanks for the comments.
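The zero-worker failure in Error #1 can also be guarded against on the client side by blocking until the expected number of workers registers, rather than relying on a fixed sleep. A minimal polling sketch (the helper and its parameters are illustrative; newer `distributed` releases also provide `Client.wait_for_workers` for this purpose):

```python
import time

def wait_for_workers(count_fn, expected, timeout=120.0, poll=2.0):
    """Poll count_fn() until it reports at least `expected` workers, else raise."""
    deadline = time.monotonic() + timeout
    n = count_fn()
    while n < expected:
        if time.monotonic() > deadline:
            raise TimeoutError(f"only {n} of {expected} workers registered")
        time.sleep(poll)
        n = count_fn()
    return n

# In the script above, count_fn would be: lambda: len(client.has_what())
```

Calling this right after `Client(scheduler_file=file)` would turn the confusing `ZeroDivisionError` in `from_cudf` into an explicit, early timeout error.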