Closed mikacashman closed 3 years ago
Thanks for writing this up! Could you please share the code used as well? 🙂
Adding code:
The code is specific to my HPC setup. It consists of (1) a bsub run script that sets up the dask-scheduler before calling (2) the Python code. It relies on a conda environment, so the parts personal to my account sit behind a switch (ISMIKA). I can add the input files used for building this issue report if needed/desired. The job launches from a batch node and the dask-based code runs on six 32 GB V100 GPUs.
The Python code runs the cuML and/or scikit-learn models, selected via command-line flags.
(1) bsub job script
#!/usr/bin/env bash
#BSUB -P SYB106
#BSUB -W 1:30
#BSUB -alloc_flags "gpumps smt4"
#BSUB -nnodes 1
#BSUB -J rfr-mnmg-long-hm-run1
#BSUB -o rfr-mnmg-long-hm-run1.%J.out
#BSUB -q batch-hm
## FILES
inpath="/gpfs/alpine/syb105/world-shared/mcashman/RAPIDS-MNMG/input_data"
infile="long.tsv"
#Options: small.tsv, long.tsv
## Required conda setup for Mikaela
ISMIKA=true
if [ "$ISMIKA" = true ] ; then
# Use for conda env issues, reload conda
source /gpfs/alpine/syb105/proj-shared/Personal/mcashman/scripts/conda_summit.sh
## Clean env
conda deactivate
module purge
fi
## Setup
module load gcc/7.4.0
module load python/3.7.0-anaconda3-5.3.0
module load cuda/10.1.243
source activate /gpfs/alpine/world-shared/stf011/nvrapids_0.14_gcc_7.4.0
## Dask setup
export PATH=/gpfs/alpine/world-shared/stf011/nvrapids_0.14_gcc_7.4.0/bin:$PATH
#dask workers
WORKERS_PER_NODE=6 #6
#should be equal to -nnodes
NODES=1
WORKERS=$(($WORKERS_PER_NODE*$NODES))
echo WORKERS=$WORKERS
#set your project id
PROJ_ID=syb106
dask_dir=$MEMBERWORK/$PROJ_ID/dask
if [ ! -d "$dask_dir" ]
then
mkdir $dask_dir
fi
export CUPY_CACHE_DIR=$dask_dir
export OMP_PROC_BIND=FALSE
# clean previous contents
rm -fr ${dask_dir}/*
# Several dask schedulers could run in the same batch node by different users,
# create a random port to reduce port collisions
PORT_SCHED=$(shuf -i 4000-6000 -n 1)
PORT_DASH=$(shuf -i 7000-8999 -n 1)
# saving ports to use them if launching jupyter lab
echo $PORT_SCHED >> ${dask_dir}/port_sched
echo $PORT_DASH >> ${dask_dir}/port_dash
HOSTNAME=$(hostname)
IP_ADDRESS=$(hostname -I | awk '{print $2}')
echo 'Running scheduler in'
echo $IP_ADDRESS:$PORT_SCHED
echo
echo 'Running dashboard in'
echo $IP_ADDRESS:$PORT_DASH
dask-scheduler --port ${PORT_SCHED} --dashboard-address ${PORT_DASH} --interface ib0 --scheduler-file ${dask_dir}/my-scheduler.json &
echo 'BENCHMARK (min sleep)'
sleep 60
echo '...awake'
echo
echo 'Running worker(s) in: '
jsrun -n 1 -c 1 hostname
##=HM (30GB device mem lim)
jsrun -c 1 -g 1 -n ${WORKERS} -r 6 -a 1 --bind rs --smpiargs="off" dask-cuda-worker --scheduler-file ${dask_dir}/my-scheduler.json --local-directory ${dask_dir} --nthreads 1 --memory-limit 85GB --device-memory-limit 30GB --death-timeout 180 --interface ib0 --enable-nvlink &
#echo $hostname
echo 'BENCHMARK (min sleep)'
sleep 60
echo '...awake'
cd /gpfs/alpine/syb105/proj-shared/Personal/mcashman/Projects/RAPIDS
echo 'BENCHMARK'
#jsrun -c 1 -g 1 -n ${WORKERS} -r 6 -a 1 --smpiargs="none"
python rfr_mnmg_V2.py -in $inpath/$infile --cuml #--skilearn
(2) python code
import os
import numpy as np
import time
import sklearn
import pandas as pd
import cudf
import cuml
import cupy
from sklearn.metrics import accuracy_score
from sklearn import model_selection, datasets
from cuml.dask.common import utils as dask_utils
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import dask_cudf
from cuml.dask.ensemble import RandomForestClassifier as cumlDaskRFC
from cuml.dask.ensemble import RandomForestRegressor as cumlDaskRFR
from sklearn.ensemble import RandomForestClassifier as sklRFC
from sklearn.ensemble import RandomForestRegressor as sklRFR
def main():
    ## Setup arguments
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('-in', action="store", dest="inpath",
                        type=str, required=True, help="path of input file")
    parser.add_argument('--skilearn', action='store_true', dest="runSkl",
                        required=False, help="Flag for running the scikit-learn model")
    parser.add_argument('--cuml', action='store_true', dest="runCuml",
                        required=False, help="Flag for running the cuML model")
    args = parser.parse_args()
    if not (args.runSkl or args.runCuml):
        print("Invalid argument, please select a model to run\n--skilearn and/or --cuml\n")
        exit(1)
    ## Load data
    print("Starting input read...", flush=True)
    stime = time.time()
    data_type = np.float32
    try:
        with open(args.inpath, 'r') as f:
            test_data = cudf.read_csv(f, sep='\t')
            #test_data = np.loadtxt(f, dtype=float, skiprows=1)
    except EnvironmentError:  # parent of IOError, OSError *and* WindowsError where available
        print(f'ERROR: can not open input file\n\t{args.inpath}')
        exit(1)
    print(f'Time to load data: {time.time() - stime}', flush=True)
    ## Sample and check input data
    print(f'Type of test_data: {type(test_data)}')
    print(f'----Sample of input data:\n{test_data.iloc[0:4,0:4]}\n', flush=True)
    ## Split data into X and y
    test_data_f32 = test_data.astype('float32')
    X = test_data_f32.iloc[0:, 0:-1]
    y = test_data_f32.pheno0
    ## Test prints (optional)
    print(f'X: {type(X)} shape: {X.shape}')
    print(f'y: {type(y)} shape: {y.shape}')
    print(f'dtypes\nX: {X.dtypes}\ny: {y.dtype}\n', flush=True)
    # Random Forest building parameters
    max_depth = 20
    n_bins = 8
    n_trees = 1000
    ## Split train-test
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
    ## IF SKLEARN
    if args.runSkl:
        # Convert data to pandas
        stime = time.time()
        X_train_pd = X_train.to_pandas()
        X_test_pd = X_test.to_pandas()
        y_train_pd = y_train.to_pandas()
        y_test_pd = y_test.to_pandas()
        print(f'sklearn time to convert: {time.time()-stime}', flush=True)
        # Use all available CPU cores
        stime = time.time()
        skl_model = sklRFR(max_depth=max_depth, n_estimators=n_trees, n_jobs=-1)
        skl_model.fit(X_train_pd, y_train_pd)
        print(f'sklearn fit time: {time.time()-stime}', flush=True)
        # Predict
        stime = time.time()
        skl_y_pred = skl_model.predict(X_test_pd)
        print(f'sklearn predict time: {time.time()-stime}', flush=True)
    ## IF CUML
    if args.runCuml:
        # Partition with Dask
        stime = time.time()
        n_partitions = n_workers  # n_workers is set at module scope in the __main__ block below
        print(f'number of partitions = number of workers = {n_partitions}', flush=True)
        # In this case, each worker will train on a 1/n_partitions fraction of the data
        X_train_dask = dask_cudf.from_cudf(X_train, npartitions=n_partitions)
        y_train_dask = dask_cudf.from_cudf(y_train, npartitions=n_partitions)
        # Attempt to fix chunks error - Mika
        X_test_dask = dask_cudf.from_cudf(X_test, npartitions=n_partitions)
        # Persist to cache the data in active memory
        X_train_dask, y_train_dask = dask_utils.persist_across_workers(
            client, [X_train_dask, y_train_dask], workers=workers)
        print(f'cuml setup time {time.time()-stime}', flush=True)
        # Build model
        stime = time.time()
        cuml_model = cumlDaskRFR(max_depth=max_depth, n_estimators=n_trees,
                                 n_streams=n_streams, n_bins=n_bins)
        cuml_model.fit(X_train_dask, y_train_dask)
        wait(cuml_model.rfs)  # Allow asynchronous training tasks to finish
        print(f'cuml fit time {time.time()-stime}', flush=True)
        # Predict
        stime = time.time()
        cuml_y_pred = cuml_model.predict(X_test_dask)
        print(f'cuml predict time {time.time()-stime}', flush=True)
    print("===== Accuracy Metrics =====", flush=True)
    # Due to randomness in the algorithm, you may see slight variation in accuracies
    if args.runSkl:
        print("-----SKLearn")
        print(f'y_test_pd: {type(y_test_pd)} OF {y_test_pd.dtype}')
        print(f'skl_y_pred: {type(skl_y_pred)} OF {skl_y_pred.dtype}')
        print("SKLearn MSE: ", sklearn.metrics.mean_squared_error(y_test_pd, skl_y_pred))
        print("SKLearn r2: ", sklearn.metrics.r2_score(y_test_pd, skl_y_pred))
    if args.runCuml:
        print("-----CuML")
        print(f'y_test: {type(y_test)} OF {y_test.dtype}')
        print(f'cuml_y_pred: {type(cuml_y_pred)} OF {cuml_y_pred.dtype}')
        print("CuML MSE: ", cuml.metrics.regression.mean_squared_error(y_test, cuml_y_pred.compute()))
        print("CuML r2: ", cuml.metrics.regression.r2_score(y_test, cuml_y_pred.compute()))
        #print(f'Type of cuml_y_pred.compute: {type(cuml_y_pred.compute())}')
        #,convert_dtype=True))
    print("DONE", flush=True)
if __name__ == '__main__':
    print("Internal benchmark", flush=True)
    file = os.getenv('MEMBERWORK') + '/syb106/dask/my-scheduler.json'
    client = Client(scheduler_file=file)
    print("Client information: ", client, flush=True)
    # Query the client for all connected workers
    workers = client.has_what().keys()
    print(f'workers: {workers}')
    n_workers = len(workers)
    print(f'n_workers: {n_workers}')
    n_streams = 8  # Performance optimization
    main()
    client.shutdown()
Hey @mikacashman, given the amount of ORNL-specific setup, it's a bit tricky to reproduce your workload.
My guess is that your first error (no dask workers) shows up intermittently because of collisions on the scheduler port -- i.e., whenever the shuffled port in the 4000-6000 range is already in use (possibly by other dask schedulers or workers on the shared batch node). One way to confirm would be to run a series of tests with hard-coded ports that are known to be unused.
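For example, a minimal sketch of probing for a free port before launching the scheduler (untested on Summit; it assumes bash's /dev/tcp pseudo-device is available, and pick_free_port is only an illustrative helper name):
# Sketch only: keep the first shuffled port that nothing on this host is listening on.
# Probes via localhost; adjust the host if a listener is bound only to the ib0 address.
pick_free_port () {
  for candidate in $(shuf -i 4000-6000); do
    if ! (exec 3<>"/dev/tcp/localhost/${candidate}") 2>/dev/null; then
      echo "${candidate}"
      return 0
    fi
  done
  return 1
}
PORT_SCHED=$(pick_free_port)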
Error #2 may have a similar cause to the one above, though again it's hard to say definitively without trying your exact setup.
Lastly, #3 is a missed heartbeat due to a closed connection, which is fairly benign.
I am going to close this. We have been able to mostly eliminate two of the errors, and the still-prevalent error (3) remains benign. The key seems to be being careful not to run more than one of these jobs at a time on the same batch node (or alongside any other dask jobs, since the batch nodes are shared). There is also a notable lag between a job dying and all of its dask scheduler state being cleaned up, so a longer wait time between jobs has proved beneficial. Thanks for the comments.
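One small, untested addition that might shorten that lag: an explicit cleanup trap in the job script, so the backgrounded dask-scheduler and dask-cuda-worker processes and the scheduler file do not outlive the job. A minimal sketch, reusing the script's existing dask_dir variable:
# Sketch only: kill the backgrounded dask processes and remove the scheduler file
# when the job script exits (or is terminated), instead of relying on the next
# run's rm -fr to find a clean slate.
cleanup () {
  kill $(jobs -p) 2>/dev/null
  rm -f "${dask_dir}/my-scheduler.json"
}
trap cleanup EXIT
trap 'cleanup; exit 1' TERM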
Hi there, I have been working with NVIDIA RAPIDS support through ORNL on Slack and was advised to post my issues here.
I have been observing some flaky behavior when running an MNMG version of the RF regression code I am working on (using dask). I have made multiple back-to-back runs on two different data sets (at least 5 runs on each) and recorded the behavior. Below are reports on three such flaky errors I frequently encounter (some fatal to functionality, some not). The non-fatal ones do not always kill the job, which makes them difficult to manage.
Error#1 - 0 workers (sometimes I end up with 0 workers when I should have 6)
Error#2 - Address in use (similar trace repeated for 5/6 workers in this example)
Error#3 - CommClosedError (this one doesn't appear fatal)
Some run information in addition to the environment details at the end: this is being run on the Summit supercomputer at ORNL (PowerPC, V100 GPUs). My run script starts a dask scheduler plus workers on 6 GPUs. I have a 60-second sleep after running dask-scheduler and another 60-second sleep after the jsrun command, before running the RAPIDS python script. I can provide further information or code if needed.
I have two further questions (feel free to direct me to open separate issues if desired).
Question#1: Is it possible to run multiple of these dask-based jobs in parallel? When I have tried, I end up with more than 6 workers being reported (running 5 jobs at the same time leads to 30 workers, i.e. 5 jobs * 6 workers) and then errors. Is there any way to run several of these dask-scheduler-based jobs at once? I have tried adding a unique identifier to the dask scheduler file and using a unique directory for each job's dask files, but neither method worked for me.
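A minimal sketch of one way to keep concurrent jobs separated (untested; $LSB_JOBID is set by LSF for each bsub job, and DASK_SCHED_FILE is only an illustrative variable name). Note that rfr_mnmg_V2.py currently hard-codes the scheduler path, so the Python side would also need to read the same per-job path for this to work:
# Sketch only: key the dask directory and scheduler file on the LSF job id so that
# jobs sharing a batch node never see each other's scheduler or workers.
dask_dir=$MEMBERWORK/$PROJ_ID/dask/$LSB_JOBID
mkdir -p "$dask_dir"
export DASK_SCHED_FILE="${dask_dir}/my-scheduler.json"   # illustrative name, not a dask setting
# pass "$DASK_SCHED_FILE" to both dask-scheduler and dask-cuda-worker via --scheduler-file,
# and have the python code read os.environ["DASK_SCHED_FILE"] instead of its hard-coded path.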
Question#2: Is there a way to redirect the dask-related output to a different file than standard out? The dask-scheduler reports a lot of information (which I need to keep for errors such as these), but my bash and python prints get buried quickly. Of course I could print to a specific file from bash/python, but I would rather simply redirect the dask output if there is a known way to do that.
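One simple option is plain shell redirection of the backgrounded commands in the job script, so the scheduler and worker chatter lands in its own log file instead of the job's stdout. A minimal, untested sketch based on the dask-scheduler line above (the dask-cuda-worker jsrun line can be redirected the same way, e.g. to ${dask_dir}/workers.log):
# Sketch only: keep dask's logging out of the batch job's stdout.
dask-scheduler --port ${PORT_SCHED} --dashboard-address ${PORT_DASH} --interface ib0 \
  --scheduler-file ${dask_dir}/my-scheduler.json > ${dask_dir}/scheduler.log 2>&1 &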
Thanks for any guidance in advance.