rapidsai / dask-cuda

Utilities for Dask and CUDA interactions
https://docs.rapids.ai/api/dask-cuda/stable/
Apache License 2.0

Query on LocalCUDACluster usage #74

Closed pradghos closed 5 years ago

pradghos commented 5 years ago

Hi,

I want to create a local CUDA Dask cluster using LocalCUDACluster.

The Python script is shown below -

$ cat test_cluster.py
import os

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(scheduler_port=12347,n_workers=2, threads_per_worker=1)

print("cluster status ",cluster.status)
print("cluster infomarion ", cluster)
client = Client(cluster)

print("client information ",client)

$

When I use the interactive Python prompt, it works for me.

$ python
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:34:02)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>>
>>> from dask.distributed import Client
>>> from dask_cuda import LocalCUDACluster

>>>
>>> cluster = LocalCUDACluster(scheduler_port=12347,n_workers=2, threads_per_worker=1)
>>> print("cluster status ",cluster.status)
cluster status  running
>>> print("cluster infomarion ", cluster)
cluster infomarion  LocalCUDACluster('tcp://127.0.0.1:12347', workers=2, ncores=2)
>>> client = Client(cluster)
>>> print("client information ",client)
client information  <Client: scheduler='tcp://127.0.0.1:12347' processes=2 cores=2>
>>>

However, when I run it as a script using python test_cluster.py, it fails -

$ python test_cluster.py
cluster status  running
cluster infomarion  LocalCUDACluster('tcp://127.0.0.1:12347', workers=0, ncores=0)
client information  <Client: scheduler='tcp://127.0.0.1:12347' processes=0 cores=0>
Traceback (most recent call last):
  File "/home/pradghos/anaconda3/lib/python3.6/multiprocessing/forkserver.py", line 196, in main
    _serve_one(s, listener, alive_r, old_handlers)
  File "/home/pradghos/anaconda3/lib/python3.6/multiprocessing/forkserver.py", line 231, in _serve_one
    code = spawn._main(child_r)
  File "/home/pradghos/anaconda3/lib/python3.6/multiprocessing/spawn.py", line 114, in _main
    prepare(preparation_data)
  File "/home/pradghos/anaconda3/lib/python3.6/multiprocessing/spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/pradghos/anaconda3/lib/python3.6/multiprocessing/spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "/home/pradghos/anaconda3/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/pradghos/anaconda3/lib/python3.6/runpy.py", line 96, in _run_module_code
Traceback (most recent call last):
  File "/home/pradghos/anaconda3/lib/python3.6/multiprocessing/forkserver.py", line 196, in main
    _serve_one(s, listener, alive_r, old_handlers)
  File "/home/pradghos/anaconda3/lib/python3.6/multiprocessing/forkserver.py", line 231, in _serve_one
    mod_name, mod_spec, pkg_name, script_name)
  ....
  ....
  File "/home/pradghos/anaconda3/lib/python3.6/site-packages/distributed/utils.py", line 316, in f
    self.listener.start()
  File "/home/pradghos/anaconda3/lib/python3.6/site-packages/distributed/comm/tcp.py", line 421, in start
    result[0] = yield future
    self.port, address=self.ip, backlog=backlog
  File "/home/pradghos/anaconda3/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
  File "/home/pradghos/anaconda3/lib/python3.6/site-packages/tornado/netutil.py", line 163, in bind_sockets
    value = future.result()
  File "/home/pradghos/anaconda3/lib/python3.6/site-packages/distributed/deploy/spec.py", line 158, in _start
    self.scheduler = await self.scheduler
  File "/home/pradghos/anaconda3/lib/python3.6/site-packages/distributed/scheduler.py", line 1239, in __await__
    sock.bind(sockaddr)
OSError: [Errno 98] Address already in use
    self.start()
  File "/home/pradghos/anaconda3/lib/python3.6/site-packages/distributed/scheduler.py", line 1200, in start
    self.listen(addr_or_port, listen_args=self.listen_args)
  File "home/pradghos/anaconda3/lib/python3.6/site-packages/distributed/core.py", line 322, in listen
    self.listener.start()
  File "/home/pradghos/anaconda3/lib/python3.6/site-packages/distributed/comm/tcp.py", line 421, in start
    self.port, address=self.ip, backlog=backlog
  File "/home/pradghos/anaconda3/lib/python3.6/site-packages/tornado/netutil.py", line 163, in bind_sockets
    sock.bind(sockaddr)
OSError: [Errno 98] Address already in use
distributed.nanny - WARNING - Worker process 24873 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 24874 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker

Any pointers on what I am missing? Thanks in advance!

pentschev commented 5 years ago

The spawning of workers is a recursive process, so you need to prevent the code from running itself again. You can do that by guarding the code with a check that the script is being run as the main module; the following modification should work:

import os

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == '__main__':
    cluster = LocalCUDACluster(scheduler_port=12347,n_workers=2, threads_per_worker=1)

    print("cluster status ",cluster.status)
    print("cluster infomarion ", cluster)
    client = Client(cluster)

    print("client information ",client)
pradghos commented 5 years ago

Thanks @pentschev for the input! I have observed that the workers come up only after a delay -

import time

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == '__main__':
    #cluster = LocalCUDACluster(scheduler_port=12347,n_workers=2, threads_per_worker=1)
    cluster = LocalCUDACluster()

    print("cluster status ",cluster.status)
    print("cluster information ", cluster)
    client = Client(cluster)

    print("client information ",client)

    time.sleep(2)  # ======> Added sleep(2)
    print("cluster status ",cluster.status)
    print("cluster information ", cluster)
    print("client information ",client)

Log :

(base) [builder@d065228d37d1 ~]$ python test_cluster.py
cluster status  running
cluster information  LocalCUDACluster('tcp://127.0.0.1:38327', workers=0, ncores=0)
client information  <Client: scheduler='tcp://127.0.0.1:38327' processes=0 cores=0>
cluster status  running  -----> after sleep(2)
cluster information  LocalCUDACluster('tcp://127.0.0.1:38327', workers=4, ncores=4)
client information  <Client: scheduler='tcp://127.0.0.1:38327' processes=4 cores=4>
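
As a side note, newer versions of distributed provide Client.wait_for_workers, which blocks until a given number of workers has registered with the scheduler. A minimal sketch of using it instead of a fixed sleep, assuming the method exists in the installed version and a 4-GPU machine:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == '__main__':
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # Block until 4 workers (one per GPU, an assumption for this machine)
    # have registered with the scheduler, instead of sleeping a fixed time.
    client.wait_for_workers(4)

    print("cluster information ", cluster)
    print("client information ", client)
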
pradghos commented 5 years ago

@pentschev: Another query is about distributing the workload across the workers and multiple GPUs -

Code snippet -

import numpy as np
from dask.delayed import delayed
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from pandas.util.testing import assert_frame_equal
import cudf as gd
import dask_cudf as dgd

import time

if __name__ == '__main__':
    cluster = LocalCUDACluster()
    client = Client(cluster)
    time.sleep(5)
    print("cluster status ",cluster.status)
    print("cluster infomarion ", cluster)
    print("client information ",client)

    # The number of GPUs and the number of workers are the same.

    nelem = 10000000

    df = gd.DataFrame()
    df["x"] = np.arange(nelem)
    df["y"] = np.random.randint(nelem, size=nelem)

    ddf = dgd.from_cudf(df, npartitions=5)

    delays = ddf.to_delayed()

    assert len(delays) == 5

    # Concat the delayed partitions
    got = gd.concat([d.compute() for d in delays])
    assert_frame_equal(got.to_pandas(), df.to_pandas())

However, when I try to profile the GPU activity for each GPU -

(base) [builder@d065228d37d1 ~]$ nvprof --print-summary-per-gpu --profile-child-processes python test_cluster2.py
==2368== NVPROF is profiling process 2368, command: python test_cluster2.py
cluster status  running
cluster infomarion  LocalCUDACluster('tcp://127.0.0.1:44105', workers=0, ncores=0)
.....
cluster status  running
cluster infomarion  LocalCUDACluster('tcp://127.0.0.1:44105', workers=4, ncores=4)
client information  <Client: scheduler='tcp://127.0.0.1:44105' processes=4 cores=4>
....
==2368== Profiling application: python test_cluster2.py
==2368== Profiling result:

==2368== Device "Tesla V100-SXM2-16GB (0)"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   41.71%  29.160ms        37  788.11us  1.2800us  7.0561ms  [CUDA memcpy HtoD]
                   37.41%  26.155ms        45  581.21us  1.6640us  6.0504ms  [CUDA memcpy DtoH]
                   13.01%  9.0988ms         1  9.0988ms  9.0988ms  9.0988ms  _ZN6thrust8cuda_cub4core13_kernel_agentINS0_12__merge_sort14BlockSortAgentIPiS5_lZ14multi_col_sortIiEvPKPvPKPhS5_PammbPT_bbP11CUstream_stEUliiE1_NS_6detail17integral_constantIbLb0EEESL_EEbS5_S5_lS5_S5_SI_EEvT0_T1_T2_T3_T4_T5_T6_
                    1.85%  1.2913ms        12  107.61us  103.46us  117.34us  _ZN6thrust8cuda_cub4core13_kernel_agentINS0_12__merge_sort10MergeAgentIPiS5_lZ14multi_col_sortIiEvPKPvPKPhS5_PammbPT_bbP11CUstream_stEUliiE1_NS_6detail17integral_constantIbLb0EEEEEbS5_S5_lS5_S5_SI_PllEEvT0_T1_T2_T3_T4_T5_T6_T7_T8_
                    1.83%  1.2762ms         5  255.23us  249.44us  266.33us  cudapy::cudf::utils::cudautils::gpu_gather$243(Array<__int64, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>)
                    0.99%  694.43us        15  46.295us  45.184us  50.656us  [CUDA memcpy DtoD]
                    0.87%  606.08us        16  37.879us  25.024us  218.94us  void kernel_v_v<char, long, long, Equal>(int, char*, long*, long*)
                    0.81%  567.96us         6  94.660us  94.463us  94.783us  cudapy::cudf::utils::cudautils::gpu_arange$241(__int64, __int64, __int64, Array<__int64, int=1, A, mutable, aligned>)
                    0.73%  513.47us        16  32.091us  25.984us  120.70us  void _GLOBAL__N__56_tmpxft_00002643_00000000_7_reductions_compute_70_cpp1_ii_c1104e96::gpu_reduction_op<cudf::detail::wrapper<char, gdf_dtype=7>, cudf::detail::wrapper<char, gdf_dtype=7>, cudf::DeviceMin, cudf::reductions::IdentityLoader>(char const *, unsigned char const *, int, gdf_dtype=7*, cudf::detail::wrapper<char, gdf_dtype=7>, unsigned char const *, cudf::detail::wrapper<char, gdf_dtype=7>)
                    0.50%  351.71us        12  29.309us  22.048us  36.288us  _ZN6thrust8cuda_cub4core13_kernel_agentINS0_12__merge_sort14PartitionAgentIPilZ14multi_col_sortIiEvPKPvPKPhS5_PammbPT_bbP11CUstream_stEUliiE1_EEbS5_S5_lmPlSI_liEEvT0_T1_T2_T3_T4_T5_T6_T7_T8_
                    0.22%  150.37us         1  150.37us  150.37us  150.37us  cudapy::cudf::utils::cudautils::gpu_copy$242(Array<__int64, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>)
                    0.07%  48.256us         1  48.256us  48.256us  48.256us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__tabulate::functor<int*, thrust::system::detail::generic::sequence_detail::sequence_functor<int>, long>, long>, thrust::cuda_cub::__tabulate::functor<int*, thrust::system::detail::generic::sequence_detail::sequence_functor<int>, long>, long>(int, thrust::system::detail::generic::sequence_detail::sequence_functor<int>)
                    0.00%  1.6640us         1  1.6640us  1.6640us  1.6640us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<void*>, void*>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<void*>, void*>, unsigned long>(thrust::device_ptr<void*>, void*)
                    0.00%  1.2480us         1  1.2480us  1.2480us  1.2480us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<unsigned char*>, unsigned char*>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<unsigned char*>, unsigned char*>, unsigned long>(thrust::device_ptr<unsigned char*>, unsigned char*)
                    0.00%  1.1200us         1  1.1200us  1.1200us  1.1200us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>, int>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>, int>, unsigned long>(thrust::device_ptr<int>, int)

==2368== Device "Tesla V100-SXM2-16GB (1)"
No kernels were profiled.

==2368== Device "Tesla V100-SXM2-16GB (2)"
No kernels were profiled.

==2368== Device "Tesla V100-SXM2-16GB (3)"
No kernels were profiled.

Sorry for the long output - I see only GPU (0) being used. How do I ensure the workload is spread across multiple GPUs, and is this the correct way to verify it? Thank you!

pentschev commented 5 years ago

Regarding the status information, this indeed looks like a bug that we had not yet noticed, thanks for reporting.

Your code for GPU distribution is correct, but there is another bug (this time known and discussed in #32) that is triggered when cudf (or dask_cudf, which internally also imports cudf) is imported before the creation of LocalCUDACluster. The simplest solution for your example is to move those two imports inside __main__, after Client(); that may suffice for your use case. However, it isn't clear whether this works for all possible pipelines, so another way to work around it for now is to use the dask-scheduler and dask-cuda-worker CLIs to start the scheduler and workers manually.
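
For illustration, a minimal sketch of that CLI workaround, assuming the scheduler runs on its default port 8786 and that 127.0.0.1 stands in for the scheduler's address (both are placeholders):

# On the scheduler node (shell):
#   dask-scheduler
# On each GPU node (shell):
#   dask-cuda-worker <scheduler-ip>:8786

from dask.distributed import Client

if __name__ == '__main__':
    # Connect to the externally started scheduler instead of creating a
    # LocalCUDACluster inside the script.
    client = Client("127.0.0.1:8786")

    # With the workers started via the CLI, cudf/dask_cudf can be imported
    # here without affecting how the worker processes select their GPUs.
    import cudf as gd
    import dask_cudf as dgd

    print("client information ", client)
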

mrocklin commented 5 years ago

Also please note that [df.compute() for df in dfs] is sequential. Perhaps you wanted dask.compute(*dfs)
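
A minimal sketch contrasting the two approaches, reusing the cudf/dask_cudf pattern from the snippet above and assuming the cluster and client are already running:

import dask
import numpy as np
import cudf as gd
import dask_cudf as dgd

nelem = 10000
df = gd.DataFrame()
df["x"] = np.arange(nelem)

ddf = dgd.from_cudf(df, npartitions=5)
delays = ddf.to_delayed()

# Sequential: each .compute() call blocks until that single partition
# finishes, so the partitions are processed one after another.
parts_sequential = [d.compute() for d in delays]

# Parallel: one dask.compute call submits all partitions at once, letting
# the workers process them concurrently.
parts_parallel = dask.compute(*delays)

got = gd.concat(parts_parallel)
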


pradghos commented 5 years ago

Thanks @pentschev @mrocklin for all the suggestions!

After the recommended modifications to the example snippet -

import dask
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == '__main__':
    #cluster = LocalCUDACluster(scheduler_port=12347,n_workers=2, threads_per_worker=1)
    cluster = LocalCUDACluster()
    print("cluster status ",cluster.status)
    print("cluster infomarion ", cluster)
    client = Client(cluster)

    # cudf/dask_cudf are imported only after the cluster and client are created
    import cudf as gd
    import dask_cudf as dgd

...
...
    got = gd.concat(dask.compute(*delays))

The nvidia-smi output shows that all 4 GPUs are being used -

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     74287      C   python                                       405MiB |
+-----------------------------------------------------------------------------+
Mon Jun 17 23:20:41 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.03    Driver Version: 418.40.03    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   44C    P0    67W / 300W |   1130MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   45C    P0    69W / 300W |    325MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   42C    P0    68W / 300W |    325MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   48C    P0    69W / 300W |    325MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     74287      C   python                                       805MiB |
|    0     74316      C   /opt/anaconda3/bin/python                    315MiB |
|    1     74314      C   /opt/anaconda3/bin/python                    315MiB |
|    2     74313      C   /opt/anaconda3/bin/python                    315MiB |
|    3     74315      C   /opt/anaconda3/bin/python                    315MiB |
+-----------------------------------------------------------------------------+

But when I use nvprof --print-summary-per-gpu --profile-child-processes python test_cluster2.py, the result is the same as before -


==5018== Device "Tesla V100-SXM2-16GB (0)"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   42.34%  41.962ms       183  229.30us  1.4710us  4.2174ms  [CUDA memcpy DtoH]
                   41.68%  41.305ms       101  408.96us  1.0240us  6.8770ms  [CUDA memcpy HtoD]
                    9.50%  9.4118ms         1  9.4118ms  9.4118ms  9.4118ms  _ZN6thrust8cuda_cub4core13_kernel_agentINS0_12__merge_sort14BlockSortAgentIPiS5_lZ14multi_col_sortIiEvPKPvPKPhS5_PammbPT_bbP11CUstream_stEUliiE1_NS_6detail17integral_constantIbLb0EEESL_EEbS5_S5_lS5_S5_SI_EEvT0_T1_T2_T3_T4_T5_T6_
                    1.31%  1.2946ms        12  107.88us  102.75us  122.37us  _ZN6thrust8cuda_cub4core13_kernel_agentINS0_12__merge_sort10MergeAgentIPiS5_lZ14multi_col_sortIiEvPKPvPKPhS5_PammbPT_bbP11CUstream_stEUliiE1_NS_6detail17integral_constantIbLb0EEEEEbS5_S5_lS5_S5_SI_PllEEvT0_T1_T2_T3_T4_T5_T6_T7_T8_
                    1.30%  1.2842ms         5  256.84us  249.44us  270.91us  cudapy::cudf::utils::cudautils::gpu_gather$243(Array<__int64, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>)
                    0.71%  702.33us        15  46.822us  45.567us  50.720us  [CUDA memcpy DtoD]
                    0.62%  618.85us        23  26.906us  1.1840us  219.30us  void kernel_v_v<char, long, long, Equal>(int, char*, long*, long*)
                    0.61%  607.71us        23  26.422us  1.9520us  139.26us  void _GLOBAL__N__56_tmpxft_00002643_00000000_7_reductions_compute_70_cpp1_ii_c1104e96::gpu_reduction_op<cudf::detail::wrapper<char, gdf_dtype=7>, cudf::detail::wrapper<char, gdf_dtype=7>, cudf::DeviceMin, cudf::reductions::IdentityLoader>(char const *, unsigned char const *, int, gdf_dtype=7*, cudf::detail::wrapper<char, gdf_dtype=7>, unsigned char const *, cudf::detail::wrapper<char, gdf_dtype=7>)
                    0.58%  571.77us        10  57.177us  1.2160us  95.423us  cudapy::cudf::utils::cudautils::gpu_arange$241(__int64, __int64, __int64, Array<__int64, int=1, A, mutable, aligned>)
....
....

==5018== Device "Tesla V100-SXM2-16GB (1)"
No kernels were profiled.

==5018== Device "Tesla V100-SXM2-16GB (2)"
No kernels were profiled.

==5018== Device "Tesla V100-SXM2-16GB (3)"
No kernels were profiled.

I can see that only GPU (0) is being profiled, and no kernels were profiled on the other three GPUs in the system.

Am I missing something on the nvprof usage side? Do I need to use another nvprof option to profile all the GPU activity?

Thanks!

pentschev commented 5 years ago

My apologies for the delay in responding here.

After analyzing this issue a little further: indeed, nvprof doesn't report anything on GPUs other than 0, but watching nvidia-smi I can see that there is GPU utilization on all GPUs of the machine.

Would you mind doing another test? What I suggest is that you run your code again and watch nvidia-smi during its execution. The difference I noticed is that if import cudf/import dask_cudf happen at the top, I see all GPUs consuming 11MB with 0% utilization for the entire execution time, whereas GPU 0 reaches over 10GB of consumption and its utilization goes up to 100% at times. If I move the imports to after printing the cluster information, I see the GPUs consuming 429MB immediately after that point, with the exception of GPU 0 consuming > 4GB (as it's populating df); after some time I see memory and GPU utilization increasing on the other GPUs, but the increases are quite subtle as there's not much computation going on.

I believe the reason nvprof doesn't report utilization on all GPUs is that dask-cuda uses the CUDA_VISIBLE_DEVICES environment variable to select which GPU is used by each process. Within each process, the GPU being utilized is always seen as GPU 0, and I think nvprof uses that index in its report. I can't confirm yet whether my assumption is correct, but I'll try to make a simple example and perhaps file a bug report against nvprof if necessary.
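
To see this per-process view from the client side, a minimal sketch, assuming a running LocalCUDACluster and using distributed's Client.run to execute a small function on every worker:

import os

from dask.distributed import Client
from dask_cuda import LocalCUDACluster


def visible_devices():
    # dask-cuda sets CUDA_VISIBLE_DEVICES differently for each worker
    # process, so the device each worker sees as "GPU 0" differs.
    return os.environ.get("CUDA_VISIBLE_DEVICES")


if __name__ == '__main__':
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # Returns a dict mapping each worker address to its CUDA_VISIBLE_DEVICES.
    print(client.run(visible_devices))
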

pradghos commented 5 years ago

Thanks @pentschev. That's a very interesting observation. I will try out the suggested test.

pradghos commented 5 years ago

@pentschev: here are the test results -

If I add the import statements at the top -

|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   41C    P0    52W / 300W |   2293MiB / 16130MiB |     22%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   42C    P0    39W / 300W |     10MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   39C    P0    38W / 300W |     10MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   45C    P0    41W / 300W |     10MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     70918      C   python                                       399MiB |
|    0     70947      C   /opt/anaconda3/bin/python                    591MiB |
|    0     70948      C   /opt/anaconda3/bin/python                    447MiB |
|    0     70950      C   /opt/anaconda3/bin/python                    447MiB |
|    0     70952      C   /opt/anaconda3/bin/python                    399MiB |
+-----------------------------------------------------------------------------+

After the client = Client(cluster) step -

| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   42C    P0    52W / 300W |    906MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   44C    P0    54W / 300W |    505MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   41C    P0    53W / 300W |    505MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   46C    P0    56W / 300W |    697MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     72371      C   python                                       401MiB |
|    0     72393      C   /opt/anaconda3/bin/python                    495MiB |
|    1     72397      C   /opt/anaconda3/bin/python                    495MiB |
|    2     72395      C   /opt/anaconda3/bin/python                    495MiB |
|    3     72394      C   /opt/anaconda3/bin/python                    687MiB |
+-----------------------------------------------------------------------------+

Just before the client = Client(cluster) step -

|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   41C    P0    52W / 300W |   1480MiB / 16130MiB |      7%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   44C    P0    54W / 300W |    505MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   40C    P0    53W / 300W |    505MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   46C    P0    55W / 300W |    505MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     79158      C   python                                       879MiB |
|    0     79178      C   /opt/anaconda3/bin/python                    591MiB |
|    1     79176      C   /opt/anaconda3/bin/python                    495MiB |
|    2     79181      C   /opt/anaconda3/bin/python                    495MiB |
|    3     79177      C   /opt/anaconda3/bin/python                    495MiB |
+-----------------------------------------------------------------------------+
pentschev commented 5 years ago

From the Processes table, it looks like importing cudf after LocalCUDACluster worked, since we can see python processes on all 4 GPUs.

I know this is not a great solution, but we don't have a better one yet. Does this workaround solve your problem for now?

pentschev commented 5 years ago

The issue with the delay for workers to start and for the cluster to report them was also fixed in https://github.com/rapidsai/dask-cuda/pull/78.

pradghos commented 5 years ago

From the Processes table, it looks like importing cudf after LocalCUDACluster worked, since we can see python processes on all 4 GPUs.

I know this is not a great solution, but we don't have a better one yet. Does this workaround solve your problem for now?

Sure. Thanks for looking into it. Any luck on the nvprof issue?

pentschev commented 5 years ago

Sure. Thanks for looking into it. Any luck on the nvprof issue?

Unfortunately, I haven't had the chance yet; I will try to do it next week.

pradghos commented 5 years ago

I was trying to create a multi-node cluster, so I started the scheduler on another node and created workers using dask-cuda-worker <scheduler-ip>:8786 on a 4-GPU node.

The nvidia-smi output looks like this -


|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   44C    P0    55W / 300W |    620MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   46C    P0    57W / 300W |    620MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   42C    P0    54W / 300W |    620MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   48C    P0    56W / 300W |    620MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     66586      C   /opt/anaconda3/bin/python                    305MiB |
|    0     66617      C   /opt/anaconda3/bin/python                    305MiB |
|    1     66586      C   /opt/anaconda3/bin/python                    305MiB |
|    1     66620      C   /opt/anaconda3/bin/python                    305MiB |
|    2     66586      C   /opt/anaconda3/bin/python                    305MiB |
|    2     66618      C   /opt/anaconda3/bin/python                    305MiB |
|    3     66586      C   /opt/anaconda3/bin/python                    305MiB |
|    3     66619      C   /opt/anaconda3/bin/python                    305MiB |
+-----------------------------------------------------------------------------+

$ ps -ef | grep 66586
1084      66586  46641  8 09:48 pts/39   00:00:02 /opt/anaconda3/bin/python /opt/anaconda3/bin/dask-cuda-worker 172.18.0.25:8786
1084      66591  66586  0 09:48 pts/39   00:00:00 [sh] <defunct>
1084      66610  66586  0 09:48 pts/39   00:00:00 /opt/anaconda3/bin/python -c from multiprocessing.semaphore_tracker import main;main(79)
1084      66615  66586  1 09:48 pts/39   00:00:00 /opt/anaconda3/bin/python -c from multiprocessing.forkserver import main; main(94, 108, ['distributed', 'pkg_resources'], **{'sys_path': ['/opt/anaconda3/bin', '/opt/anaconda3/lib/python36.zip', '/opt/anaconda3/lib/python3.6', '/opt/anaconda3/lib/python3.6/lib-dynload', '/opt/anaconda3/lib/python3.6/site-packages']})
pradghos  66675  54389  0 09:48 pts/44   00:00:00 grep --color=auto 66586

$ ps -ef |  grep 66617
1084      66617  66615  3 09:48 pts/39   00:00:01 /opt/anaconda3/bin/python -c from multiprocessing.forkserver import main; main(94, 108, ['distributed', 'pkg_resources'], **{'sys_path': ['/opt/anaconda3/bin', '/opt/anaconda3/lib/python36.zip', '/opt/anaconda3/lib/python3.6', '/opt/anaconda3/lib/python3.6/lib-dynload', '/opt/anaconda3/lib/python3.6/site-packages']})
pradghos  66681  54389  0 09:48 pts/44   00:00:00 grep --color=auto 66617

Process 66586 shows up on every GPU, and 66617, 66620, 66618 and 66619 are probably the workers, one per GPU.

The question here is: is it expected to see process 66586 (/opt/anaconda3/bin/python /opt/anaconda3/bin/dask-cuda-worker) on every GPU? Also, process 66586 has allocated another 305MB on each GPU, which may not be required.

@pentschev: Any comment / suggestion on this behavior would help.

Thanks!

pentschev commented 5 years ago

The question here is: is it expected to see process 66586 (/opt/anaconda3/bin/python /opt/anaconda3/bin/dask-cuda-worker) on every GPU? Also, process 66586 has allocated another 305MB on each GPU, which may not be required.

It is not expected; there should be one process per GPU. Could you share the output of conda list and, if you installed dask/distributed/dask-cuda from source, the respective commits you used? Could you also share the command you used to start things up from the command line, or a minimal sample of the code if you started them directly from your Python script?

pradghos commented 5 years ago

The question here is: is it expected to see process 66586 (/opt/anaconda3/bin/python /opt/anaconda3/bin/dask-cuda-worker) on every GPU? Also, process 66586 has allocated another 305MB on each GPU, which may not be required.

It is not expected; there should be one process per GPU. Could you share the output of conda list and, if you installed dask/distributed/dask-cuda from source, the respective commits you used? Could you also share the command you used to start things up from the command line, or a minimal sample of the code if you started them directly from your Python script?

@pentschev: Sorry for the delayed response! I used dask-cuda 0.8.0, dask 2.0.0, and distributed 2.0.1 for this reproduction. Here are the steps I tried (a connection sketch follows the list) -

  1. Started the scheduler using the dask-scheduler command on one node.
  2. Started the workers using dask-cuda-worker <scheduler-ip>:8786 on another node (the one with GPUs).
  3. Collected the nvidia-smi output on the worker node.
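
A minimal connection sketch for that setup, assuming the scheduler address from the ps output above (172.18.0.25:8786) is reachable from the client:

from dask.distributed import Client

if __name__ == '__main__':
    # Connect to the scheduler started with `dask-scheduler`; the workers
    # were started separately with `dask-cuda-worker 172.18.0.25:8786`.
    client = Client("172.18.0.25:8786")

    # scheduler_info() lists the workers currently registered with the
    # scheduler; one worker per GPU is expected.
    workers = client.scheduler_info()["workers"]
    print(len(workers), "workers registered")
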
pradghos commented 5 years ago

@pentschev: Please let me know if this is not the correct way to create a multi-node, GPU-enabled cluster using dask and dask-cuda. Thank you!

pentschev commented 5 years ago

Sorry @pradghos, I totally missed your response here. This is strange indeed; I don't think you're doing anything incorrect. I'm going to do the following: close this issue (since the original question has been answered/fixed) and open two new issues for the still-unresolved problems you reported in https://github.com/rapidsai/dask-cuda/issues/74#issuecomment-502957116 and https://github.com/rapidsai/dask-cuda/issues/74#issuecomment-509687986. This way we can track the issues individually instead of ending up missing them.

Regardless of that, sorry for not being more responsive here, and thanks for reporting the issues, this is very helpful!