The spawning of workers is a recursive process, so you need to prevent the code from running itself again. You can do that by checking whether the script is being run as the main module; the following modification should work:
import os
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
if __name__ == '__main__':
    cluster = LocalCUDACluster(scheduler_port=12347, n_workers=2, threads_per_worker=1)
    print("cluster status ", cluster.status)
    print("cluster information ", cluster)
    client = Client(cluster)
    print("client information ", client)
Thanks @pentschev for the input!
I have observed that the workers and other cluster information come up with a delay -
import time
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == '__main__':
    # cluster = LocalCUDACluster(scheduler_port=12347, n_workers=2, threads_per_worker=1)
    cluster = LocalCUDACluster()
    print("cluster status ", cluster.status)
    print("cluster information ", cluster)
    client = Client(cluster)
    print("client information ", client)
    time.sleep(2)  # added sleep(2)
    print("cluster status ", cluster.status)
    print("cluster information ", cluster)
    print("client information ", client)
Log:
(base) [builder@d065228d37d1 ~]$ python test_cluster.py
cluster status running
cluster information LocalCUDACluster('tcp://127.0.0.1:38327', workers=0, ncores=0)
client information <Client: scheduler='tcp://127.0.0.1:38327' processes=0 cores=0>
cluster status running -----> after sleep(2)
cluster information LocalCUDACluster('tcp://127.0.0.1:38327', workers=4, ncores=4)
client information <Client: scheduler='tcp://127.0.0.1:38327' processes=4 cores=4>
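As a side note, rather than a fixed sleep, one can poll the scheduler until the expected number of workers has registered. A minimal sketch, assuming the same cluster and client as above (client.scheduler_info() is a standard dask.distributed API, and the wait_for_n_workers helper is hypothetical; newer distributed releases also offer a built-in Client.wait_for_workers):

import time

def wait_for_n_workers(client, n_workers, timeout=60):
    # Poll the scheduler until it reports at least n_workers registered workers.
    deadline = time.time() + timeout
    while len(client.scheduler_info()["workers"]) < n_workers:
        if time.time() > deadline:
            raise TimeoutError("workers did not start in time")
        time.sleep(0.1)

wait_for_n_workers(client, n_workers=4)
print("cluster information ", cluster)
print("client information ", client)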
@pentschev: Another query is about distributing the workload between the workers and multiple GPUs.
Code snippet:
import numpy as np
from dask.delayed import delayed
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from pandas.util.testing import assert_frame_equal
import cudf as gd
import dask_cudf as dgd
import time

if __name__ == '__main__':
    cluster = LocalCUDACluster()
    client = Client(cluster)
    time.sleep(5)
    print("cluster status ", cluster.status)
    print("cluster information ", cluster)
    print("client information ", client)
    # number of GPUs and number of workers are the same
    nelem = 10000000
    df = gd.DataFrame()
    df["x"] = np.arange(nelem)
    df["y"] = np.random.randint(nelem, size=nelem)
    ddf = dgd.from_cudf(df, npartitions=5)
    delays = ddf.to_delayed()
    assert len(delays) == 5
    # Concatenate the delayed partitions
    got = gd.concat([d.compute() for d in delays])
    assert_frame_equal(got.to_pandas(), df.to_pandas())
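As an aside, one way to check how the delayed partitions end up spread across workers is to ask the scheduler which worker holds each result. A minimal sketch, assuming the cluster, client, and delays list from the snippet above (client.compute, wait, and Client.who_has are standard dask.distributed APIs):

from dask.distributed import wait

# Submit all delayed partitions to the cluster at once and wait for them to finish.
futures = client.compute(delays)
wait(futures)

# Map each resulting key to the worker(s) currently holding it.
for key, workers in client.who_has(futures).items():
    print(key, "->", workers)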
However, I was trying to profile the GPU activities for each GPU -
(base) [builder@d065228d37d1 ~]$ nvprof --print-summary-per-gpu --profile-child-processes python test_cluster2.py
==2368== NVPROF is profiling process 2368, command: python test_cluster2.py
cluster status running
cluster infomarion LocalCUDACluster('tcp://127.0.0.1:44105', workers=0, ncores=0)
.....
cluster status running
cluster infomarion LocalCUDACluster('tcp://127.0.0.1:44105', workers=4, ncores=4)
client information <Client: scheduler='tcp://127.0.0.1:44105' processes=4 cores=4>
....
==2368== Profiling application: python test_cluster2.py
==2368== Profiling result:
==2368== Device "Tesla V100-SXM2-16GB (0)"
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 41.71% 29.160ms 37 788.11us 1.2800us 7.0561ms [CUDA memcpy HtoD]
37.41% 26.155ms 45 581.21us 1.6640us 6.0504ms [CUDA memcpy DtoH]
13.01% 9.0988ms 1 9.0988ms 9.0988ms 9.0988ms _ZN6thrust8cuda_cub4core13_kernel_agentINS0_12__merge_sort14BlockSortAgentIPiS5_lZ14multi_col_sortIiEvPKPvPKPhS5_PammbPT_bbP11CUstream_stEUliiE1_NS_6detail17integral_constantIbLb0EEESL_EEbS5_S5_lS5_S5_SI_EEvT0_T1_T2_T3_T4_T5_T6_
1.85% 1.2913ms 12 107.61us 103.46us 117.34us _ZN6thrust8cuda_cub4core13_kernel_agentINS0_12__merge_sort10MergeAgentIPiS5_lZ14multi_col_sortIiEvPKPvPKPhS5_PammbPT_bbP11CUstream_stEUliiE1_NS_6detail17integral_constantIbLb0EEEEEbS5_S5_lS5_S5_SI_PllEEvT0_T1_T2_T3_T4_T5_T6_T7_T8_
1.83% 1.2762ms 5 255.23us 249.44us 266.33us cudapy::cudf::utils::cudautils::gpu_gather$243(Array<__int64, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>)
0.99% 694.43us 15 46.295us 45.184us 50.656us [CUDA memcpy DtoD]
0.87% 606.08us 16 37.879us 25.024us 218.94us void kernel_v_v<char, long, long, Equal>(int, char*, long*, long*)
0.81% 567.96us 6 94.660us 94.463us 94.783us cudapy::cudf::utils::cudautils::gpu_arange$241(__int64, __int64, __int64, Array<__int64, int=1, A, mutable, aligned>)
0.73% 513.47us 16 32.091us 25.984us 120.70us void _GLOBAL__N__56_tmpxft_00002643_00000000_7_reductions_compute_70_cpp1_ii_c1104e96::gpu_reduction_op<cudf::detail::wrapper<char, gdf_dtype=7>, cudf::detail::wrapper<char, gdf_dtype=7>, cudf::DeviceMin, cudf::reductions::IdentityLoader>(char const *, unsigned char const *, int, gdf_dtype=7*, cudf::detail::wrapper<char, gdf_dtype=7>, unsigned char const *, cudf::detail::wrapper<char, gdf_dtype=7>)
0.50% 351.71us 12 29.309us 22.048us 36.288us _ZN6thrust8cuda_cub4core13_kernel_agentINS0_12__merge_sort14PartitionAgentIPilZ14multi_col_sortIiEvPKPvPKPhS5_PammbPT_bbP11CUstream_stEUliiE1_EEbS5_S5_lmPlSI_liEEvT0_T1_T2_T3_T4_T5_T6_T7_T8_
0.22% 150.37us 1 150.37us 150.37us 150.37us cudapy::cudf::utils::cudautils::gpu_copy$242(Array<__int64, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>)
0.07% 48.256us 1 48.256us 48.256us 48.256us void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__tabulate::functor<int*, thrust::system::detail::generic::sequence_detail::sequence_functor<int>, long>, long>, thrust::cuda_cub::__tabulate::functor<int*, thrust::system::detail::generic::sequence_detail::sequence_functor<int>, long>, long>(int, thrust::system::detail::generic::sequence_detail::sequence_functor<int>)
0.00% 1.6640us 1 1.6640us 1.6640us 1.6640us void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<void*>, void*>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<void*>, void*>, unsigned long>(thrust::device_ptr<void*>, void*)
0.00% 1.2480us 1 1.2480us 1.2480us 1.2480us void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<unsigned char*>, unsigned char*>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<unsigned char*>, unsigned char*>, unsigned long>(thrust::device_ptr<unsigned char*>, unsigned char*)
0.00% 1.1200us 1 1.1200us 1.1200us 1.1200us void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>, int>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>, int>, unsigned long>(thrust::device_ptr<int>, int)
==2368== Device "Tesla V100-SXM2-16GB (1)"
No kernels were profiled.
==2368== Device "Tesla V100-SXM2-16GB (2)"
No kernels were profiled.
==2368== Device "Tesla V100-SXM2-16GB (3)"
No kernels were profiled.
Sorry for the long output - I see only GPU (0) being used. How do I ensure the workload is spread across multiple GPUs, and is this the correct way to verify it? Thank you!
Regarding the status information, this indeed looks like a bug that we had not yet noticed, thanks for reporting.
Your code for GPU distribution is correct, but there is another bug (this time known and discussed in #32) when cudf is imported (or dask_cudf, which internally also imports cudf) before the creation of LocalCUDACluster. The simplest solution for your example is just to move those two imports inside __main__ and after Client(); that may suffice for your use case. However, it isn't clear if this works for all possible pipelines, so another way to work around it for now is to use the CLI for dask-scheduler and dask-cuda-worker to start those up manually.
Also please note that [df.compute() for df in dfs] is sequential. Perhaps you wanted dask.compute(*dfs)
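For illustration, a minimal sketch of the difference, assuming delays is the list of delayed partitions from the snippet above:

import dask
import cudf as gd

# Sequential: each .compute() call blocks and processes one partition at a time.
parts_sequential = [d.compute() for d in delays]

# Parallel: hand all delayed partitions to the scheduler in a single call so
# they can be executed concurrently across the workers.
parts_parallel = dask.compute(*delays)

got = gd.concat(list(parts_parallel))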
Thanks @pentschev @mrocklin for all the suggestions.
After the recommended modifications to the example snippet -
if __name__ == '__main__':
    # cluster = LocalCUDACluster(scheduler_port=12347, n_workers=2, threads_per_worker=1)
    cluster = LocalCUDACluster()
    print("cluster status ", cluster.status)
    print("cluster information ", cluster)
    client = Client(cluster)
    import cudf as gd
    import dask_cudf as dgd
    ...
    ...
    got = gd.concat(dask.compute(*delays))
nvidia-smi usage shows that all 4 GPUs are being used -
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 74287 C python 405MiB |
+-----------------------------------------------------------------------------+
Mon Jun 17 23:20:41 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.03 Driver Version: 418.40.03 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000004:04:00.0 Off | 0 |
| N/A 44C P0 67W / 300W | 1130MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000004:05:00.0 Off | 0 |
| N/A 45C P0 69W / 300W | 325MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000035:03:00.0 Off | 0 |
| N/A 42C P0 68W / 300W | 325MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000035:04:00.0 Off | 0 |
| N/A 48C P0 69W / 300W | 325MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 74287 C python 805MiB |
| 0 74316 C /opt/anaconda3/bin/python 315MiB |
| 1 74314 C /opt/anaconda3/bin/python 315MiB |
| 2 74313 C /opt/anaconda3/bin/python 315MiB |
| 3 74315 C /opt/anaconda3/bin/python 315MiB |
+-----------------------------------------------------------------------------+
But when I use nvprof --print-summary-per-gpu --profile-child-processes python test_cluster2.py, the result is the same as before -
==5018== Device "Tesla V100-SXM2-16GB (0)"
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 42.34% 41.962ms 183 229.30us 1.4710us 4.2174ms [CUDA memcpy DtoH]
41.68% 41.305ms 101 408.96us 1.0240us 6.8770ms [CUDA memcpy HtoD]
9.50% 9.4118ms 1 9.4118ms 9.4118ms 9.4118ms _ZN6thrust8cuda_cub4core13_kernel_agentINS0_12__merge_sort14BlockSortAgentIPiS5_lZ14multi_col_sortIiEvPKPvPKPhS5_PammbPT_bbP11CUstream_stEUliiE1_NS_6detail17integral_constantIbLb0EEESL_EEbS5_S5_lS5_S5_SI_EEvT0_T1_T2_T3_T4_T5_T6_
1.31% 1.2946ms 12 107.88us 102.75us 122.37us _ZN6thrust8cuda_cub4core13_kernel_agentINS0_12__merge_sort10MergeAgentIPiS5_lZ14multi_col_sortIiEvPKPvPKPhS5_PammbPT_bbP11CUstream_stEUliiE1_NS_6detail17integral_constantIbLb0EEEEEbS5_S5_lS5_S5_SI_PllEEvT0_T1_T2_T3_T4_T5_T6_T7_T8_
1.30% 1.2842ms 5 256.84us 249.44us 270.91us cudapy::cudf::utils::cudautils::gpu_gather$243(Array<__int64, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>)
0.71% 702.33us 15 46.822us 45.567us 50.720us [CUDA memcpy DtoD]
0.62% 618.85us 23 26.906us 1.1840us 219.30us void kernel_v_v<char, long, long, Equal>(int, char*, long*, long*)
0.61% 607.71us 23 26.422us 1.9520us 139.26us void _GLOBAL__N__56_tmpxft_00002643_00000000_7_reductions_compute_70_cpp1_ii_c1104e96::gpu_reduction_op<cudf::detail::wrapper<char, gdf_dtype=7>, cudf::detail::wrapper<char, gdf_dtype=7>, cudf::DeviceMin, cudf::reductions::IdentityLoader>(char const *, unsigned char const *, int, gdf_dtype=7*, cudf::detail::wrapper<char, gdf_dtype=7>, unsigned char const *, cudf::detail::wrapper<char, gdf_dtype=7>)
0.58% 571.77us 10 57.177us 1.2160us 95.423us cudapy::cudf::utils::cudautils::gpu_arange$241(__int64, __int64, __int64, Array<__int64, int=1, A, mutable, aligned>)
....
....
==5018== Device "Tesla V100-SXM2-16GB (1)"
No kernels were profiled.
==5018== Device "Tesla V100-SXM2-16GB (2)"
No kernels were profiled.
==5018== Device "Tesla V100-SXM2-16GB (3)"
No kernels were profiled.
I can see that only GPU (0) is getting profiled, and "No kernels were profiled" is reported for the remaining three GPUs in the system.
Am I missing something on the nvprof usage side? Do I need to use any other nvprof option to profile all the GPU activities?
Thanks!
My apologies for the delay in responding here.
After analyzing this issue a little further: indeed, nvprof doesn't report anything on GPUs other than 0, but watching nvidia-smi I can see that there is GPU utilization on all GPUs of the machine.
Would you mind doing another test? What I suggest is that you run your code again and watch nvidia-smi during its execution. The differences I noticed were that if import cudf/import dask_cudf happen at the top, I see all GPUs consuming 11MB and 0% utilization for the entire execution time, whereas GPU 0 reaches over 10GB of consumption and its utilization goes up to 100% at times. If I move the imports to after printing the cluster information, immediately after that happens I see the GPUs consuming 429MB, with the exception of GPU 0 consuming > 4GB (as it's populating df); after some time I see memory and GPU utilization increasing on the other GPUs, but the increases are really subtle as there's not much computation going on.
I believe the reason for nvprof not reporting utilization on all GPUs is that dask-cuda uses the CUDA_VISIBLE_DEVICES environment variable to select which GPU is used by each process. Within each process, the GPU being utilized is seen as GPU 0 at all times, and I think nvprof is using that index in its report. I can't confirm if my assumption is correct yet, but I'll try to make a simple example and perhaps file a bug report against nvprof if necessary.
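A minimal sketch of how one might confirm the per-worker GPU assignment, assuming a LocalCUDACluster as above (Client.run is a standard dask.distributed API that executes a function on every worker; the visible_devices helper is hypothetical):

import os
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

def visible_devices():
    # dask-cuda sets CUDA_VISIBLE_DEVICES per worker, so each worker sees a
    # different device ordering and addresses its assigned GPU as device 0.
    return os.environ.get("CUDA_VISIBLE_DEVICES")

if __name__ == '__main__':
    cluster = LocalCUDACluster()
    client = Client(cluster)
    for worker, devices in client.run(visible_devices).items():
        print(worker, devices)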
Thanks @pentschev. That's a very interesting observation. I will try out the suggested test.
@pentschev: here are the test results.
If I add the import statements at the top -
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000004:04:00.0 Off | 0 |
| N/A 41C P0 52W / 300W | 2293MiB / 16130MiB | 22% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000004:05:00.0 Off | 0 |
| N/A 42C P0 39W / 300W | 10MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000035:03:00.0 Off | 0 |
| N/A 39C P0 38W / 300W | 10MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000035:04:00.0 Off | 0 |
| N/A 45C P0 41W / 300W | 10MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 70918 C python 399MiB |
| 0 70947 C /opt/anaconda3/bin/python 591MiB |
| 0 70948 C /opt/anaconda3/bin/python 447MiB |
| 0 70950 C /opt/anaconda3/bin/python 447MiB |
| 0 70952 C /opt/anaconda3/bin/python 399MiB |
+-----------------------------------------------------------------------------+
After the client = Client(cluster) step -
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000004:04:00.0 Off | 0 |
| N/A 42C P0 52W / 300W | 906MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000004:05:00.0 Off | 0 |
| N/A 44C P0 54W / 300W | 505MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000035:03:00.0 Off | 0 |
| N/A 41C P0 53W / 300W | 505MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000035:04:00.0 Off | 0 |
| N/A 46C P0 56W / 300W | 697MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 72371 C python 401MiB |
| 0 72393 C /opt/anaconda3/bin/python 495MiB |
| 1 72397 C /opt/anaconda3/bin/python 495MiB |
| 2 72395 C /opt/anaconda3/bin/python 495MiB |
| 3 72394 C /opt/anaconda3/bin/python 687MiB |
+-----------------------------------------------------------------------------+
Just before the client = Client(cluster) step -
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000004:04:00.0 Off | 0 |
| N/A 41C P0 52W / 300W | 1480MiB / 16130MiB | 7% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000004:05:00.0 Off | 0 |
| N/A 44C P0 54W / 300W | 505MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000035:03:00.0 Off | 0 |
| N/A 40C P0 53W / 300W | 505MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000035:04:00.0 Off | 0 |
| N/A 46C P0 55W / 300W | 505MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 79158 C python 879MiB |
| 0 79178 C /opt/anaconda3/bin/python 591MiB |
| 1 79176 C /opt/anaconda3/bin/python 495MiB |
| 2 79181 C /opt/anaconda3/bin/python 495MiB |
| 3 79177 C /opt/anaconda3/bin/python 495MiB |
+-----------------------------------------------------------------------------+
From the Processes table, it looks like importing cudf after LocalCUDACluster seems to have worked, since we can see that there are python processes on all 4 GPUs.
I know this is not a great solution, but we still don't have one. Does this workaround solve your problem for now?
The issue with the delay for workers to start and for the cluster to report them was also fixed in https://github.com/rapidsai/dask-cuda/pull/78.
Sure. Thanks for looking into it. Any luck on the nvprof issue?
Unfortunately, I haven't had the chance yet, will try to do it next week.
I was trying to create a multi-node cluster, so I started the scheduler on another node and tried to create workers using dask-cuda-worker <scheduler-ip>:8786 on a 4-GPU node.
The nvidia-smi output looks like below -
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000004:04:00.0 Off | 0 |
| N/A 44C P0 55W / 300W | 620MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000004:05:00.0 Off | 0 |
| N/A 46C P0 57W / 300W | 620MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000035:03:00.0 Off | 0 |
| N/A 42C P0 54W / 300W | 620MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000035:04:00.0 Off | 0 |
| N/A 48C P0 56W / 300W | 620MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 66586 C /opt/anaconda3/bin/python 305MiB |
| 0 66617 C /opt/anaconda3/bin/python 305MiB |
| 1 66586 C /opt/anaconda3/bin/python 305MiB |
| 1 66620 C /opt/anaconda3/bin/python 305MiB |
| 2 66586 C /opt/anaconda3/bin/python 305MiB |
| 2 66618 C /opt/anaconda3/bin/python 305MiB |
| 3 66586 C /opt/anaconda3/bin/python 305MiB |
| 3 66619 C /opt/anaconda3/bin/python 305MiB |
+-----------------------------------------------------------------------------+
$ ps -ef | grep 66586
1084 66586 46641 8 09:48 pts/39 00:00:02 /opt/anaconda3/bin/python /opt/anaconda3/bin/dask-cuda-worker 172.18.0.25:8786
1084 66591 66586 0 09:48 pts/39 00:00:00 [sh] <defunct>
1084 66610 66586 0 09:48 pts/39 00:00:00 /opt/anaconda3/bin/python -c from multiprocessing.semaphore_tracker import main;main(79)
1084 66615 66586 1 09:48 pts/39 00:00:00 /opt/anaconda3/bin/python -c from multiprocessing.forkserver import main; main(94, 108, ['distributed', 'pkg_resources'], **{'sys_path': ['/opt/anaconda3/bin', '/opt/anaconda3/lib/python36.zip', '/opt/anaconda3/lib/python3.6', '/opt/anaconda3/lib/python3.6/lib-dynload', '/opt/anaconda3/lib/python3.6/site-packages']})
pradghos 66675 54389 0 09:48 pts/44 00:00:00 grep --color=auto 66586
$ ps -ef | grep 66617
1084 66617 66615 3 09:48 pts/39 00:00:01 /opt/anaconda3/bin/python -c from multiprocessing.forkserver import main; main(94, 108, ['distributed', 'pkg_resources'], **{'sys_path': ['/opt/anaconda3/bin', '/opt/anaconda3/lib/python36.zip', '/opt/anaconda3/lib/python3.6', '/opt/anaconda3/lib/python3.6/lib-dynload', '/opt/anaconda3/lib/python3.6/site-packages']})
pradghos 66681 54389 0 09:48 pts/44 00:00:00 grep --color=auto 66617
66586 is started on each GPU, and 66617, 66620, 66618 and 66619 are probably the workers started on each GPU.
The question here is: is it expected to see process 66586 (/opt/anaconda3/bin/python /opt/anaconda3/bin/dask-cuda-worker) on every GPU? Also, process 66586 has allocated another 305MB from each GPU, which may not be required.
@pentschev: Any comment / suggestion on this behavior would help.
Thanks!
It is not expected; there should be one process per GPU. Could you share the output of conda list and, if you're installing dask/distributed/dask-cuda from source, the respective commit you used for that? Could you also share the command you used to start up from the command line, or a minimal sample of the code used if you started directly from your python script?
@pentschev: Sorry for the delayed response! I used dask-cuda 0.8.0, dask 2.0.0 and distributed 2.0.1 during this recreate. Here are the steps I tried:
1. Ran the dask-scheduler command on one node.
2. Ran dask-cuda-worker <scheduler-ip>:8786 on another node (having GPUs).
3. Checked the nvidia-smi output on the worker node.
@pentschev: Please let me know if this is not the correct way to create a multi-node (GPU-enabled) cluster using dask and dask-cuda. Thank you!
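For reference, the client side of such a manually started cluster would connect roughly like this (a minimal sketch; <scheduler-ip>:8786 is the placeholder address used above):

from dask.distributed import Client

if __name__ == '__main__':
    # Connect to the externally started dask-scheduler; the dask-cuda-worker
    # processes on the GPU node register themselves with this same address.
    client = Client("<scheduler-ip>:8786")
    print("client information ", client)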
Sorry @pradghos, I totally missed your response here. This is strange indeed; I don't think you're doing anything incorrect. I'm going to do the following: close this issue (since the original question has been answered/fixed) and open two new issues for the still unresolved problems you reported here in https://github.com/rapidsai/dask-cuda/issues/74#issuecomment-502957116 and https://github.com/rapidsai/dask-cuda/issues/74#issuecomment-509687986. This way we can track issues individually instead of ending up missing them.
Regardless of that, sorry for not being so responsive here and thanks for reporting the issues, this is very helpful!
Hi,
I want to create a local CUDA dask cluster using LocalCUDACluster; the python script is mentioned below.
When I am using the python command prompt, it works for me.
However, it does not work when I try to run it as a python script using
python test_cluster.py
Any pointers if I am missing something? Thanks in advance!