rapidsai-community / notebooks-contrib

RAPIDS Community Notebooks

E2E-taxi - Warnings #101

Open vilmara opened 5 years ago

vilmara commented 5 years ago

System:
- Docker image: rapidsai/rapidsai:0.8-cuda10.0-devel-ubuntu16.04-gcc5-py3.7
- 2 servers with 4x V100 GPUs each

Hi all, I am testing the E2E-taxi notebook in multi-node mode using dask-cuda (containers connected via Docker on the host network), and I am getting these warnings:

- Run out-of-band function 'start_tracker'
- Event loop was unresponsive in Worker for 9.04s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
- libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
- libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
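
For reference, the notebook connects to the two-server cluster roughly as in the sketch below; the scheduler address is a placeholder for illustration, not the actual hostname used here.

# Minimal sketch of the client connection used by the notebook
# (the scheduler address below is a placeholder, not the real host).
from dask.distributed import Client

client = Client("tcp://scheduler-host:8786")  # hypothetical address of the dask-scheduler
print(client)  # should list the dask-cuda workers from both servers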

See below detailed tracelog on each node

dask-scheduler

distributed.worker - INFO - Run out-of-band function 'start_tracker'

dask-cuda-worker (running on the same machine as dask scheduler)

distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.core - INFO - Event loop was unresponsive in Worker for 271.59s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 271.60s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 271.60s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 271.63s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 64.24s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 9.05s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 9.04s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 9.04s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.

dask-cuda-worker (running on another machine)

distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.core - INFO - Event loop was unresponsive in Worker for 9.77s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 9.04s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 9.03s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 9.04s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 280.05s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 280.47s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 280.54s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 280.69s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 27.10s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 9.04s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Worker for 9.04s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
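
For context, the "Run out-of-band function 'start_tracker'" line on the scheduler appears to come from the XGBoost training step of the notebook, where dask-xgboost starts its Rabit tracker on the scheduler. A rough sketch of that call is below; the parameters and variable names are placeholders, not the notebook's exact settings.

# Sketch of the training call that triggers the scheduler-side
# "Run out-of-band function 'start_tracker'" message.
# params, X_train and y_train are placeholders for the notebook's own values.
import dask_xgboost as dxgb_gpu

params = {'tree_method': 'gpu_hist'}  # placeholder parameters
bst = dxgb_gpu.train(client, params, X_train, y_train, num_boost_round=100)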
taureandyernv commented 5 years ago

@vilmara, @randerzander and his team are working on this and updating the notebook to remove the GCP dependency.

vilmara commented 5 years ago

Hi @taureandyernv / @randerzander, I have run the notebook with the servers connected over InfiniBand and eliminated the warnings at the CUDA workers; however, the message distributed.worker - INFO - Run out-of-band function 'start_tracker' still persists at the dask-scheduler node. Does it affect performance, or should I ignore it?
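
If it is only cosmetic, one option would be to raise the log level so it is not printed, along the lines of the sketch below; this only changes logging verbosity in the process where it runs, not cluster behavior.

# Raise the log level for the distributed loggers in the current process
# so routine INFO messages are hidden; does not affect other processes.
import logging

logging.getLogger('distributed.worker').setLevel(logging.WARNING)
logging.getLogger('distributed.core').setLevel(logging.WARNING)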