rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.42k stars 899 forks source link

[BUG] simple cudf program execution time jumps from seconds to over five minutes with vgpu driver #10507

Closed tmckayus closed 2 years ago

tmckayus commented 2 years ago

Describe the bug The execution time of a simple Python program that imports cudf and creates a dataframe will eventually increase to over 5 minutes when run multiple times on a system using the vGPU driver.

Steps/Code to reproduce bug

  1. Set up a system that uses the vGPU driver, do a fresh reboot to make sure it is in a good state.
  2. Install Python and cudf.
  3. Run the attached test.sh script to invoke bug_reproducer.py in a loop
  4. Eventually (after several hundred invocations in testing), the execution time will jump to some large value (in original testing it was always roughly 5 min and 24 seconds) and bug_reproducer.py will output a message indicating the run time.
  5. Once the execution time has increased, the behavior will persist until the system is rebooted.

Expected behavior Execution time should be minimal, and remain minimal (about 1 or 2 seconds)

Environment overview (please complete the following information)

Environment details Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

Click here to see environment details

     **git***
     commit 54918d889b042fd4fa54e1fa75137ab719489dd4 (HEAD -> branch-22.04, origin/branch-22.04, origin/HEAD)
     Author: GALI PREM SAGAR 
     Date:   Wed Mar 23 19:58:25 2022 -0500

     Add `scipy` skip for a test (#10502)

     This PR skips `scipy` related code-path that pandas takes. This isn't a failure in CI, but cudf dev env doesn't require having a `scipy` package, hence adding an `importorskip`.

     Authors:
     - GALI PREM SAGAR (https://github.com/galipremsagar)

     Approvers:
     - Vyas Ramasubramani (https://github.com/vyasr)
     - Ashwin Srinath (https://github.com/shwina)

     URL: https://github.com/rapidsai/cudf/pull/10502
     **git submodules***

     ***OS Information***
     DISTRIB_ID=Ubuntu
     DISTRIB_RELEASE=20.04
     DISTRIB_CODENAME=focal
     DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
     NAME="Ubuntu"
     VERSION="20.04.2 LTS (Focal Fossa)"
     ID=ubuntu
     ID_LIKE=debian
     PRETTY_NAME="Ubuntu 20.04.2 LTS"
     VERSION_ID="20.04"
     HOME_URL="https://www.ubuntu.com/"
     SUPPORT_URL="https://help.ubuntu.com/"
     BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
     PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
     VERSION_CODENAME=focal
     UBUNTU_CODENAME=focal
     Linux nlp 5.4.0-105-generic #119-Ubuntu SMP Mon Mar 7 18:49:24 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

     ***GPU Information***
     Thu Mar 24 14:01:31 2022
     +-----------------------------------------------------------------------------+
     | NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
     |-------------------------------+----------------------+----------------------+
     | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
     | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
     |                               |                      |               MIG M. |
     |===============================+======================+======================|
     |   0  NVIDIA A30-24C      On   | 00000000:02:02.0 Off |                    0 |
     | N/A   N/A    P0    N/A /  N/A |   2192MiB / 24571MiB |      0%      Default |
     |                               |                      |             Disabled |
     +-------------------------------+----------------------+----------------------+

     +-----------------------------------------------------------------------------+
     | Processes:                                                                  |
     |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
     |        ID   ID                                                   Usage      |
     |=============================================================================|
     |  No running processes found                                                 |
     +-----------------------------------------------------------------------------+

     ***CPU***
     Architecture:                    x86_64
     CPU op-mode(s):                  32-bit, 64-bit
     Byte Order:                      Little Endian
     Address sizes:                   45 bits physical, 57 bits virtual
     CPU(s):                          16
     On-line CPU(s) list:             0-15
     Thread(s) per core:              1
     Core(s) per socket:              16
     Socket(s):                       1
     NUMA node(s):                    1
     Vendor ID:                       GenuineIntel
     CPU family:                      6
     Model:                           106
     Model name:                      Intel(R) Xeon(R) Gold 6354 CPU @ 3.00GHz
     Stepping:                        6
     CPU MHz:                         2992.968
     BogoMIPS:                        5985.93
     Hypervisor vendor:               VMware
     Virtualization type:             full
     L1d cache:                       768 KiB
     L1i cache:                       512 KiB
     L2 cache:                        20 MiB
     L3 cache:                        39 MiB
     NUMA node0 CPU(s):               0-15
     Vulnerability Itlb multihit:     KVM: Vulnerable
     Vulnerability L1tf:              Not affected
     Vulnerability Mds:               Not affected
     Vulnerability Meltdown:          Not affected
     Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
     Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
     Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
     Vulnerability Srbds:             Not affected
     Vulnerability Tsx async abort:   Not affected
     Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid md_clear flush_l1d arch_capabilities

     ***CMake***

     ***g++***
     /usr/bin/g++
     g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
     Copyright (C) 2019 Free Software Foundation, Inc.
     This is free software; see the source for copying conditions.  There is NO
     warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

     ***nvcc***

     ***Python***
     /home/tmckay/miniconda3/envs/cuda/bin/python
     Python 3.9.10

     ***Environment Variables***
     PATH                            : /home/tmckay/.local/bin:/home/tmckay/miniconda3/envs/cuda/bin:/home/tmckay/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/tmckay
     LD_LIBRARY_PATH                 :
     NUMBAPRO_NVVM                   :
     NUMBAPRO_LIBDEVICE              :
     CONDA_PREFIX                    : /home/tmckay/miniconda3/envs/cuda
     PYTHON_PATH                     :

     ***conda packages***
     /home/tmckay/miniconda3/condabin/conda
     # packages in environment at /home/tmckay/miniconda3/envs/cuda:
     #
     # Name                    Version                   Build  Channel
     _libgcc_mutex             0.1                 conda_forge    conda-forge
     _openmp_mutex             4.5                       1_gnu    conda-forge
     abseil-cpp                20210324.2           h9c3ff4c_0    conda-forge
     arrow-cpp                 6.0.1           py39hbe2a541_5_cuda    conda-forge
     arrow-cpp-proc            3.0.0                      cuda    conda-forge
     aws-c-auth                0.6.8                hadad3cd_1    conda-forge
     aws-c-cal                 0.5.12               h70efedd_7    conda-forge
     aws-c-common              0.6.17               h7f98852_0    conda-forge
     aws-c-compression         0.2.14               h7c7754b_7    conda-forge
     aws-c-event-stream        0.2.7               hd2be095_32    conda-forge
     aws-c-http                0.6.10               h416565a_3    conda-forge
     aws-c-io                  0.10.14              he836878_0    conda-forge
     aws-c-mqtt                0.7.10               h885097b_0    conda-forge
     aws-c-s3                  0.1.29               h8d70ed6_0    conda-forge
     aws-c-sdkutils            0.1.1                h7c7754b_4    conda-forge
     aws-checksums             0.1.12               h7c7754b_6    conda-forge
     aws-crt-cpp               0.17.10              h6ab17b9_5    conda-forge
     aws-sdk-cpp               1.9.160              h36ff4c5_0    conda-forge
     blas                      1.0                         mkl
     bottleneck                1.3.2            py39hdd57654_1
     brotlipy                  0.7.0           py39h27cfd23_1003
     bzip2                     1.0.8                h7f98852_4    conda-forge
     c-ares                    1.18.1               h7f98852_0    conda-forge
     ca-certificates           2022.2.1             h06a4308_0
     cachetools                5.0.0              pyhd8ed1ab_0    conda-forge
     certifi                   2021.10.8        py39h06a4308_2
     cffi                      1.15.0           py39hd667e15_1
     charset-normalizer        2.0.4              pyhd3eb1b0_0
     cryptography              36.0.0           py39h9ce1e76_0
     cuda-python               11.6.1           py39h3fd9d12_0    nvidia
     cudatoolkit               11.6.0               habf752d_9    nvidia
     cudf                      22.04.00a220317 cuda_11_py39_g22a9f350eb_267    rapidsai-nightly
     cupy                      10.2.0           py39hc3c280e_0    conda-forge
     dlpack                    0.5                  h9c3ff4c_0    conda-forge
     fastavro                  1.4.10           py39hb9d737c_0    conda-forge
     fastrlock                 0.8              py39he80948d_1    conda-forge
     fsspec                    2022.2.0           pyhd8ed1ab_0    conda-forge
     gflags                    2.2.2             he1b5a44_1004    conda-forge
     glog                      0.5.0                h48cff8f_0    conda-forge
     grpc-cpp                  1.42.0               ha1441d3_1    conda-forge
     idna                      3.3                pyhd3eb1b0_0
     intel-openmp              2021.4.0          h06a4308_3561
     keyutils                  1.6.1                h166bdaf_0    conda-forge
     krb5                      1.19.3               h3790be6_0    conda-forge
     ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
     libbrotlicommon           1.0.9                h7f98852_6    conda-forge
     libbrotlidec              1.0.9                h7f98852_6    conda-forge
     libbrotlienc              1.0.9                h7f98852_6    conda-forge
     libcudf                   22.04.00a220317 cuda11_g22a9f350eb_267    rapidsai-nightly
     libcurl                   7.82.0               h7bff187_0    conda-forge
     libedit                   3.1.20191231         he28a2e2_2    conda-forge
     libev                     4.33                 h516909a_1    conda-forge
     libevent                  2.1.10               h9b69904_4    conda-forge
     libffi                    3.4.2                h7f98852_5    conda-forge
     libgcc-ng                 11.2.0              h1d223b6_14    conda-forge
     libgfortran-ng            7.5.0               ha8ba4b0_17
     libgfortran4              7.5.0               ha8ba4b0_17
     libgomp                   11.2.0              h1d223b6_14    conda-forge
     libllvm11                 11.1.0               hf817b99_3    conda-forge
     libnghttp2                1.47.0               h727a467_0    conda-forge
     libnsl                    2.0.0                h7f98852_0    conda-forge
     libprotobuf               3.19.4               h780b84a_0    conda-forge
     librmm                    22.04.00a220318 cuda11_g374d551_41    rapidsai-nightly
     libssh2                   1.10.0               ha56f1ee_2    conda-forge
     libstdcxx-ng              11.2.0              he4da1e4_14    conda-forge
     libthrift                 0.15.0               he6d91bd_1    conda-forge
     libutf8proc               2.7.0                h7f98852_0    conda-forge
     libuuid                   2.32.1            h7f98852_1000    conda-forge
     libzlib                   1.2.11            h36c2ea0_1013    conda-forge
     llvmlite                  0.38.0           py39h1bbdace_0    conda-forge
     lz4-c                     1.9.3                h9c3ff4c_1    conda-forge
     mkl                       2021.4.0           h06a4308_640
     mkl-service               2.4.0            py39h7f8727e_0
     mkl_fft                   1.3.1            py39hd3c417c_0
     mkl_random                1.2.2            py39h51133e4_0
     ncurses                   6.3                  h9c3ff4c_0    conda-forge
     numba                     0.55.1           py39h56b8d98_0    conda-forge
     numexpr                   2.8.1            py39h6abb31d_0
     numpy                     1.21.2           py39h20f2e39_0
     numpy-base                1.21.2           py39h79a1101_0
     nvtx                      0.2.3            py39h3811e60_1    conda-forge
     openssl                   1.1.1n               h7f8727e_0
     orc                       1.7.1                h1be678f_1    conda-forge
     packaging                 21.3               pyhd8ed1ab_0    conda-forge
     pandas                    1.3.4            py39h8c16a72_0
     parquet-cpp               1.5.1                         2    conda-forge
     pip                       22.0.4             pyhd8ed1ab_0    conda-forge
     protobuf                  3.19.4           py39he80948d_0    conda-forge
     ptxcompiler               0.2.0            py39h107f55c_0    rapidsai
     pyarrow                   6.0.1           py39h1ed2e5d_5_cuda    conda-forge
     pycparser                 2.21               pyhd3eb1b0_0
     pyopenssl                 22.0.0             pyhd3eb1b0_0
     pyparsing                 3.0.7              pyhd8ed1ab_0    conda-forge
     pysocks                   1.7.1            py39h06a4308_0
     python                    3.9.10          h85951f9_2_cpython    conda-forge
     python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
     python_abi                3.9                      2_cp39    conda-forge
     pytz                      2021.3             pyhd8ed1ab_0    conda-forge
     re2                       2021.11.01           h9c3ff4c_0    conda-forge
     readline                  8.1                  h46c0cb4_0    conda-forge
     requests                  2.27.1             pyhd3eb1b0_0
     rmm                       22.04.00a220318 cuda11_py39_g374d551_41    rapidsai-nightly
     s2n                       1.3.0                h9b69904_0    conda-forge
     scipy                     1.7.3            py39hc147768_0
     setuptools                59.8.0           py39hf3d152e_0    conda-forge
     six                       1.16.0             pyh6c4a22f_0    conda-forge
     snappy                    1.1.8                he1b5a44_3    conda-forge
     spdlog                    1.8.5                h4bd325d_1    conda-forge
     sqlite                    3.37.1               h4ff8645_0    conda-forge
     tk                        8.6.12               h27826a3_0    conda-forge
     typing_extensions         4.1.1              pyha770c72_0    conda-forge
     tzdata                    2022a                h191b570_0    conda-forge
     urllib3                   1.26.8             pyhd3eb1b0_0
     wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
     xz                        5.2.5                h516909a_1    conda-forge
     zlib                      1.2.11            h36c2ea0_1013    conda-forge
     zstd                      1.5.2                ha95c52a_0    conda-forge

Additional context The script used to install the vgpu driver was ngc resource nvaie/vgpu_guest_driver:470.63.01-ubuntu20.04

reproducer.zip

tmckayus commented 2 years ago

I tried the same script on several systems with physcial gpus and the issue did not reproduce.

tmckayus commented 2 years ago

Just a note, made the timing more granular and the increase in time is in the dataframe creation line, not the import.

df = cudf.DataFrame({"cat": "dog"})

shwina commented 2 years ago

Thanks @tmckayus .

That's interesting. I wonder if the CUDA context initialization is what takes so long. Just to narrow the problem a bit, do you see the same behavior when trying to create a cupy array rather than a cuDF dataframe?

tmckayus commented 2 years ago

@shwina thanks I will try that and post back

tmckayus commented 2 years ago

@shwina here is an update

I discovered that on the test system (using A30-24C vGPU) there is a time component to the failure. Empirically, it seems that the system must be up for at least 20 minutes in order for the failure to be observed. After 20 minutes, it takes about 30 creations in a loop (either a cupy array or a cudf dataframe) to get the system in a state where the increased time can be observed. Experimenting with a 15 minute interval instead of 20 minutes yielded a case where ~300 iterations with a 1 second sleep were needed to observe the same increased time (and of course that is an additional 5 minutes, hence consistent with the 20 minute sleep time).

I've attached a new zipfile with updated reproducers. Description of the file contents:

bug_cudf_reproducer.py -- simple Python program that creates a dataframe in a loop and reports the import time for cudf (doesn't appear to ever be an issue) and time for the creation. An optional iteration count can be passed, default is loop until the failure threshold is crossed 3 times. The failure threshold is 1 second for a dataframe creation.

bug_cupy_reproducer.py -- simple Python program that creates a cupy array in a loop and reports the import time for cupy (doesn't appear to ever be an issue) and time for the creation. An optional iteration count can be passed, default is loop until the failure threshold is crossed 3 times. The failure threshold is 200 milliseconds for a cupy array creation.

time_test_cudf.sh -- bash script that runs bug_cupy_reproducer.py and bug_cudf_reproducer.py each 4 times to establish a baseline, then sleeps for 20 minutes and runs bug_cudf_reproducer.py until failure. Finally, it runs bug_cupy_reproducer.py and bug_cudf_reproducer.py again each 4 times to show that the degraded time is consistent across processes, and also that there is a very large initial delay in the first iteration after a process switch. These effects will persist until the system is rebooted.

time_test_cupy.sh -- bash script that runs bug_cupy_reproducer.py and bug_cudf_reproducer.py each 4 times to establish a baseline, then sleeps for 20 minutes and runs bug_cupy_reproducer.py until failure. Finally, it runs bug_cudf_reproducer.py and bug_cupy_reproducer.py again each 4 times to show that the degraded time is consistent across processes, and also that there is a very large initial delay in the first iteration after a process switch. These effects will persist until the system is rebooted. This script illustrates that it doesn't matter whether cupy or cudf is used in the main iteration after the sleep, either one will induce a failure that is consistent across both programs.

time_test_cudf.log -- output from a time_test_cudf.sh run. Note in particular:

time_test_cudf_skip_initial.log -- a second run, with calls to establish initial baseline times removed and beginning with the 20m delay. This was done on a newly rebooted system to demonstrate that it is not the initial calls to the cudf/cupy libraries that catalyzed the error, it happens shortly after the 20m delay (you know, Heisenberg :) ). The rest of the log is consistent with the first log above.

time_test_cupy.log -- essentially the same scenario, but using cupy as the main iteration instead of cudf to incur the degradation in time to illustrate that the mechanism doesn't matter. Note that the behaviors are consistent: it takes about 30 iterations of cupy array creation to notice the time degradation after a 20 minute sleep, and then the rest of the degraded times for execution and process switches are consistent.

cudf_cupy_reproducers.zip

tmckayus commented 2 years ago

This turned out to be a license issue -- if the vgpu is not correctly licensed, performance degrades after 20 minutes as noted here :)

https://docs.nvidia.com/grid/latest/grid-licensing-user-guide/index.html#software-enforcement-grid-licensing