[BUG] Long import times

mrocklin commented 5 years ago

Describe the bug

It takes a few seconds to import cudf today

Steps/Code to reproduce bug

In [1]: import numpy, pandas

In [2]: %time import cudf
CPU times: user 432 ms, sys: 132 ms, total: 564 ms
Wall time: 2.44 s

``` **git*** commit ab3f45857f641548f6d64d977908075d63c193bf (HEAD -> repr-html, mrocklin/repr-html) Author: Matthew Rocklin Date: Thu Jan 3 17:35:10 2019 -0800 Test repr for both large and small dataframes ***OS Information*** DGX_NAME="DGX Server" DGX_PRETTY_NAME="NVIDIA DGX Server" DGX_SWBUILD_DATE="2018-03-20" DGX_SWBUILD_VERSION="3.1.6" DGX_COMMIT_ID="1b0f58ecbf989820ce745a9e4836e1de5eea6cfd" DGX_SERIAL_NUMBER=QTFCOU8310024 DGX_OTA_VERSION="3.1.7" DGX_OTA_DATE="Thu Sep 27 20:07:53 PDT 2018" DISTRIB_ID=Ubuntu DISTRIB_RELEASE=16.04 DISTRIB_CODENAME=xenial DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS" NAME="Ubuntu" VERSION="16.04.5 LTS (Xenial Xerus)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 16.04.5 LTS" VERSION_ID="16.04" HOME_URL="http://www.ubuntu.com/" SUPPORT_URL="http://help.ubuntu.com/" BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/" VERSION_CODENAME=xenial UBUNTU_CODENAME=xenial Linux dgx16 4.4.0-135-generic #161-Ubuntu SMP Mon Aug 27 10:45:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux ***GPU Information*** Thu Jan 3 18:42:10 2019 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 396.44 Driver Version: 396.44 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | |===============================+======================+======================| | 0 Tesla V100-SXM2... On | 00000000:06:00.0 Off | 0 | | N/A 42C P0 59W / 300W | 1085MiB / 32510MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 1 Tesla V100-SXM2... On | 00000000:07:00.0 Off | 0 | | N/A 43C P0 46W / 300W | 11MiB / 32510MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 2 Tesla V100-SXM2... On | 00000000:0A:00.0 Off | 0 | | N/A 42C P0 46W / 300W | 11MiB / 32510MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 3 Tesla V100-SXM2... On | 00000000:0B:00.0 Off | 0 | | N/A 41C P0 47W / 300W | 11MiB / 32510MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 4 Tesla V100-SXM2... On | 00000000:85:00.0 Off | 0 | | N/A 42C P0 46W / 300W | 11MiB / 32510MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 5 Tesla V100-SXM2... On | 00000000:86:00.0 Off | 0 | | N/A 42C P0 44W / 300W | 11MiB / 32510MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 6 Tesla V100-SXM2... On | 00000000:89:00.0 Off | 0 | | N/A 44C P0 45W / 300W | 11MiB / 32510MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 7 Tesla V100-SXM2... On | 00000000:8A:00.0 Off | 0 | | N/A 41C P0 44W / 300W | 11MiB / 32510MiB | 0% Default | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: GPU Memory | | GPU PID Type Process name Usage | |=============================================================================| | 0 64394 C ...klin/miniconda/envs/cudf_dev/bin/python 644MiB | | 0 65545 C ...klin/miniconda/envs/cudf_dev/bin/python 430MiB | +-----------------------------------------------------------------------------+ ***CPU*** Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 80 On-line CPU(s) list: 0-79 Thread(s) per core: 2 Core(s) per socket: 20 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 79 Model name: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz Stepping: 1 CPU MHz: 2030.789 CPU max MHz: 3600.0000 CPU min MHz: 1200.0000 BogoMIPS: 4391.76 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 51200K NUMA node0 CPU(s): 0-19,40-59 NUMA node1 CPU(s): 20-39,60-79 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt ssbd ibrs ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts flush_l1d ***CMake*** /home/nfs/mrocklin/miniconda/envs/cudf_dev/bin/cmake cmake version 3.13.2 CMake suite maintained and supported by Kitware (kitware.com/cmake). ***g++*** /usr/bin/g++ g++ (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609 Copyright (C) 2015 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. ***nvcc*** ***Python*** /home/nfs/mrocklin/miniconda/envs/cudf_dev/bin/python Python 3.5.5 ***Environment Variables*** PATH : /home/nfs/mrocklin/miniconda/envs/cudf_dev/bin:/home/nfs/mrocklin/miniconda/bin:/home/nfs/mrocklin/miniconda/bin:/home/nfs/mrocklin/miniconda/bin:/home/nfs/mrocklin/bin:/home/nfs/mrocklin/.local/bin:/home/nfs/mrocklin/miniconda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games LD_LIBRARY_PATH : NUMBAPRO_NVVM : NUMBAPRO_LIBDEVICE : CONDA_PREFIX : /home/nfs/mrocklin/miniconda/envs/cudf_dev PYTHON_PATH : ***conda packages*** /home/nfs/mrocklin/miniconda/bin/conda # packages in environment at /home/nfs/mrocklin/miniconda/envs/cudf_dev: # # Name Version Build Channel alabaster 0.7.12 py_0 conda-forge arrow-cpp 0.10.0 py35h70250a7_0 conda-forge asn1crypto 0.24.0 py35_3 conda-forge atomicwrites 1.2.1 py_0 conda-forge attrs 18.2.0 py_0 conda-forge babel 2.6.0 py_1 conda-forge backcall 0.1.0 py_0 conda-forge blas 1.0 mkl bleach 3.0.2 py_1 conda-forge bokeh 0.13.0 py35_0 boost 1.67.0 py35h3e44d54_0 conda-forge boost-cpp 1.67.0 h3a22d5f_0 conda-forge bzip2 1.0.6 h470a237_2 conda-forge ca-certificates 2018.03.07 0 certifi 2018.8.24 py35_1 cffi 1.11.5 py35h5e8e0c9_1 conda-forge chardet 3.0.4 py35_3 conda-forge click 7.0 py_0 conda-forge cloudpickle 0.6.1 py_0 conda-forge cmake 3.13.2 h011004d_0 conda-forge commonmark 0.5.4 py_2 conda-forge cryptography 2.3.1 py35hdffb7b8_0 conda-forge cryptography-vectors 2.3.1 py35_0 conda-forge cudatoolkit 9.2 0 cudf 0.4.0+385.g81a187d curl 7.63.0 h74213dd_0 conda-forge cython 0.28.5 py35hfc679d8_0 conda-forge cytoolz 0.9.0.1 py35h470a237_0 conda-forge dask 1.0.0+51.g2ca205b dask-core 1.0.0 py_0 conda-forge dask-cuda 0.0.0 dask-cudf 0.0.1+222.g10a9f90.dirty decorator 4.3.0 py_0 conda-forge distributed 1.23.2 py35_1 conda-forge distributed 1.25.1+10.ga0d0ed2 docutils 0.14 py35_1 conda-forge entrypoints 0.2.3 py35_2 conda-forge expat 2.2.5 hfc679d8_2 conda-forge future 0.16.0 py35_2 conda-forge gmp 6.1.2 hfc679d8_0 conda-forge heapdict 1.0.0 py35_1000 conda-forge icu 58.2 hfc679d8_0 conda-forge idna 2.7 py35_2 conda-forge imagesize 1.1.0 py_0 conda-forge intel-openmp 2019.1 144 ipykernel 5.1.0 pyh24bf2e0_0 conda-forge ipython 7.0.1 py35h24bf2e0_0 conda-forge ipython_genutils 0.2.0 py_1 conda-forge jedi 0.12.1 py35_0 conda-forge jinja2 2.10 py_1 conda-forge jsonschema 2.6.0 py35_2 conda-forge jupyter_client 5.2.4 py_0 conda-forge jupyter_core 4.4.0 py_0 conda-forge krb5 1.16.2 hbb41f41_0 conda-forge libcurl 7.63.0 hbdb9355_0 conda-forge libedit 3.1.20170329 haf1bffa_1 conda-forge libffi 3.2.1 hfc679d8_5 conda-forge libgcc-ng 8.2.0 hdf63c60_1 libgdf-cffi 0.4.0 libgfortran-ng 7.2.0 hdf63c60_3 conda-forge librmm-cffi 0.4.0 libsodium 1.0.16 h470a237_1 conda-forge libssh2 1.8.0 h5b517e9_3 conda-forge libstdcxx-ng 8.2.0 hdf63c60_1 libuv 1.24.1 h470a237_0 conda-forge llvmlite 0.27.0 py35hf484d3e_0 numba Markdown 2.6.11 markupsafe 1.0 py35h470a237_1 conda-forge mistune 0.8.3 py35h470a237_2 conda-forge mkl 2018.0.3 1 mkl_fft 1.0.9 py35_0 conda-forge mkl_random 1.0.1 py35_0 conda-forge more-itertools 4.3.0 py35_0 conda-forge msgpack-python 0.5.6 py35h2d50403_3 conda-forge nbconvert 5.3.1 py_1 conda-forge nbformat 4.4.0 py_1 conda-forge ncurses 6.1 hfc679d8_2 conda-forge notebook 5.7.0 py35_0 conda-forge numba 0.42.0 np115py35hf484d3e_0 numba numpy 1.15.2 py35h1d66e8a_0 numpy-base 1.15.2 py35h81de0dd_0 numpydoc 0.8.0 py_1 conda-forge nvstrings 0.1.0 cuda9.2_py35_0 nvidia openssl 1.0.2p h14c3975_0 packaging 18.0 py_0 conda-forge pandas 0.20.3 py35_1 conda-forge pandoc 2.5 0 conda-forge pandocfilters 1.4.2 py_1 conda-forge parquet-cpp 1.5.0.pre h83d4a3d_0 conda-forge parso 0.3.1 py_0 conda-forge pathlib2 2.3.2 py35_0 conda-forge pexpect 4.6.0 py35_0 conda-forge pickleshare 0.7.5 py35_0 conda-forge pip 18.0 py35_1001 conda-forge pluggy 0.8.0 py_0 conda-forge prometheus_client 0.5.0 py_0 conda-forge prompt_toolkit 2.0.7 py_0 conda-forge psutil 5.4.7 py35h470a237_1 conda-forge ptyprocess 0.6.0 py35_1000 conda-forge py 1.7.0 py_0 conda-forge pyarrow 0.10.0 py35hfc679d8_0 conda-forge pycparser 2.19 py_0 conda-forge pygments 2.3.1 py_0 conda-forge pyopenssl 18.0.0 py35_0 conda-forge pyparsing 2.3.0 py_0 conda-forge pysocks 1.6.8 py35_2 conda-forge pytest 4.0.2 pytest 3.8.1 py35_0 python 3.5.5 h5001a0f_2 conda-forge python-dateutil 2.7.5 py_0 conda-forge pytz 2018.7 py_0 conda-forge pyyaml 3.13 py35h470a237_1 conda-forge pyzmq 17.1.2 py35hae99301_0 conda-forge readline 7.0 haf1bffa_1 conda-forge recommonmark 0.4.0 py_2 conda-forge requests 2.19.1 py35_1 conda-forge rhash 1.3.6 h470a237_1 conda-forge send2trash 1.5.0 py_0 conda-forge setuptools 40.4.3 py35_0 conda-forge simplegeneric 0.8.1 py_1 conda-forge six 1.11.0 py35_1 conda-forge snowballstemmer 1.2.1 py_1 conda-forge sortedcontainers 2.1.0 py_0 conda-forge sphinx 1.8.1 py35_0 conda-forge sphinx-markdown-tables 0.0.9 sphinx_rtd_theme 0.4.2 py_0 conda-forge sphinxcontrib-websupport 1.1.0 py_1 conda-forge sqlite 3.26.0 hb1c47c0_0 conda-forge tblib 1.3.2 py_1 conda-forge terminado 0.8.1 py35_1 conda-forge testpath 0.3.1 py35_1 conda-forge tk 8.6.9 ha92aebf_0 conda-forge toolz 0.9.0 py_1 conda-forge tornado 5.1.1 py35h470a237_0 conda-forge traitlets 4.3.2 py35_0 conda-forge urllib3 1.23 py35_1 conda-forge wcwidth 0.1.7 py_1 conda-forge webencodings 0.5.1 py_1 conda-forge wheel 0.32.0 py35_1000 conda-forge xz 5.2.4 h470a237_1 conda-forge yaml 0.1.7 h470a237_1 conda-forge zeromq 4.2.5 hfc679d8_6 conda-forge zict 0.1.3 py_0 conda-forge zlib 1.2.11 h470a237_3 conda-forge ```

mrocklin commented 5 years ago

Increasing

In [1]: %time import cudf
CPU times: user 556 ms, sys: 204 ms, total: 760 ms
Wall time: 3.39 s

mrocklin commented 5 years ago

Looking at the profile it looks like we're doing a lot of odd things at import time

         367424 function calls (356126 primitive calls) in 3.046 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      572    0.610    0.001    0.621    0.001 <frozen importlib._bootstrap_external>:914(get_data)
     3250    0.605    0.000    0.605    0.000 {built-in method posix.stat}
        3    0.573    0.191    0.573    0.191 {method 'read' of '_io.BufferedReader' objects}
      599    0.296    0.000    0.296    0.000 {built-in method io.open}
      597    0.256    0.000    0.256    0.000 {method 'close' of '_io.BufferedReader' objects}
    74/73    0.079    0.001    0.082    0.001 {built-in method _imp.create_dynamic}
      572    0.076    0.000    0.076    0.000 {built-in method marshal.loads}
        1    0.040    0.040    0.041    0.041 nvvm.py:106(__new__)
       82    0.039    0.000    0.039    0.000 {built-in method posix.listdir}
2000/1992    0.028    0.000    0.075    0.000 {built-in method builtins.__build_class__}
      547    0.019    0.000    0.019    0.000 ffi.py:106(__call__)
        2    0.017    0.008    0.017    0.008 {method 'dlopen' of 'CompiledFFI' objects}
    74/59    0.015    0.000    0.901    0.015 {built-in method _imp.exec_dynamic}
      921    0.015    0.000    0.015    0.000 {method 'sub' of 're.Pattern' objects}
      613    0.013    0.000    0.013    0.000 {method 'findall' of 're.Pattern' objects}
      572    0.011    0.000    0.011    0.000 {method 'read' of '_io.FileIO' objects}
     1093    0.010    0.000    0.400    0.000 <frozen importlib._bootstrap_external>:1356(find_spec)
        7    0.009    0.001    0.009    0.001 {built-in method posix.read}
        4    0.009    0.002    0.009    0.002 {built-in method _posixsubprocess.fork_exec}
      572    0.009    0.000    0.732    0.001 <frozen importlib._bootstrap_external>:793(get_code)
        1    0.008    0.008    0.008    0.008 {method 'dot' of 'numpy.ndarray' objects}
      211    0.008    0.000    0.008    0.000 sre_compile.py:276(_optimize_charset)
      820    0.007    0.000    0.421    0.001 <frozen importlib._bootstrap>:882(_find_spec)
      592    0.006    0.000    0.586    0.001 __init__.py:78(open_resource)
        4    0.006    0.001    0.006    0.001 {built-in method _ctypes.dlopen}
     8994    0.006    0.000    0.006    0.000 {built-in method builtins.hasattr}

brandon-b-miller commented 4 years ago

Profiled this a bit today and on my end it seems like the issue is inside cudf._cuda.gpu.getDeviceCount().

jrhemstad commented 4 years ago

Profiled this a bit today and on my end it seems like the issue is inside cudf._cuda.gpu.getDeviceCount().

I doubt that problem is with that function. My guess is that's just the first function that invokes a CUDA runtime API which initializes the context and other first time setup stuff.

kkraus14 commented 4 years ago

Profiled this a bit today and on my end it seems like the issue is inside cudf._cuda.gpu.getDeviceCount().

Could you dump the profile here by any chance?

brandon-b-miller commented 4 years ago

Here's the top piece of the profile.

         785450 function calls (763821 primitive calls) in 5.156 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    3.047    3.047    3.047    3.047 {cudf._cuda.gpu.getDeviceCount}
      120    0.358    0.003    0.358    0.003 {method 'read' of '_io.BufferedReader' objects}
      945    0.293    0.000    0.307    0.000 <frozen importlib._bootstrap_external>:914(get_data)
     5895    0.280    0.000    0.280    0.000 {built-in method posix.stat}
  204/202    0.143    0.001    0.146    0.001 {built-in method _imp.create_dynamic}
      945    0.109    0.000    0.109    0.000 {built-in method marshal.loads}
      506    0.089    0.000    0.089    0.000 {built-in method posix.listdir}
  204/130    0.045    0.000    0.161    0.001 {built-in method _imp.exec_dynamic}
2851/2818    0.038    0.000    0.133    0.000 {built-in method builtins.__build_class__}
        4    0.025    0.006    0.025    0.006 {built-in method _ctypes.dlopen}
      518    0.024    0.000    0.024    0.000 {built-in method builtins.compile}
     1826    0.016    0.000    0.312    0.000 <frozen importlib._bootstrap_external>:1356(find_spec)
      561    0.015    0.000    0.015    0.000 ffi.py:112(__call__)
      132    0.015    0.000    0.015    0.000 {built-in method io.open}
     1145    0.014    0.000    0.014    0.000 {method 'sub' of 're.Pattern' objects}
      945    0.014    0.000    0.014    0.000 {method 'read' of '_io.FileIO' objects}
        8    0.013    0.002    0.013    0.002 {built-in method posix.read}
  805/193    0.013    0.000    0.035    0.000 sre_parse.py:469(_parse)
        5    0.012    0.002    0.012    0.002 {built-in method _posixsubprocess.fork_exec}
      945    0.011    0.000    0.460    0.000 <frozen importlib._bootstrap_external>:793(get_code)
      614    0.010    0.000    0.012    0.000 sre_compile.py:276(_optimize_charset)
      501    0.010    0.000    0.010    0.000 {method 'findall' of 're.Pattern' objects}
     1215    0.009    0.000    0.333    0.000 <frozen importlib._bootstrap>:882(_find_spec)
16074/16069    0.008    0.000    0.008    0.000 {built-in method builtins.hasattr}
    10147    0.008    0.000    0.020    0.000 <frozen importlib._bootstrap_external>:56(_path_join)
    18447    0.008    0.000    0.009    0.000 {built-in method builtins.getattr}
   1343/1    0.007    0.000    5.158    5.158 {built-in method builtins.exec}
     1578    0.007    0.000    0.021    0.000 version.py:198(__init__)
      114    0.007    0.000    0.021    0.000 __init__.py:1617(_get)
     3039    0.007    0.000    0.010    0.000 <frozen importlib._bootstrap>:157(_get_module_lock)
        5    0.007    0.001    0.007    0.001 {built-in method posix.waitpid}
        3    0.007    0.002    0.008    0.003 six.py:1(<module>)
    51125    0.007    0.000    0.008    0.000 {built-in method builtins.isinstance}
     1890    0.007    0.000    0.015    0.000 <frozen importlib._bootstrap_external>:271(cache_from_source)
15643/15615    0.007    0.000    0.011    0.000 abstract.py:121(__hash__)
    10147    0.007    0.000    0.009    0.000 <frozen importlib._bootstrap_external>:58(<listcomp>)
     2850    0.006    0.000    0.006    0.000 {built-in method __new__ of type object at 0x55ed34c9f240}
      349    0.006    0.000    0.007    0.000 templates.py:614(make_overload_template)
   1239/1    0.006    0.000    5.158    5.158 <frozen importlib._bootstrap>:978(_find_and_load)
     3039    0.006    0.000    0.006    0.000 <frozen importlib._bootstrap>:78(acquire)
   1173/1    0.006    0.000    5.157    5.157 <frozen importlib._bootstrap>:663(_load_unlocked)
    18919    0.006    0.000    0.006    0.000 sre_parse.py:233(__next)
     1161    0.005    0.000    0.026    0.000 <frozen importlib._bootstrap>:504(_init_module_attrs)
 1337/180    0.005    0.000    0.018    0.000 sre_compile.py:71(_compile)
   1239/1    0.005    0.000    5.158    5.158 <frozen importlib._bootstrap>:948(_find_and_load_unlocked)
14699/14694    0.005    0.000    0.006    0.000 {method 'join' of 'str' objects}
       40    0.005    0.000    0.023    0.001 castgraph.py:94(propagate)
     3039    0.005    0.000    0.005    0.000 <frozen importlib._bootstrap>:103(release)
     1149    0.005    0.000    0.006    0.000 <frozen importlib._bootstrap_external>:574(spec_from_file_location)
 6132/741    0.004    0.000    1.570    0.002 <frozen importlib._bootstrap>:1009(_handle_fromlist)
    31763    0.004    0.000    0.004    0.000 {method 'startswith' of 'str' objects}
    10345    0.004    0.000    0.004    0.000 <frozen importlib._bootstrap>:222(_verbose_message)
     1208    0.004    0.000    0.317    0.000 <frozen importlib._bootstrap_external>:1240(_get_spec)
    16689    0.004    0.000    0.009    0.000 sre_parse.py:254(get)
       42    0.004    0.000    0.008    0.000 enum.py:135(__new__)
       77    0.004    0.000    0.009    0.000 __init__.py:316(namedtuple)
      790    0.003    0.000    0.004    0.000 version.py:343(_cmpkey)
     1607    0.003    0.000    0.003    0.000 {method 'search' of 're.Pattern' objects}
      945    0.003    0.000    0.005    0.000 <frozen importlib._bootstrap_external>:438(_classify_pyc)
    10024    0.003    0.000    0.003    0.000 {method 'rpartition' of 'str' objects}
    945/1    0.003    0.000    5.157    5.157 <frozen importlib._bootstrap_external>:722(exec_module)
      945    0.003    0.000    0.115    0.000 <frozen importlib._bootstrap_external>:523(_compile_bytecode)
     2487    0.003    0.000    0.227    0.000 <frozen importlib._bootstrap_external>:84(_path_is_mode_type)
    13657    0.003    0.000    0.006    0.000 {method 'get' of 'dict' objects}
23704/22175    0.003    0.000    0.003    0.000 {built-in method builtins.len}
2986/2855    0.003    0.000    0.004    0.000 {method 'format' of 'str' objects}
      680    0.003    0.000    0.150    0.000 __init__.py:2126(distributions_from_metadata)
    22762    0.003    0.000    0.003    0.000 {method 'rstrip' of 'str' objects}
     9152    0.003    0.000    0.004    0.000 sre_parse.py:164(__getitem__)
     5258    0.003    0.000    0.250    0.000 <frozen importlib._bootstrap_external>:74(_path_stat)
      689    0.003    0.000    0.003    0.000 functions.py:87(__init__)
     1149    0.003    0.000    0.010    0.000 <frozen importlib._bootstrap_external>:1351(_get_spec)
     1161    0.003    0.000    0.005    0.000 <frozen importlib._bootstrap>:318(__exit__)
    29445    0.003    0.000    0.003    0.000 {method 'append' of 'list' objects}
    17714    0.003    0.000    0.003    0.000 abstract.py:99(key)
1161/1158    0.003    0.000    0.177    0.000 <frozen importlib._bootstrap>:576(module_from_spec)
      346    0.003    0.000    0.063    0.000 __init__.py:2585(from_location)
 1568/435    0.003    0.000    0.003    0.000 sre_parse.py:174(getwidth)
      501    0.003    0.000    0.029    0.000 textwrap.py:414(dedent)
     1154    0.003    0.000    0.069    0.000 re.py:271(_compile)
        2    0.002    0.001    0.002    0.001 {method 'poll' of 'select.poll' objects}
     3626    0.002    0.000    0.004    0.000 typing.py:704(__setattr__)
      776    0.002    0.000    0.004    0.000 functools.py:37(update_wrapper)
     7138    0.002    0.000    0.003    0.000 {built-in method builtins.setattr}
      788    0.002    0.000    0.006    0.000 version.py:131(_legacy_cmpkey)
     1800    0.002    0.000    0.010    0.000 <frozen importlib._bootstrap>:194(_lock_unlock_module)
      131    0.002    0.000    0.003    0.000 templates.py:816(make_overload_attribute_template)
      923    0.002    0.000    0.003    0.000 posixpath.py:75(join)
18763/18019    0.002    0.000    0.002    0.000 {built-in method builtins.hash}
     1149    0.002    0.000    0.012    0.000 <frozen importlib._bootstrap_external>:369(_get_cached)
      150    0.002    0.000    0.005    0.000 <frozen importlib._bootstrap_external>:1190(_path_hooks)
      945    0.002    0.000    0.002    0.000 {built-in method _imp._fix_co_filename}
     3092    0.002    0.000    0.003    0.000 {built-in method builtins.any}
     2519    0.002    0.000    0.003    0.000 <frozen importlib._bootstrap>:416(parent)
     4454    0.002    0.000    0.002    0.000 {method 'match' of 're.Pattern' objects}
     1890    0.002    0.000    0.003    0.000 <frozen importlib._bootstrap_external>:62(_path_split)
     2032    0.002    0.000    0.007    0.000 castgraph.py:41(insert)
  580/180    0.002    0.000    0.036    0.000 sre_parse.py:411(_parse_sub)
      481    0.002    0.000    0.007    0.000 typing.py:603(__init__)
     3434    0.002    0.000    0.003    0.000 version.py:114(_parse_version_parts)
     8269    0.002    0.000    0.002    0.000 {method 'group' of 're.Match' objects}
     2835    0.002    0.000    0.003    0.000 <frozen importlib._bootstrap_external>:51(_r_long)
     2094    0.002    0.000    0.014    0.000 <frozen importlib._bootstrap>:403(cached)
      454    0.002    0.000    0.019    0.000 __init__.py:1327(safe_version)
     1226    0.002    0.000    0.002    0.000 <frozen importlib._bootstrap>:176(cb)
2306/2299    0.002    0.000    0.002    0.000 abstract.py:124(__eq__)
     3816    0.002    0.000    0.002    0.000 <frozen importlib._bootstrap>:859(__exit__)
      960    0.001    0.000    0.004    0.000 typing.py:113(_type_check)
      612    0.001    0.000    0.002    0.000 _inspect.py:67(getargs)
     4435    0.001    0.000    0.001    0.000 {method 'endswith' of 'str' objects}
     2215    0.001    0.000    0.227    0.000 <frozen importlib._bootstrap_external>:93(_path_isfile)
   706/50    0.001    0.000    1.575    0.031 {built-in method builtins.__import__}
     1226    0.001    0.000    0.002    0.000 <frozen importlib._bootstrap>:58(__init__)
  680/342    0.001    0.000    0.002    0.000 {built-in method _abc._abc_subclasscheck}
     1025    0.001    0.000    0.001    0.000 {method 'split' of 're.Pattern' objects}
       75    0.001    0.000    1.800    0.024 __init__.py:1(<module>)
     1239    0.001    0.000    0.012    0.000 <frozen importlib._bootstrap>:147(__enter__)
  152/132    0.001    0.000    0.023    0.000 {built-in method builtins.sorted}
1410/1302    0.001    0.000    0.018    0.000 typing.py:248(inner)
     4300    0.001    0.000    0.002    0.000 version.py:65(_compare)
  818/815    0.001    0.000    0.009    0.000 abstract.py:63(__call__)
      945    0.001    0.000    0.002    0.000 <frozen importlib._bootstrap_external>:471(_validate_timestamp_pyc)
     6786    0.001    0.000    0.001    0.000 {built-in method posix.fspath}
     5184    0.001    0.000    0.001    0.000 {built-in method builtins.min}
     8223    0.001    0.000    0.001    0.000 {built-in method _imp.acquire_lock}
     1910    0.001    0.000    0.007    0.000 <frozen importlib._bootstrap_external>:1203(_path_importer_cache)
     1388    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap>:369(__init__)
      346    0.001    0.000    0.003    0.000 __init__.py:686(add)
      279    0.001    0.000    0.001    0.000 {built-in method _abc._abc_init}
     1208    0.001    0.000    0.318    0.000 <frozen importlib._bootstrap_external>:1272(find_spec)
      950    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap>:35(_new_module)
     3118    0.001    0.000    0.001    0.000 version.py:207(<genexpr>)
      454    0.001    0.000    0.003    0.000 version.py:236(__str__)
      886    0.001    0.000    0.002    0.000 __init__.py:2644(key)
     2921    0.001    0.000    0.001    0.000 {built-in method from_bytes}
      286    0.001    0.000    0.005    0.000 overrides.py:72(verify_matching_signatures)
   1768/1    0.001    0.000    5.156    5.156 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
     8223    0.001    0.000    0.001    0.000 {built-in method _imp.release_lock}
      818    0.001    0.000    0.004    0.000 abstract.py:51(_intern)
      945    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:401(_check_name_wrapper)
      590    0.001    0.000    0.041    0.000 __init__.py:2772(_get_metadata)
      496    0.001    0.000    0.002    0.000 posixpath.py:154(dirname)
      349    0.001    0.000    0.002    0.000 extending.py:57(overload)
      584    0.001    0.000    0.002    0.000 enum.py:70(__setitem__)
  344/238    0.001    0.000    0.009    0.000 typing.py:340(__getitem__)
     3626    0.001    0.000    0.002    0.000 typing.py:590(_is_dunder)
       66    0.001    0.000    0.001    0.000 templates.py:667(make_intrinsic_template)
     3816    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap>:855(__enter__)
      349    0.001    0.000    0.012    0.000 extending.py:111(decorate)
     1124    0.001    0.000    0.015    0.000 version.py:24(parse)
       62    0.001    0.000    0.001    0.000 {built-in method cupy.core._kernel.create_ufunc}
      360    0.001    0.000    0.180    0.000 __init__.py:2039(find_on_path)
     1708    0.001    0.000    0.002    0.000 __init__.py:2071(dist_factory)
      180    0.001    0.000    0.064    0.000 sre_compile.py:759(compile)
      945    0.001    0.000    0.011    0.000 <frozen importlib._bootstrap_external>:951(path_stats)
1533/1102    0.001    0.000    0.001    0.000 typing.py:664(__hash__)
     6095    0.001    0.000    0.001    0.000 {built-in method _thread.get_ident}
     1161    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap>:311(__enter__)
      471    0.001    0.000    0.002    0.000 copy.py:268(_reconstruct)
     2268    0.001    0.000    0.001    0.000 {method 'split' of 'str' objects}
      150    0.001    0.000    0.025    0.000 <frozen importlib._bootstrap_external>:1404(_fill_cache)
        8    0.001    0.000    0.001    0.000 {method 'read' of '_io.TextIOWrapper' objects}
     2032    0.001    0.000    0.003    0.000 castgraph.py:50(get)
      686    0.001    0.000    0.001    0.000 genericpath.py:121(_splitext)
     2044    0.001    0.000    0.002    0.000 castgraph.py:67(__getitem__)
     4297    0.001    0.000    0.001    0.000 sre_parse.py:249(match)
     1239    0.001    0.000    0.004    0.000 <frozen importlib._bootstrap>:151(__exit__)
      472    0.001    0.000    0.003    0.000 copy.py:66(copy)
      316    0.001    0.000    0.026    0.000 overrides.py:166(decorator)
      180    0.001    0.000    0.006    0.000 sre_compile.py:536(_compile_info)
     1766    0.001    0.000    0.001    0.000 {method 'update' of 'dict' objects}
     2247    0.001    0.000    0.001    0.000 {method 'rfind' of 'str' objects}
     3058    0.001    0.000    0.002    0.000 {method 'add' of 'set' objects}
     1208    0.001    0.000    0.001    0.000 {built-in method _imp.is_frozen}
      807    0.001    0.000    0.002    0.000 ffi.py:56(__getattr__)
     2229    0.001    0.000    0.003    0.000 abstract.py:127(__ne__)
      355    0.001    0.000    0.001    0.000 ntpath.py:122(splitdrive)
      686    0.001    0.000    0.002    0.000 posixpath.py:121(splitext)
      213    0.001    0.000    0.001    0.000 {method 'splitlines' of 'str' objects}
      335    0.001    0.000    0.001    0.000 pyparsing.py:1144(__init__)
     4642    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap>:321(<genexpr>)
        4    0.001    0.000    0.001    0.000 {method 'readlines' of '_io._IOBase' objects}
     1239    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap>:143(__init__)
      630    0.001    0.000    0.001    0.000 enum.py:376(__setattr__)
     1080    0.001    0.000    0.001    0.000 {method 'extend' of 'list' objects}
      355    0.001    0.000    0.003    0.000 __init__.py:1494(_validate_resource_path)
      150    0.001    0.000    0.002    0.000 <frozen importlib._bootstrap_external>:1319(__init__)
      614    0.001    0.000    0.001    0.000 sre_compile.py:249(_compile_charset)
     2172    0.001    0.000    0.001    0.000 posixpath.py:41(_get_sep)
      353    0.001    0.000    0.001    0.000 posixpath.py:144(basename)
      945    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap_external>:884(__init__)
     2594    0.001    0.000    0.001    0.000 sre_parse.py:172(append)
      616    0.001    0.000    0.003    0.000 _inspect.py:98(getargspec)

harrism commented 4 years ago

Was this on a DGX-1 as previously reported?

The first CUDA call will initialise the context, as Jake pointed out. The time will be higher on multiGPU machines. There is also a cost that is proportional to the sysmem size, I believe.

Loading all the libcudf device code (currently near 300MB) to each GPU is also a substantial startup cost. Eliminating legacy should help this. Also we could either move to device-side dispatch where it makes sense (binops, reductions), and/or do more JIT.

kkraus14 commented 4 years ago

@harrism running import cudf doesn't actually initialize a CUDA context or copy the device code to device memory. It does initialize the driver in validating the GPU architecture / driver version / toolkit version and it looks like that's taking ~3s per above. There's an additional ~2s just in Python imports but I'm not seeing anything stick out in the profile above that's immediately actionable.

harrism commented 4 years ago

What does getDeviceCount call?

kkraus14 commented 4 years ago

What does getDeviceCount call?

cudaGetDeviceCount: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g18808e54893cfcaafefeab31a73cc55f

This apparently doesn't actually create a context though.

brandon-b-miller commented 4 years ago

This was on a DGX-1. I have noticed that subsequent imports don't take nearly as long. Excluding the three seconds up front to call that cuda function, my profile looks fairly similar to the original one in this thread. So I'm actually not sure if 3 seconds spent inside getDeviceCount is what the issue was originally raised about, it might have been the rest of the time spent modulo the cuda work.

harrism commented 4 years ago

This apparently doesn't actually create a context though.

Where do the docs say that?

kkraus14 commented 4 years ago

This apparently doesn't actually create a context though.

Where do the docs say that?

They don't, an internal slack thread @jrhemstad had with CUDA developers confirmed it though. I can try to dig up a link if you're interested.

harrism commented 4 years ago

OK, then next question: can we find out if all of the 3s in the profile for that call are in the CUDA runtime/driver? (i.e. is there any Numba overhead?)

kkraus14 commented 4 years ago

OK, then next question: can we find out if all of the 3s in the profile for that call are in the CUDA runtime/driver? (i.e. is there any Numba overhead?)

These aren't called via Numba, we wrote our own Cython bindings here where there should be basically zero overhead.

harrism commented 4 years ago

Sorry, misunderstood. Well, I'm guessing the runtime is doing some sort of initialization at that call...

kkraus14 commented 4 years ago

Sorry, misunderstood. Well, I'm guessing the runtime is doing some sort of initialization at that call...

From my understanding it does the driver API initialization that happens implicitly when a context is created. Basically a cuInit(0) call. https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__INITIALIZE.html

harrism commented 4 years ago

Can you try compiling this? Compile this code:

#include <stdio.h>
#include <chrono>

int main()
{
  int numgpus=0;
  printf("starting\n");
  for (int i = 0;  i < 2; i++) {
    auto start = std::chrono::high_resolution_clock::now();
    cudaGetDeviceCount(&numgpus);
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;

    printf("iter: %d: GPUs: %d, elapsed time: %lf\n", i, numgpus, elapsed_seconds.count());
  }
  return 0;
}

Using nvcc test.cu

When I run on a DGX I get the following. Second result is limiting visible devices to a single GPU (it's faster).

(base) mharris@dgx02:~$ ./a.out
starting
iter: 0: GPUs: 8, elapsed time: 0.394263
iter: 1: GPUs: 8, elapsed time: 0.000000
(base) mharris@dgx02:~$ CUDA_VISIBLE_DEVICES=3 ./a.out
starting
iter: 0: GPUs: 1, elapsed time: 0.079943
iter: 1: GPUs: 1, elapsed time: 0.000000

So with 8 GPUs it takes .39s -- not 3s, but still significant. And the second time it takes less than a microsecond. It must be doing something. I think that something is initializing the driver context. And if libcudf is linked into the module, then its 300MB fatbin is probably being loaded.

You can try on your DGX to see if its different.

kkraus14 commented 4 years ago

And if libcudf is linked into the module, then its 300MB fatbin is probably being loaded.

libcudf is absolutely linked into the module.

jakirkham commented 4 years ago

There were some untyped variables in those Cython bindings. Doubt that matters much, but it may affect this a little bit and is easy to fix. Sent PR ( https://github.com/rapidsai/cudf/pull/4925 ) to type them.

jakirkham commented 4 years ago

Reworked the original proposal as there were some device functions that needed some changes to handle their errors respectively. Otherwise largely the same idea. Submitted as PR ( https://github.com/rapidsai/cudf/pull/4943 ).

jakirkham commented 4 years ago

Retested this and it appears to be ~2s now. Longer or shorter depending on the run. Occasionally see an outlier.

kkraus14 commented 4 years ago

Retested this and it appears to be ~2s now. Longer or shorter depending on the run. Occasionally see an outlier.

What kind of machine did you test this on? I suspect this scales somewhat linearly with the number of visible GPUs due to the driver initialization.

jakirkham commented 4 years ago

DGX-1. Were the previous measurements all on DGX-1s or did some of them use other machines?

jakirkham commented 4 years ago

FWIW the import time appears to be the same if I restrict to a single device. Tested the following on a DGX-1.

All GPUs:

$ ipython
Python 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 23:03:20) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.14.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: %time import cudf                                                       
CPU times: user 4.28 s, sys: 4.02 s, total: 8.29 s
Wall time: 2.12 s

In [2]: cudf._cuda.gpu.getDeviceCount()                                         
Out[2]: 8

One GPU:

$ CUDA_VISIBLE_DEVICES=0 ipython
Python 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 23:03:20) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.14.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: %time import cudf                                                       
CPU times: user 4.28 s, sys: 3.57 s, total: 7.85 s
Wall time: 2.14 s

In [2]: cudf._cuda.gpu.getDeviceCount()                                         
Out[2]: 1

Again there is some variability above and below 2s. However this is representative of what I'm seeing.

harrism commented 4 years ago

WIth 0.14 and legacy removed, I thought this might have improved... I tried on my linux desktop with a single V100 GPU and importing cudf took 776 ms. But I also tried running in a container on a DGX-1 with only one of the V100s visible and it took 15.5s! Could being in a container make it slow?

kkraus14 commented 4 years ago

WIth 0.14 and legacy removed, I thought this might have improved... I tried on my linux desktop with a single V100 GPU and importing cudf took 776 ms. But I also tried running in a container on a DGX-1 with only one of the V100s visible and it took 15.5s! Could being in a container make it slow?

That may be that the DGX didn't have persistence mode enabled and it took a long time for the driver to initialize because the GPUs had to "wake up".

harrism commented 4 years ago

Persistence mode is on.

harrism commented 4 years ago

More data points:

%time from numba import cuda: 1.04s
%time import rmm: 1.29s

I tried importing cudf on the same DGX outside of my rapids-compose container, but presumably my installation isn't quite right because I'm building from source, and it couldn't find a numba submodule. HOWEVER, before it hit that error, I counted for about 12 seconds. That numba loading was inside RMM, which is imported at the beginning of cuDF's __init__.py.

Not sure if that tells us anything.

lmeyerov commented 2 years ago

It adds up on real rapids codes, e.g., 4s for an empty cudf df

Some observations:

pandas is 1+ s on its own and maintainers seem disinterested: https://github.com/pandas-dev/pandas/issues/7282#issuecomment-442637703
cudf is now 2.7+ s import, and add another 1-2s for even an empty df . So 1s blameable on pandas, but the other ~ 3s+ seems fair game..
dask_cudf adds another 0.5+ s for import, and I didn't measure importing a client and connecting to an existing cluster, but presumably much more

We've been chasing down why we're seeing 20s-60s starts for a boring rapids web app init that is just setting up routes and running some cudf + udf warmup routines, so thought this may help. It's been frustrating for production b/c slows down first start + autorestarts. More nice-to-have for us, but maybe more important for others, are of course local dev + testing, and longer-term, precluding fast cold starts for GPU FaaS.

A few more numbers:

(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "1"

real    0m0.047s
user    0m0.043s
sys 0m0.004s
(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "import logging"

real    0m0.080s
user    0m0.068s
sys 0m0.012s
(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "import pandas"

real    0m1.011s
user    0m1.059s
sys 0m0.626s
(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "import cudf"

real    0m2.692s
user    0m2.607s
sys 0m0.811s
(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "import cudf; cudf.DataFrame({'x': []})"

real    0m3.558s
user    0m3.211s
sys 0m1.060s
(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "import dask_cudf; import cudf; cudf.DataFrame({'x': []})"

real    0m4.015s
user    0m3.547s
sys 0m0.794s

And related to the above:

(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "from numba import cuda"

real    0m1.233s
user    0m1.218s
sys 0m0.757s
(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "import rmm"

real    0m1.678s
user    0m1.688s
sys 0m0.738s

quasiben commented 2 years ago

I spent a little time with profiling the import with snakeviz. gpu_utils is taking close to ~2.5 seconds and a big chunk of that is from getDeviceCount. @shwina do you have any thoughts here ?

shwina commented 2 years ago

Hmm - nothing sticks out to me in the implementation: https://github.com/NVIDIA/cuda-python/blob/746b773c91e1ede708fe9a584b8cdb1c0f32b51d/cuda/_lib/ccudart/ccudart.pyx#L1458-L1466

I suspect the overhead we're seeing here is from the initialization of the CUDA context.

shwina commented 2 years ago

I can't think of a great way around this. It sounds like the lower bound to our import time is the time for CUDA context initialization, which apparently scales as the square of the number of GPUs.

validate_setup sounds like something we definitely want to do at import time (not later), but we could potentially expose something like an environment variable that allows skipping the check altogether? Would that be useful?

Note that regardless, the CUDA context will eventually need to be initialized at some point if we're going to use the GPU for anything useful.

lmeyerov commented 2 years ago

Googling around suggests C++-level cuda context creation is expected to be closer to ~100ms than 1s+. We're all python nowadays so I don't have any C++ handy for seeing if that's still true or maybe there's a different notion of context. (My tests are single GPU, not DGX.)

Maybe also something about how cpython links stuff at runtime? I've been thinking we might have python import issues around path search or native imports / runtime linking, but haven't dug deep on that.

jakirkham commented 2 years ago

Are your workloads being run on the cloud? If so, it is possible that these are being run on multi-tenant nodes, which have more GPUs than are actually given to any one user. If this is the case, something to play with potentially would be restricting the CUDA_VISIBLE_DEVICES as done above to see if that has an effect. This would provide some clue as to whether or not this is the issue.

lmeyerov commented 2 years ago

I've seen for various local single GPUs, not just cloud: Local, windows -> wsl2 -> ubuntu docker -> rtx3080, local ubuntu docker -> rtx3080, and local ubuntu docker -> some old geforce.

Happy to test any cpp/py...

GregoryKimball commented 2 years ago

@vyasr Is there a reasonable way to add cudf import time tracking to the python benchmarks?

vyasr commented 2 years ago

I think that would be valuable. The issue is that a lot of the slow parts of the import have to do with initialization logic that only occurs the first time that you import cudf. For example:

In [1]: %timeit import cudf
The slowest run took 22.42 times longer than the fastest. This could mean that an intermediate result is being cached.
670 ns ± 1.19 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [2]: %timeit import cudf
77 ns ± 0.189 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Since cudf gets imported as part of the collection process for pytest and at the top every module, we'd be obscuring most of that overhead if we tried to incorporate it as part of the same benchmarking suite. If we want to benchmark this, we should just have a separate and very simple script that just times the import and exits.

If we really care about accuracy it would be a bit more involved. We would need the script to do something like launch subprocesses (probably serially to avoid any contention issues around NVIDIA driver context creation) that each run the import command and then collect the results. I don't think we need to go that far, though. Even very rough timings will probably tell us what we need.

jakirkham commented 1 year ago

Something we might also consider is using this lazy loader framework

https://github.com/scientific-python/lazy_loader

jrhemstad commented 1 year ago

I suspect the addition of CUDA Lazy Loading should dramatically improve this situation. This was added in 11.7 and has been improved in subsequent releases.

This is currently an optional feature and can be enabled by specifying the environment variable CUDA_MODULE_LOADING=LAZY at runtime.

vyasr commented 1 year ago

Agreed, CUDA lazy loading should help, although not sure how much. Last I checked import time was distributed across dependencies as well (numba, cupy, rmm) so we'll need to do some careful profiling again with lazy loading enabled to see what else those libraries are doing and how slow other parts of their import are (as well as the validation that cudf does on import, which adds significant overhead).

wence- commented 1 year ago

I had a brief look with pyinstrument:

$ pyinstrument importcudf.py

  _     ._   __/__   _ _  _  _ _/_   Recorded: 16:40:36  Samples:  1201
 /_//_/// /_\ / //_// / //_'/ //     Duration: 2.298     CPU time: 1.758
/   _/                      v4.4.0

Program: importcudf.py

2.297 <module>  importcudf.py:1
└─ 2.297 <module>  cudf/__init__.py:1
   ├─ 1.064 validate_setup  cudf/utils/gpu_utils.py:4
   │  ├─ 0.977 <module>  rmm/__init__.py:1
   │  │  ├─ 0.613 <module>  rmm/rmm.py:1
   │  │  │  ├─ 0.350 <module>  torch/__init__.py:1
   │  │  │  │     [617 frames hidden]  torch, <built-in>, typing, collection...
   │  │  │  └─ 0.261 <module>  cupy/__init__.py:1
   │  │  │        [668 frames hidden]  cupy, pytest, _pytest, attr, <built-i...
   │  │  ├─ 0.330 <module>  rmm/mr.py:1
   │  │  │  └─ 0.330 <module>  rmm/_lib/__init__.py:1
   │  │  │     ├─ 0.280 <module>  numba/__init__.py:1
   │  │  │     │     [739 frames hidden]  numba, numpy, re, sre_compile, sre_pa...
   │  │  │     └─ 0.047 <module>  numba/cuda/__init__.py:1
   │  │  │           [163 frames hidden]  numba, asyncio, ssl, enum, <built-in>...
   │  │  └─ 0.034 get_versions  rmm/_version.py:515
   │  │     └─ 0.034 git_pieces_from_vcs  rmm/_version.py:234
   │  │        └─ 0.034 run_command  rmm/_version.py:71
   │  │           └─ 0.027 Popen.communicate  subprocess.py:1110
   │  │                 [7 frames hidden]  subprocess, <built-in>, selectors
   │  └─ 0.031 getDeviceCount  rmm/_cuda/gpu.py:91
   ├─ 0.390 _setup_numba_linker  cudf/core/udf/utils.py:388
   │  └─ 0.390 safe_get_versions  ptxcompiler/patch.py:212
   │        [12 frames hidden]  ptxcompiler, subprocess, selectors, <...
   ├─ 0.308 <module>  cudf/api/__init__.py:1
   │  └─ 0.301 <module>  cudf/api/extensions/__init__.py:1
   │     └─ 0.301 <module>  cudf/api/extensions/accessor.py:1
   │        └─ 0.300 <module>  pandas/__init__.py:1
   │              [694 frames hidden]  pandas, pyarrow, copy, <built-in>, te...
   ├─ 0.280 <module>  cudf/datasets.py:1
   │  ├─ 0.253 <module>  cudf/_lib/__init__.py:1
   │  │  ├─ 0.124 <module>  cudf/utils/ioutils.py:1
   │  │  │  ├─ 0.096 <module>  pyarrow/fs.py:1
   │  │  │  │     [4 frames hidden]  pyarrow, <built-in>
   │  │  │  └─ 0.026 <module>  fsspec/__init__.py:1
   │  │  │        [79 frames hidden]  fsspec, importlib, pathlib, <built-in...
   │  │  ├─ 0.079 [self]  None
   │  │  └─ 0.043 <module>  cudf/_lib/strings/__init__.py:1
   │  └─ 0.024 <module>  cudf/core/column_accessor.py:1
   │     └─ 0.024 <module>  cudf/core/column/__init__.py:1
   ├─ 0.135 <module>  cudf/core/algorithms.py:1
   │  ├─ 0.097 <module>  cudf/core/indexed_frame.py:1
   │  │  └─ 0.090 <module>  cudf/core/groupby/__init__.py:1
   │  │     └─ 0.090 <module>  cudf/core/groupby/groupby.py:1
   │  │        ├─ 0.053 <module>  cudf/core/udf/groupby_utils.py:1
   │  │        │  └─ 0.053 _get_ptx_file  cudf/core/udf/utils.py:291
   │  │        │     └─ 0.053 get_current_device  numba/cuda/api.py:433
   │  │        │           [10 frames hidden]  numba
   │  │        └─ 0.031 <module>  cudf/core/udf/__init__.py:1
   │  │           └─ 0.027 <module>  cudf/core/udf/row_function.py:1
   │  └─ 0.031 <module>  cudf/core/index.py:1
   │     └─ 0.025 Float32Index.__init_subclass__  cudf/core/mixins/mixin_factory.py:212
   │        └─ 0.023 TimedeltaIndex._should_define_operation  cudf/core/mixins/mixin_factory.py:77
   └─ 0.086 get_versions  cudf/_version.py:515
      └─ 0.085 git_pieces_from_vcs  cudf/_version.py:235
         └─ 0.085 run_command  cudf/_version.py:72
            └─ 0.077 Popen.communicate  subprocess.py:1110
                  [7 frames hidden]  subprocess, <built-in>, selectors

To view this report with different options, run:
    pyinstrument --load-prev 2023-02-16T16-40-36 [options]

CUDA_MODULE_LOADING=LAZY doesn't really seem to help (I have host driver == cuda-12, and cuda-11.8 in a rapids-compose environment).

Adding RAPIDS_NO_INITIALIZE=1 just moves some time around

2.277 <module>  importcudf.py:1
└─ 2.277 <module>  cudf/__init__.py:1
   ├─ 0.453 <module>  rmm/__init__.py:1
   │  ├─ 0.352 <module>  rmm/rmm.py:1
   │  │  └─ 0.351 <module>  torch/__init__.py:1
   │  │        [663 frames hidden]  torch, <built-in>, typing, enum, pick...
   │  ├─ 0.067 <module>  rmm/mr.py:1
   │  │  └─ 0.067 <module>  rmm/_lib/__init__.py:1
   │  │     └─ 0.045 __new__  enum.py:180
   │  │           [15 frames hidden]  enum, <built-in>
   │  └─ 0.034 get_versions  rmm/_version.py:515
   │     └─ 0.034 git_pieces_from_vcs  rmm/_version.py:234
   │        └─ 0.034 run_command  rmm/_version.py:71
   │           └─ 0.027 Popen.communicate  subprocess.py:1110
   │                 [7 frames hidden]  subprocess, <built-in>, selectors
   ├─ 0.407 _setup_numba_linker  cudf/core/udf/utils.py:388
   │  └─ 0.407 safe_get_versions  ptxcompiler/patch.py:212
   │        [11 frames hidden]  ptxcompiler, subprocess, selectors, <...
   ├─ 0.364 <module>  cupy/__init__.py:1
   │     [1068 frames hidden]  cupy, pytest, _pytest, <built-in>, at...
   ├─ 0.356 <module>  cudf/api/__init__.py:1
   │  └─ 0.348 <module>  cudf/api/extensions/__init__.py:1
   │     └─ 0.348 <module>  cudf/api/extensions/accessor.py:1
   │        └─ 0.346 <module>  pandas/__init__.py:1
   │              [700 frames hidden]  pandas, pyarrow, <built-in>, textwrap...
   ├─ 0.286 <module>  cudf/datasets.py:1
   │  ├─ 0.258 <module>  cudf/_lib/__init__.py:1
   │  │  ├─ 0.126 <module>  cudf/utils/ioutils.py:1
   │  │  │  ├─ 0.097 <module>  pyarrow/fs.py:1
   │  │  │  │     [4 frames hidden]  pyarrow, <built-in>
   │  │  │  └─ 0.027 <module>  fsspec/__init__.py:1
   │  │  │        [86 frames hidden]  fsspec, importlib, pathlib, <built-in...
   │  │  ├─ 0.084 [self]  None
   │  │  └─ 0.044 <module>  cudf/_lib/strings/__init__.py:1
   │  └─ 0.025 <module>  cudf/core/column_accessor.py:1
   │     └─ 0.024 <module>  cudf/core/column/__init__.py:1
   ├─ 0.183 <module>  numba/__init__.py:1
   │     [495 frames hidden]  numba, llvmlite, ctypes, <built-in>, ...
   ├─ 0.085 get_versions  cudf/_version.py:515
   │  └─ 0.084 git_pieces_from_vcs  cudf/_version.py:235
   │     └─ 0.084 run_command  cudf/_version.py:72
   │        └─ 0.076 Popen.communicate  subprocess.py:1110
   │              [7 frames hidden]  subprocess, <built-in>, selectors
   ├─ 0.081 <module>  cudf/core/algorithms.py:1
   │  ├─ 0.042 <module>  cudf/core/indexed_frame.py:1
   │  │  └─ 0.036 <module>  cudf/core/groupby/__init__.py:1
   │  │     └─ 0.036 <module>  cudf/core/groupby/groupby.py:1
   │  │        └─ 0.031 <module>  cudf/core/udf/__init__.py:1
   │  │           └─ 0.027 <module>  cudf/core/udf/row_function.py:1
   │  └─ 0.032 <module>  cudf/core/index.py:1
   │     └─ 0.024 UInt64Index.__init_subclass__  cudf/core/mixins/mixin_factory.py:212
   │        └─ 0.024 UInt64Index._should_define_operation  cudf/core/mixins/mixin_factory.py:77
   │           └─ 0.024 dir  None
   │                 [2 frames hidden]  <built-in>
   └─ 0.026 <module>  numba/cuda/__init__.py:1
         [86 frames hidden]  numba, <built-in>, collections, llvml...

Numba forks a subprocess to compile some cuda code to determine if it needs to patch the ptx linker. We can turn that off

1.895 <module>  importcudf.py:1
└─ 1.895 <module>  cudf/__init__.py:1
   ├─ 0.468 <module>  rmm/__init__.py:1
   │  ├─ 0.359 <module>  rmm/rmm.py:1
   │  │  └─ 0.358 <module>  torch/__init__.py:1
   │  │        [687 frames hidden]  torch, <built-in>, typing, enum, insp...
   │  ├─ 0.073 <module>  rmm/mr.py:1
   │  │  └─ 0.072 <module>  rmm/_lib/__init__.py:1
   │  │     └─ 0.052 __new__  enum.py:180
   │  │           [23 frames hidden]  enum, <built-in>
   │  └─ 0.036 get_versions  rmm/_version.py:515
   │     └─ 0.036 git_pieces_from_vcs  rmm/_version.py:234
   │        └─ 0.036 run_command  rmm/_version.py:71
   │           └─ 0.029 Popen.communicate  subprocess.py:1110
   │                 [7 frames hidden]  subprocess, <built-in>, selectors
   ├─ 0.363 <module>  cupy/__init__.py:1
   │     [1034 frames hidden]  cupy, pytest, _pytest, <built-in>, at...
   ├─ 0.362 <module>  cudf/api/__init__.py:1
   │  └─ 0.354 <module>  cudf/api/extensions/__init__.py:1
   │     └─ 0.354 <module>  cudf/api/extensions/accessor.py:1
   │        └─ 0.353 <module>  pandas/__init__.py:1
   │              [722 frames hidden]  pandas, pyarrow, textwrap, <built-in>...
   ├─ 0.284 <module>  cudf/datasets.py:1
   │  ├─ 0.257 <module>  cudf/_lib/__init__.py:1
   │  │  ├─ 0.127 <module>  cudf/utils/ioutils.py:1
   │  │  │  ├─ 0.098 <module>  pyarrow/fs.py:1
   │  │  │  │     [4 frames hidden]  pyarrow, <built-in>
   │  │  │  └─ 0.028 <module>  fsspec/__init__.py:1
   │  │  │        [97 frames hidden]  fsspec, importlib, pathlib, <built-in...
   │  │  ├─ 0.082 [self]  None
   │  │  └─ 0.043 <module>  cudf/_lib/strings/__init__.py:1
   │  └─ 0.024 <module>  cudf/core/column_accessor.py:1
   │     └─ 0.024 <module>  cudf/core/column/__init__.py:1
   ├─ 0.194 <module>  numba/__init__.py:1
   │     [490 frames hidden]  numba, llvmlite, ctypes, <built-in>, ...
   ├─ 0.081 <module>  cudf/core/algorithms.py:1
   │  ├─ 0.043 <module>  cudf/core/indexed_frame.py:1
   │  │  └─ 0.036 <module>  cudf/core/groupby/__init__.py:1
   │  │     └─ 0.036 <module>  cudf/core/groupby/groupby.py:1
   │  │        └─ 0.031 <module>  cudf/core/udf/__init__.py:1
   │  │           └─ 0.027 <module>  cudf/core/udf/row_function.py:1
   │  │              └─ 0.021 <module>  cudf/core/udf/utils.py:1
   │  └─ 0.032 <module>  cudf/core/index.py:1
   │     └─ 0.024 GenericIndex.__init_subclass__  cudf/core/mixins/mixin_factory.py:212
   │        └─ 0.022 GenericIndex._should_define_operation  cudf/core/mixins/mixin_factory.py:77
   │           └─ 0.020 dir  None
   │                 [2 frames hidden]  <built-in>
   ├─ 0.078 get_versions  cudf/_version.py:515
   │  └─ 0.078 git_pieces_from_vcs  cudf/_version.py:235
   │     └─ 0.078 run_command  cudf/_version.py:72
   │        └─ 0.071 Popen.communicate  subprocess.py:1110
   │              [7 frames hidden]  subprocess, <built-in>, selectors
   ├─ 0.029 <module>  numba/cuda/__init__.py:1
   │     [110 frames hidden]  numba, collections, re, types, <built...
   └─ 0.022 <module>  cudf/io/__init__.py:1

RMM imports pytorch if it's in the environment

try:
    from torch.cuda.memory import CUDAPluggableAllocator
except ImportError:
    rmm_torch_allocator = None
else:
    import rmm._lib.torch_allocator

    _alloc_free_lib_path = rmm._lib.torch_allocator.__file__
    rmm_torch_allocator = CUDAPluggableAllocator(
        _alloc_free_lib_path,
        alloc_fn_name="allocate",
        free_fn_name="deallocate",
    )

Let's suppose we fix that with lazy-loading.

1.469 <module>  importcudf.py:1
└─ 1.467 <module>  cudf/__init__.py:1
   ├─ 0.360 <module>  cupy/__init__.py:1
   │     [1023 frames hidden]  cupy, pytest, _pytest, <built-in>, at...
   ├─ 0.282 <module>  cudf/api/__init__.py:1
   │  └─ 0.274 <module>  cudf/api/extensions/__init__.py:1
   │     └─ 0.274 <module>  cudf/api/extensions/accessor.py:1
   │        └─ 0.272 <module>  pandas/__init__.py:1
   │              [695 frames hidden]  pandas, pyarrow, <built-in>, textwrap...
   ├─ 0.275 <module>  cudf/datasets.py:1
   │  ├─ 0.249 <module>  cudf/_lib/__init__.py:1
   │  │  ├─ 0.125 <module>  cudf/utils/ioutils.py:1
   │  │  │  ├─ 0.096 <module>  pyarrow/fs.py:1
   │  │  │  │     [4 frames hidden]  pyarrow, <built-in>
   │  │  │  └─ 0.028 <module>  fsspec/__init__.py:1
   │  │  │        [91 frames hidden]  fsspec, importlib, pathlib, <built-in...
   │  │  ├─ 0.077 [self]  None
   │  │  └─ 0.042 <module>  cudf/_lib/strings/__init__.py:1
   │  └─ 0.025 <module>  cudf/core/column_accessor.py:1
   │     └─ 0.024 <module>  cudf/core/column/__init__.py:1
   ├─ 0.182 <module>  numba/__init__.py:1
   │     [446 frames hidden]  numba, llvmlite, ctypes, <built-in>, ...
   ├─ 0.118 <module>  cudf/core/algorithms.py:1
   │  ├─ 0.070 <module>  cudf/core/index.py:1
   │  │  ├─ 0.041 <module>  cudf/core/frame.py:1
   │  │  │  └─ 0.039 Frame  cudf/core/frame.py:51
   │  │  │     └─ 0.039 _cudf_nvtx_annotate  cudf/utils/utils.py:385
   │  │  │        └─ 0.039 annotate.__call__  nvtx/nvtx.py:89
   │  │  │              [2 frames hidden]  nvtx
   │  │  └─ 0.025 Float64Index.__init_subclass__  cudf/core/mixins/mixin_factory.py:212
   │  │     └─ 0.024 Float64Index._should_define_operation  cudf/core/mixins/mixin_factory.py:77
   │  │        └─ 0.023 dir  None
   │  │              [2 frames hidden]  <built-in>
   │  └─ 0.041 <module>  cudf/core/indexed_frame.py:1
   │     └─ 0.035 <module>  cudf/core/groupby/__init__.py:1
   │        └─ 0.035 <module>  cudf/core/groupby/groupby.py:1
   │           └─ 0.030 <module>  cudf/core/udf/__init__.py:1
   │              └─ 0.025 <module>  cudf/core/udf/row_function.py:1
   │                 └─ 0.020 <module>  cudf/core/udf/utils.py:1
   ├─ 0.102 <module>  rmm/__init__.py:1
   │  ├─ 0.066 <module>  rmm/mr.py:1
   │  │  └─ 0.066 <module>  rmm/_lib/__init__.py:1
   │  │     └─ 0.045 __new__  enum.py:180
   │  │           [16 frames hidden]  enum, <built-in>
   │  └─ 0.034 get_versions  rmm/_version.py:515
   │     └─ 0.034 git_pieces_from_vcs  rmm/_version.py:234
   │        └─ 0.034 run_command  rmm/_version.py:71
   │           └─ 0.027 Popen.communicate  subprocess.py:1110
   │                 [7 frames hidden]  subprocess, <built-in>, selectors
   ├─ 0.088 get_versions  cudf/_version.py:515
   │  └─ 0.087 git_pieces_from_vcs  cudf/_version.py:235
   │     └─ 0.087 run_command  cudf/_version.py:72
   │        └─ 0.080 Popen.communicate  subprocess.py:1110
   │              [7 frames hidden]  subprocess, <built-in>, selectors
   ├─ 0.026 <module>  numba/cuda/__init__.py:1
   │     [100 frames hidden]  numba, <built-in>, llvmlite, inspect,...
   └─ 0.021 <module>  cudf/io/__init__.py:1

So now the largest single cost is importing cupy, which seems unavoidable. With a little bit of trimming one might get down to 1s.

lmeyerov commented 1 year ago

It takes an ecosystem with each lib chiseling away at its part, so worth it?

We think of scenarios like CPU CI, lazy load, and pay-as-you-go for partial surface use..

rapidsai / cudf

[BUG] Long import times #627