mrocklin opened this issue 5 years ago
In [1]: %time import cudf
CPU times: user 556 ms, sys: 204 ms, total: 760 ms
Wall time: 3.39 s
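For anyone reproducing this, a profile of the shape below can be generated with cProfile; a minimal sketch (the exact invocation used here isn't shown in the thread):
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
import cudf  # noqa: E402  # the import we want to profile
profiler.disable()
pstats.Stats(profiler).sort_stats("tottime").print_stats(35)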
Looking at the profile, it looks like we're doing a lot of odd things at import time:
367424 function calls (356126 primitive calls) in 3.046 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
572 0.610 0.001 0.621 0.001 <frozen importlib._bootstrap_external>:914(get_data)
3250 0.605 0.000 0.605 0.000 {built-in method posix.stat}
3 0.573 0.191 0.573 0.191 {method 'read' of '_io.BufferedReader' objects}
599 0.296 0.000 0.296 0.000 {built-in method io.open}
597 0.256 0.000 0.256 0.000 {method 'close' of '_io.BufferedReader' objects}
74/73 0.079 0.001 0.082 0.001 {built-in method _imp.create_dynamic}
572 0.076 0.000 0.076 0.000 {built-in method marshal.loads}
1 0.040 0.040 0.041 0.041 nvvm.py:106(__new__)
82 0.039 0.000 0.039 0.000 {built-in method posix.listdir}
2000/1992 0.028 0.000 0.075 0.000 {built-in method builtins.__build_class__}
547 0.019 0.000 0.019 0.000 ffi.py:106(__call__)
2 0.017 0.008 0.017 0.008 {method 'dlopen' of 'CompiledFFI' objects}
74/59 0.015 0.000 0.901 0.015 {built-in method _imp.exec_dynamic}
921 0.015 0.000 0.015 0.000 {method 'sub' of 're.Pattern' objects}
613 0.013 0.000 0.013 0.000 {method 'findall' of 're.Pattern' objects}
572 0.011 0.000 0.011 0.000 {method 'read' of '_io.FileIO' objects}
1093 0.010 0.000 0.400 0.000 <frozen importlib._bootstrap_external>:1356(find_spec)
7 0.009 0.001 0.009 0.001 {built-in method posix.read}
4 0.009 0.002 0.009 0.002 {built-in method _posixsubprocess.fork_exec}
572 0.009 0.000 0.732 0.001 <frozen importlib._bootstrap_external>:793(get_code)
1 0.008 0.008 0.008 0.008 {method 'dot' of 'numpy.ndarray' objects}
211 0.008 0.000 0.008 0.000 sre_compile.py:276(_optimize_charset)
820 0.007 0.000 0.421 0.001 <frozen importlib._bootstrap>:882(_find_spec)
592 0.006 0.000 0.586 0.001 __init__.py:78(open_resource)
4 0.006 0.001 0.006 0.001 {built-in method _ctypes.dlopen}
8994 0.006 0.000 0.006 0.000 {built-in method builtins.hasattr}
Profiled this a bit today and on my end it seems like the issue is inside cudf._cuda.gpu.getDeviceCount().
Profiled this a bit today and on my end it seems like the issue is inside cudf._cuda.gpu.getDeviceCount().
I doubt the problem is with that function. My guess is that's just the first function that invokes a CUDA runtime API, which initializes the context and does other first-time setup.
Profiled this a bit today and on my end it seems like the issue is inside cudf._cuda.gpu.getDeviceCount().
Could you dump the profile here by any chance?
Here's the top piece of the profile.
785450 function calls (763821 primitive calls) in 5.156 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 3.047 3.047 3.047 3.047 {cudf._cuda.gpu.getDeviceCount}
120 0.358 0.003 0.358 0.003 {method 'read' of '_io.BufferedReader' objects}
945 0.293 0.000 0.307 0.000 <frozen importlib._bootstrap_external>:914(get_data)
5895 0.280 0.000 0.280 0.000 {built-in method posix.stat}
204/202 0.143 0.001 0.146 0.001 {built-in method _imp.create_dynamic}
945 0.109 0.000 0.109 0.000 {built-in method marshal.loads}
506 0.089 0.000 0.089 0.000 {built-in method posix.listdir}
204/130 0.045 0.000 0.161 0.001 {built-in method _imp.exec_dynamic}
2851/2818 0.038 0.000 0.133 0.000 {built-in method builtins.__build_class__}
4 0.025 0.006 0.025 0.006 {built-in method _ctypes.dlopen}
518 0.024 0.000 0.024 0.000 {built-in method builtins.compile}
1826 0.016 0.000 0.312 0.000 <frozen importlib._bootstrap_external>:1356(find_spec)
561 0.015 0.000 0.015 0.000 ffi.py:112(__call__)
132 0.015 0.000 0.015 0.000 {built-in method io.open}
1145 0.014 0.000 0.014 0.000 {method 'sub' of 're.Pattern' objects}
945 0.014 0.000 0.014 0.000 {method 'read' of '_io.FileIO' objects}
8 0.013 0.002 0.013 0.002 {built-in method posix.read}
805/193 0.013 0.000 0.035 0.000 sre_parse.py:469(_parse)
5 0.012 0.002 0.012 0.002 {built-in method _posixsubprocess.fork_exec}
945 0.011 0.000 0.460 0.000 <frozen importlib._bootstrap_external>:793(get_code)
614 0.010 0.000 0.012 0.000 sre_compile.py:276(_optimize_charset)
501 0.010 0.000 0.010 0.000 {method 'findall' of 're.Pattern' objects}
1215 0.009 0.000 0.333 0.000 <frozen importlib._bootstrap>:882(_find_spec)
16074/16069 0.008 0.000 0.008 0.000 {built-in method builtins.hasattr}
10147 0.008 0.000 0.020 0.000 <frozen importlib._bootstrap_external>:56(_path_join)
18447 0.008 0.000 0.009 0.000 {built-in method builtins.getattr}
1343/1 0.007 0.000 5.158 5.158 {built-in method builtins.exec}
1578 0.007 0.000 0.021 0.000 version.py:198(__init__)
114 0.007 0.000 0.021 0.000 __init__.py:1617(_get)
3039 0.007 0.000 0.010 0.000 <frozen importlib._bootstrap>:157(_get_module_lock)
5 0.007 0.001 0.007 0.001 {built-in method posix.waitpid}
3 0.007 0.002 0.008 0.003 six.py:1(<module>)
51125 0.007 0.000 0.008 0.000 {built-in method builtins.isinstance}
1890 0.007 0.000 0.015 0.000 <frozen importlib._bootstrap_external>:271(cache_from_source)
15643/15615 0.007 0.000 0.011 0.000 abstract.py:121(__hash__)
10147 0.007 0.000 0.009 0.000 <frozen importlib._bootstrap_external>:58(<listcomp>)
2850 0.006 0.000 0.006 0.000 {built-in method __new__ of type object at 0x55ed34c9f240}
349 0.006 0.000 0.007 0.000 templates.py:614(make_overload_template)
1239/1 0.006 0.000 5.158 5.158 <frozen importlib._bootstrap>:978(_find_and_load)
3039 0.006 0.000 0.006 0.000 <frozen importlib._bootstrap>:78(acquire)
1173/1 0.006 0.000 5.157 5.157 <frozen importlib._bootstrap>:663(_load_unlocked)
18919 0.006 0.000 0.006 0.000 sre_parse.py:233(__next)
1161 0.005 0.000 0.026 0.000 <frozen importlib._bootstrap>:504(_init_module_attrs)
1337/180 0.005 0.000 0.018 0.000 sre_compile.py:71(_compile)
1239/1 0.005 0.000 5.158 5.158 <frozen importlib._bootstrap>:948(_find_and_load_unlocked)
14699/14694 0.005 0.000 0.006 0.000 {method 'join' of 'str' objects}
40 0.005 0.000 0.023 0.001 castgraph.py:94(propagate)
3039 0.005 0.000 0.005 0.000 <frozen importlib._bootstrap>:103(release)
1149 0.005 0.000 0.006 0.000 <frozen importlib._bootstrap_external>:574(spec_from_file_location)
6132/741 0.004 0.000 1.570 0.002 <frozen importlib._bootstrap>:1009(_handle_fromlist)
31763 0.004 0.000 0.004 0.000 {method 'startswith' of 'str' objects}
10345 0.004 0.000 0.004 0.000 <frozen importlib._bootstrap>:222(_verbose_message)
1208 0.004 0.000 0.317 0.000 <frozen importlib._bootstrap_external>:1240(_get_spec)
16689 0.004 0.000 0.009 0.000 sre_parse.py:254(get)
42 0.004 0.000 0.008 0.000 enum.py:135(__new__)
77 0.004 0.000 0.009 0.000 __init__.py:316(namedtuple)
790 0.003 0.000 0.004 0.000 version.py:343(_cmpkey)
1607 0.003 0.000 0.003 0.000 {method 'search' of 're.Pattern' objects}
945 0.003 0.000 0.005 0.000 <frozen importlib._bootstrap_external>:438(_classify_pyc)
10024 0.003 0.000 0.003 0.000 {method 'rpartition' of 'str' objects}
945/1 0.003 0.000 5.157 5.157 <frozen importlib._bootstrap_external>:722(exec_module)
945 0.003 0.000 0.115 0.000 <frozen importlib._bootstrap_external>:523(_compile_bytecode)
2487 0.003 0.000 0.227 0.000 <frozen importlib._bootstrap_external>:84(_path_is_mode_type)
13657 0.003 0.000 0.006 0.000 {method 'get' of 'dict' objects}
23704/22175 0.003 0.000 0.003 0.000 {built-in method builtins.len}
2986/2855 0.003 0.000 0.004 0.000 {method 'format' of 'str' objects}
680 0.003 0.000 0.150 0.000 __init__.py:2126(distributions_from_metadata)
22762 0.003 0.000 0.003 0.000 {method 'rstrip' of 'str' objects}
9152 0.003 0.000 0.004 0.000 sre_parse.py:164(__getitem__)
5258 0.003 0.000 0.250 0.000 <frozen importlib._bootstrap_external>:74(_path_stat)
689 0.003 0.000 0.003 0.000 functions.py:87(__init__)
1149 0.003 0.000 0.010 0.000 <frozen importlib._bootstrap_external>:1351(_get_spec)
1161 0.003 0.000 0.005 0.000 <frozen importlib._bootstrap>:318(__exit__)
29445 0.003 0.000 0.003 0.000 {method 'append' of 'list' objects}
17714 0.003 0.000 0.003 0.000 abstract.py:99(key)
1161/1158 0.003 0.000 0.177 0.000 <frozen importlib._bootstrap>:576(module_from_spec)
346 0.003 0.000 0.063 0.000 __init__.py:2585(from_location)
1568/435 0.003 0.000 0.003 0.000 sre_parse.py:174(getwidth)
501 0.003 0.000 0.029 0.000 textwrap.py:414(dedent)
1154 0.003 0.000 0.069 0.000 re.py:271(_compile)
2 0.002 0.001 0.002 0.001 {method 'poll' of 'select.poll' objects}
3626 0.002 0.000 0.004 0.000 typing.py:704(__setattr__)
776 0.002 0.000 0.004 0.000 functools.py:37(update_wrapper)
7138 0.002 0.000 0.003 0.000 {built-in method builtins.setattr}
788 0.002 0.000 0.006 0.000 version.py:131(_legacy_cmpkey)
1800 0.002 0.000 0.010 0.000 <frozen importlib._bootstrap>:194(_lock_unlock_module)
131 0.002 0.000 0.003 0.000 templates.py:816(make_overload_attribute_template)
923 0.002 0.000 0.003 0.000 posixpath.py:75(join)
18763/18019 0.002 0.000 0.002 0.000 {built-in method builtins.hash}
1149 0.002 0.000 0.012 0.000 <frozen importlib._bootstrap_external>:369(_get_cached)
150 0.002 0.000 0.005 0.000 <frozen importlib._bootstrap_external>:1190(_path_hooks)
945 0.002 0.000 0.002 0.000 {built-in method _imp._fix_co_filename}
3092 0.002 0.000 0.003 0.000 {built-in method builtins.any}
2519 0.002 0.000 0.003 0.000 <frozen importlib._bootstrap>:416(parent)
4454 0.002 0.000 0.002 0.000 {method 'match' of 're.Pattern' objects}
1890 0.002 0.000 0.003 0.000 <frozen importlib._bootstrap_external>:62(_path_split)
2032 0.002 0.000 0.007 0.000 castgraph.py:41(insert)
580/180 0.002 0.000 0.036 0.000 sre_parse.py:411(_parse_sub)
481 0.002 0.000 0.007 0.000 typing.py:603(__init__)
3434 0.002 0.000 0.003 0.000 version.py:114(_parse_version_parts)
8269 0.002 0.000 0.002 0.000 {method 'group' of 're.Match' objects}
2835 0.002 0.000 0.003 0.000 <frozen importlib._bootstrap_external>:51(_r_long)
2094 0.002 0.000 0.014 0.000 <frozen importlib._bootstrap>:403(cached)
454 0.002 0.000 0.019 0.000 __init__.py:1327(safe_version)
1226 0.002 0.000 0.002 0.000 <frozen importlib._bootstrap>:176(cb)
2306/2299 0.002 0.000 0.002 0.000 abstract.py:124(__eq__)
3816 0.002 0.000 0.002 0.000 <frozen importlib._bootstrap>:859(__exit__)
960 0.001 0.000 0.004 0.000 typing.py:113(_type_check)
612 0.001 0.000 0.002 0.000 _inspect.py:67(getargs)
4435 0.001 0.000 0.001 0.000 {method 'endswith' of 'str' objects}
2215 0.001 0.000 0.227 0.000 <frozen importlib._bootstrap_external>:93(_path_isfile)
706/50 0.001 0.000 1.575 0.031 {built-in method builtins.__import__}
1226 0.001 0.000 0.002 0.000 <frozen importlib._bootstrap>:58(__init__)
680/342 0.001 0.000 0.002 0.000 {built-in method _abc._abc_subclasscheck}
1025 0.001 0.000 0.001 0.000 {method 'split' of 're.Pattern' objects}
75 0.001 0.000 1.800 0.024 __init__.py:1(<module>)
1239 0.001 0.000 0.012 0.000 <frozen importlib._bootstrap>:147(__enter__)
152/132 0.001 0.000 0.023 0.000 {built-in method builtins.sorted}
1410/1302 0.001 0.000 0.018 0.000 typing.py:248(inner)
4300 0.001 0.000 0.002 0.000 version.py:65(_compare)
818/815 0.001 0.000 0.009 0.000 abstract.py:63(__call__)
945 0.001 0.000 0.002 0.000 <frozen importlib._bootstrap_external>:471(_validate_timestamp_pyc)
6786 0.001 0.000 0.001 0.000 {built-in method posix.fspath}
5184 0.001 0.000 0.001 0.000 {built-in method builtins.min}
8223 0.001 0.000 0.001 0.000 {built-in method _imp.acquire_lock}
1910 0.001 0.000 0.007 0.000 <frozen importlib._bootstrap_external>:1203(_path_importer_cache)
1388 0.001 0.000 0.001 0.000 <frozen importlib._bootstrap>:369(__init__)
346 0.001 0.000 0.003 0.000 __init__.py:686(add)
279 0.001 0.000 0.001 0.000 {built-in method _abc._abc_init}
1208 0.001 0.000 0.318 0.000 <frozen importlib._bootstrap_external>:1272(find_spec)
950 0.001 0.000 0.001 0.000 <frozen importlib._bootstrap>:35(_new_module)
3118 0.001 0.000 0.001 0.000 version.py:207(<genexpr>)
454 0.001 0.000 0.003 0.000 version.py:236(__str__)
886 0.001 0.000 0.002 0.000 __init__.py:2644(key)
2921 0.001 0.000 0.001 0.000 {built-in method from_bytes}
286 0.001 0.000 0.005 0.000 overrides.py:72(verify_matching_signatures)
1768/1 0.001 0.000 5.156 5.156 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
8223 0.001 0.000 0.001 0.000 {built-in method _imp.release_lock}
818 0.001 0.000 0.004 0.000 abstract.py:51(_intern)
945 0.001 0.000 0.001 0.000 <frozen importlib._bootstrap_external>:401(_check_name_wrapper)
590 0.001 0.000 0.041 0.000 __init__.py:2772(_get_metadata)
496 0.001 0.000 0.002 0.000 posixpath.py:154(dirname)
349 0.001 0.000 0.002 0.000 extending.py:57(overload)
584 0.001 0.000 0.002 0.000 enum.py:70(__setitem__)
344/238 0.001 0.000 0.009 0.000 typing.py:340(__getitem__)
3626 0.001 0.000 0.002 0.000 typing.py:590(_is_dunder)
66 0.001 0.000 0.001 0.000 templates.py:667(make_intrinsic_template)
3816 0.001 0.000 0.001 0.000 <frozen importlib._bootstrap>:855(__enter__)
349 0.001 0.000 0.012 0.000 extending.py:111(decorate)
1124 0.001 0.000 0.015 0.000 version.py:24(parse)
62 0.001 0.000 0.001 0.000 {built-in method cupy.core._kernel.create_ufunc}
360 0.001 0.000 0.180 0.000 __init__.py:2039(find_on_path)
1708 0.001 0.000 0.002 0.000 __init__.py:2071(dist_factory)
180 0.001 0.000 0.064 0.000 sre_compile.py:759(compile)
945 0.001 0.000 0.011 0.000 <frozen importlib._bootstrap_external>:951(path_stats)
1533/1102 0.001 0.000 0.001 0.000 typing.py:664(__hash__)
6095 0.001 0.000 0.001 0.000 {built-in method _thread.get_ident}
1161 0.001 0.000 0.001 0.000 <frozen importlib._bootstrap>:311(__enter__)
471 0.001 0.000 0.002 0.000 copy.py:268(_reconstruct)
2268 0.001 0.000 0.001 0.000 {method 'split' of 'str' objects}
150 0.001 0.000 0.025 0.000 <frozen importlib._bootstrap_external>:1404(_fill_cache)
8 0.001 0.000 0.001 0.000 {method 'read' of '_io.TextIOWrapper' objects}
2032 0.001 0.000 0.003 0.000 castgraph.py:50(get)
686 0.001 0.000 0.001 0.000 genericpath.py:121(_splitext)
2044 0.001 0.000 0.002 0.000 castgraph.py:67(__getitem__)
4297 0.001 0.000 0.001 0.000 sre_parse.py:249(match)
1239 0.001 0.000 0.004 0.000 <frozen importlib._bootstrap>:151(__exit__)
472 0.001 0.000 0.003 0.000 copy.py:66(copy)
316 0.001 0.000 0.026 0.000 overrides.py:166(decorator)
180 0.001 0.000 0.006 0.000 sre_compile.py:536(_compile_info)
1766 0.001 0.000 0.001 0.000 {method 'update' of 'dict' objects}
2247 0.001 0.000 0.001 0.000 {method 'rfind' of 'str' objects}
3058 0.001 0.000 0.002 0.000 {method 'add' of 'set' objects}
1208 0.001 0.000 0.001 0.000 {built-in method _imp.is_frozen}
807 0.001 0.000 0.002 0.000 ffi.py:56(__getattr__)
2229 0.001 0.000 0.003 0.000 abstract.py:127(__ne__)
355 0.001 0.000 0.001 0.000 ntpath.py:122(splitdrive)
686 0.001 0.000 0.002 0.000 posixpath.py:121(splitext)
213 0.001 0.000 0.001 0.000 {method 'splitlines' of 'str' objects}
335 0.001 0.000 0.001 0.000 pyparsing.py:1144(__init__)
4642 0.001 0.000 0.001 0.000 <frozen importlib._bootstrap>:321(<genexpr>)
4 0.001 0.000 0.001 0.000 {method 'readlines' of '_io._IOBase' objects}
1239 0.001 0.000 0.001 0.000 <frozen importlib._bootstrap>:143(__init__)
630 0.001 0.000 0.001 0.000 enum.py:376(__setattr__)
1080 0.001 0.000 0.001 0.000 {method 'extend' of 'list' objects}
355 0.001 0.000 0.003 0.000 __init__.py:1494(_validate_resource_path)
150 0.001 0.000 0.002 0.000 <frozen importlib._bootstrap_external>:1319(__init__)
614 0.001 0.000 0.001 0.000 sre_compile.py:249(_compile_charset)
2172 0.001 0.000 0.001 0.000 posixpath.py:41(_get_sep)
353 0.001 0.000 0.001 0.000 posixpath.py:144(basename)
945 0.001 0.000 0.001 0.000 <frozen importlib._bootstrap_external>:884(__init__)
2594 0.001 0.000 0.001 0.000 sre_parse.py:172(append)
616 0.001 0.000 0.003 0.000 _inspect.py:98(getargspec)
Was this on a DGX-1 as previously reported?
The first CUDA call will initialise the context, as Jake pointed out. The time will be higher on multi-GPU machines. There is also a cost that is proportional to the sysmem size, I believe.
Loading all the libcudf device code (currently near 300MB) to each GPU is also a substantial startup cost. Eliminating legacy should help this. We could also move to device-side dispatch where it makes sense (binops, reductions) and/or do more JIT.
@harrism running import cudf doesn't actually initialize a CUDA context or copy the device code to device memory. It does initialize the driver while validating the GPU architecture / driver version / toolkit version, and it looks like that's taking ~3s per the above. There's an additional ~2s just in Python imports, but I'm not seeing anything stick out in the profile above that's immediately actionable.
What does getDeviceCount call?
What does getDeviceCount call?
cudaGetDeviceCount: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g18808e54893cfcaafefeab31a73cc55f
This apparently doesn't actually create a context though.
This was on a DGX-1. I have noticed that subsequent imports don't take nearly as long. Excluding the three seconds up front to call that CUDA function, my profile looks fairly similar to the original one in this thread. So I'm actually not sure if the 3 seconds spent inside getDeviceCount is what the issue was originally raised about; it might have been the rest of the time spent, modulo the CUDA work.
This apparently doesn't actually create a context though.
Where do the docs say that?
This apparently doesn't actually create a context though.
Where do the docs say that?
They don't; an internal Slack thread @jrhemstad had with CUDA developers confirmed it, though. I can try to dig up a link if you're interested.
OK, then next question: can we find out if all of the 3s in the profile for that call are in the CUDA runtime/driver? (i.e. is there any Numba overhead?)
OK, then next question: can we find out if all of the 3s in the profile for that call are in the CUDA runtime/driver? (i.e. is there any Numba overhead?)
These aren't called via Numba; we wrote our own Cython bindings here, where there should be basically zero overhead.
Sorry, misunderstood. Well, I'm guessing the runtime is doing some sort of initialization at that call...
Sorry, misunderstood. Well, I'm guessing the runtime is doing some sort of initialization at that call...
From my understanding it does the driver API initialization that happens implicitly when a context is created, basically a cuInit(0) call: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__INITIALIZE.html
Can you try compiling and running this code?
#include <stdio.h>
#include <chrono>

int main()
{
    int numgpus = 0;
    printf("starting\n");
    for (int i = 0; i < 2; i++) {
        auto start = std::chrono::high_resolution_clock::now();
        cudaGetDeviceCount(&numgpus);
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> elapsed_seconds = end - start;
        printf("iter: %d: GPUs: %d, elapsed time: %lf\n", i, numgpus, elapsed_seconds.count());
    }
    return 0;
}
Compile it with nvcc test.cu. When I run on a DGX I get the following. The second result limits visible devices to a single GPU (it's faster).
(base) mharris@dgx02:~$ ./a.out
starting
iter: 0: GPUs: 8, elapsed time: 0.394263
iter: 1: GPUs: 8, elapsed time: 0.000000
(base) mharris@dgx02:~$ CUDA_VISIBLE_DEVICES=3 ./a.out
starting
iter: 0: GPUs: 1, elapsed time: 0.079943
iter: 1: GPUs: 1, elapsed time: 0.000000
So with 8 GPUs it takes .39s -- not 3s, but still significant. And the second time it takes less than a microsecond. It must be doing something. I think that something is initializing the driver context. And if libcudf is linked into the module, then its 300MB fatbin is probably being loaded.
You can try on your DGX to see if it's different.
And if libcudf is linked into the module, then its 300MB fatbin is probably being loaded.
libcudf is absolutely linked into the module.
There were some untyped variables in those Cython bindings. I doubt that matters much, but it may affect this a little bit and is easy to fix. Sent PR ( https://github.com/rapidsai/cudf/pull/4925 ) to type them.
Reworked the original proposal, as there were some device functions that needed changes to handle their errors properly. Otherwise it's largely the same idea. Submitted as PR ( https://github.com/rapidsai/cudf/pull/4943 ).
Retested this and it appears to be ~2s now. Longer or shorter depending on the run. Occasionally see an outlier.
Retested this and it appears to be ~2s now. Longer or shorter depending on the run. Occasionally see an outlier.
What kind of machine did you test this on? I suspect this scales somewhat linearly with the number of visible GPUs due to the driver initialization.
DGX-1. Were the previous measurements all on DGX-1s or did some of them use other machines?
FWIW the import time appears to be the same if I restrict to a single device. Tested the following on a DGX-1.
All GPUs:
$ ipython
Python 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 23:03:20)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.14.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: %time import cudf
CPU times: user 4.28 s, sys: 4.02 s, total: 8.29 s
Wall time: 2.12 s
In [2]: cudf._cuda.gpu.getDeviceCount()
Out[2]: 8
One GPU:
$ CUDA_VISIBLE_DEVICES=0 ipython
Python 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 23:03:20)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.14.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: %time import cudf
CPU times: user 4.28 s, sys: 3.57 s, total: 7.85 s
Wall time: 2.14 s
In [2]: cudf._cuda.gpu.getDeviceCount()
Out[2]: 1
Again there is some variability above and below 2s. However this is representative of what I'm seeing.
With 0.14 and legacy removed, I thought this might have improved... I tried on my Linux desktop with a single V100 GPU and importing cudf took 776 ms. But I also tried running in a container on a DGX-1 with only one of the V100s visible and it took 15.5s! Could being in a container make it slow?
With 0.14 and legacy removed, I thought this might have improved... I tried on my Linux desktop with a single V100 GPU and importing cudf took 776 ms. But I also tried running in a container on a DGX-1 with only one of the V100s visible and it took 15.5s! Could being in a container make it slow?
That may be because the DGX didn't have persistence mode enabled and it took a long time for the driver to initialize because the GPUs had to "wake up".
Persistence mode is on.
More data points:
%time from numba import cuda: 1.04s
%time import rmm: 1.29s
I tried importing cudf on the same DGX outside of my rapids-compose container, but presumably my installation isn't quite right because I'm building from source, and it couldn't find a numba submodule. HOWEVER, before it hit that error, I counted for about 12 seconds. That numba loading was inside RMM, which is imported at the beginning of cuDF's __init__.py.
Not sure if that tells us anything.
It adds up on real RAPIDS codes, e.g., 4s for an empty cudf df.
Some observations:
pandas is 1+ s on its own and maintainers seem disinterested: https://github.com/pandas-dev/pandas/issues/7282#issuecomment-442637703
cudf is now 2.7+ s to import, and add another 1-2s for even an empty df. So 1s is blameable on pandas, but the other ~3s+ seems fair game.
dask_cudf adds another 0.5+ s for import, and I didn't measure importing a client and connecting to an existing cluster, but presumably much more.
We've been chasing down why we're seeing 20s-60s starts for a boring RAPIDS web app init that is just setting up routes and running some cudf + UDF warmup routines, so thought this may help. It's been frustrating for production because it slows down first start + autorestarts. More nice-to-have for us, but maybe more important for others, are of course local dev + testing, and, longer term, it precludes fast cold starts for GPU FaaS.
A few more numbers:
(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "1"
real 0m0.047s
user 0m0.043s
sys 0m0.004s
(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "import logging"
real 0m0.080s
user 0m0.068s
sys 0m0.012s
(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "import pandas"
real 0m1.011s
user 0m1.059s
sys 0m0.626s
(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "import cudf"
real 0m2.692s
user 0m2.607s
sys 0m0.811s
(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "import cudf; cudf.DataFrame({'x': []})"
real 0m3.558s
user 0m3.211s
sys 0m1.060s
(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "import dask_cudf; import cudf; cudf.DataFrame({'x': []})"
real 0m4.015s
user 0m3.547s
sys 0m0.794s
And related to the above:
(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "from numba import cuda"
real 0m1.233s
user 0m1.218s
sys 0m0.757s
(rapids) root@8128f9b31ecd:/opt/graphistry/apps/forge/etl-server-python# time python -c "import rmm"
real 0m1.678s
user 0m1.688s
sys 0m0.738s
I spent a little time profiling the import with snakeviz. gpu_utils is taking close to ~2.5 seconds and a big chunk of that is from getDeviceCount. @shwina do you have any thoughts here?
Hmm - nothing sticks out to me in the implementation: https://github.com/NVIDIA/cuda-python/blob/746b773c91e1ede708fe9a584b8cdb1c0f32b51d/cuda/_lib/ccudart/ccudart.pyx#L1458-L1466
I suspect the overhead we're seeing here is from the initialization of the CUDA context.
I can't think of a great way around this. It sounds like the lower bound to our import time is the time for CUDA context initialization, which apparently scales as the square of the number of GPUs.
validate_setup sounds like something we definitely want to do at import time (not later), but we could potentially expose something like an environment variable that allows skipping the check altogether? Would that be useful?
Note that regardless, the CUDA context will eventually need to be initialized at some point if we're going to use the GPU for anything useful.
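A minimal sketch of what such an opt-out could look like (the environment variable name and the cudf._cuda.gpu layout are assumptions for illustration, not an existing cudf API):
import os

def validate_setup():
    # Hypothetical opt-out: skip the GPU/driver/toolkit validation entirely.
    if os.environ.get("CUDF_SKIP_CUDA_VALIDATION", "0") == "1":
        return
    # This is the first CUDA runtime call made at import time, and it is where
    # the driver/context initialization cost discussed above is paid.
    from cudf._cuda import gpu
    gpu.getDeviceCount()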
Googling around suggests C++-level CUDA context creation is expected to be closer to ~100ms than 1s+. We're all Python nowadays so I don't have any C++ handy for seeing if that's still true, or maybe there's a different notion of context. (My tests are single GPU, not DGX.)
Maybe it's also something about how CPython links stuff at runtime? I've been thinking we might have Python import issues around path search or native imports / runtime linking, but haven't dug deep on that.
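For what it's worth, the driver initialization can be timed from Python without any C++; a rough ctypes sketch, assuming libcuda.so.1 is on the loader path:
import ctypes
import time

# cuInit(0) is the driver API call that pays the one-time initialization cost.
libcuda = ctypes.CDLL("libcuda.so.1")
start = time.perf_counter()
err = libcuda.cuInit(0)
print(f"cuInit(0) returned {err} after {time.perf_counter() - start:.3f}s")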
Are your workloads being run in the cloud? If so, it is possible that these are being run on multi-tenant nodes, which have more GPUs than are actually given to any one user. If this is the case, something to play with would be restricting CUDA_VISIBLE_DEVICES as done above to see if that has an effect. This would provide some clue as to whether or not this is the issue.
I've seen this for various local single GPUs, not just cloud: local Windows -> WSL2 -> Ubuntu Docker -> RTX 3080, local Ubuntu Docker -> RTX 3080, and local Ubuntu Docker -> some old GeForce.
Happy to test any C++/Python...
@vyasr Is there a reasonable way to add cudf import time tracking to the python benchmarks?
I think that would be valuable. The issue is that a lot of the slow parts of the import have to do with initialization logic that only occurs the first time that you import cudf. For example:
In [1]: %timeit import cudf
The slowest run took 22.42 times longer than the fastest. This could mean that an intermediate result is being cached.
670 ns ± 1.19 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [2]: %timeit import cudf
77 ns ± 0.189 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
Since cudf gets imported as part of the collection process for pytest and at the top of every module, we'd be obscuring most of that overhead if we tried to incorporate it as part of the same benchmarking suite. If we want to benchmark this, we should have a separate and very simple script that just times the import and exits.
If we really care about accuracy it would be a bit more involved. We would need the script to do something like launch subprocesses (probably serially to avoid any contention issues around NVIDIA driver context creation) that each run the import command and then collect the results. I don't think we need to go that far, though. Even very rough timings will probably tell us what we need.
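A rough sketch of what such a standalone script could look like (names are illustrative; this is not an existing cudf benchmark):
import statistics
import subprocess
import sys

SNIPPET = "import time; t0 = time.perf_counter(); import cudf; print(time.perf_counter() - t0)"

def time_import(runs=5):
    # Run each measurement in a fresh interpreter, serially, so every run
    # pays the full cold-import cost (including CUDA initialization).
    times = []
    for _ in range(runs):
        out = subprocess.run(
            [sys.executable, "-c", SNIPPET], capture_output=True, text=True, check=True
        )
        times.append(float(out.stdout.strip()))
    return times

if __name__ == "__main__":
    samples = time_import()
    print(f"min={min(samples):.2f}s median={statistics.median(samples):.2f}s max={max(samples):.2f}s")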
Something we might also consider is using this lazy loader framework.
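If that refers to the scientific-python lazy_loader package, deferring submodule imports looks roughly like this (an illustrative sketch, not how cudf/__init__.py is actually structured):
# hypothetical cudf/__init__.py snippet
import lazy_loader as lazy

# Submodules are imported on first attribute access instead of eagerly.
__getattr__, __dir__, __all__ = lazy.attach(
    __name__,
    submodules=["datasets", "io"],
    submod_attrs={"core.dataframe": ["DataFrame"]},
)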
I suspect the addition of CUDA Lazy Loading should dramatically improve this situation. This was added in 11.7 and has been improved in subsequent releases.
This is currently an optional feature and can be enabled by specifying the environment variable CUDA_MODULE_LOADING=LAZY at runtime.
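For reference, the variable has to be set before anything in the process initializes CUDA, so either in the shell or at the very top of the script; a minimal sketch:
import os

# Must run before importing cudf (or anything else that touches CUDA).
os.environ.setdefault("CUDA_MODULE_LOADING", "LAZY")

import cudf  # noqa: E402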
Agreed, CUDA lazy loading should help, although I'm not sure how much. Last I checked, import time was distributed across dependencies as well (numba, cupy, rmm), so we'll need to do some careful profiling again with lazy loading enabled to see what else those libraries are doing and how slow other parts of their import are (as well as the validation that cudf does on import, which adds significant overhead).
I had a brief look with pyinstrument:
$ pyinstrument importcudf.py
_ ._ __/__ _ _ _ _ _/_ Recorded: 16:40:36 Samples: 1201
/_//_/// /_\ / //_// / //_'/ // Duration: 2.298 CPU time: 1.758
/ _/ v4.4.0
Program: importcudf.py
2.297 <module> importcudf.py:1
└─ 2.297 <module> cudf/__init__.py:1
├─ 1.064 validate_setup cudf/utils/gpu_utils.py:4
│ ├─ 0.977 <module> rmm/__init__.py:1
│ │ ├─ 0.613 <module> rmm/rmm.py:1
│ │ │ ├─ 0.350 <module> torch/__init__.py:1
│ │ │ │ [617 frames hidden] torch, <built-in>, typing, collection...
│ │ │ └─ 0.261 <module> cupy/__init__.py:1
│ │ │ [668 frames hidden] cupy, pytest, _pytest, attr, <built-i...
│ │ ├─ 0.330 <module> rmm/mr.py:1
│ │ │ └─ 0.330 <module> rmm/_lib/__init__.py:1
│ │ │ ├─ 0.280 <module> numba/__init__.py:1
│ │ │ │ [739 frames hidden] numba, numpy, re, sre_compile, sre_pa...
│ │ │ └─ 0.047 <module> numba/cuda/__init__.py:1
│ │ │ [163 frames hidden] numba, asyncio, ssl, enum, <built-in>...
│ │ └─ 0.034 get_versions rmm/_version.py:515
│ │ └─ 0.034 git_pieces_from_vcs rmm/_version.py:234
│ │ └─ 0.034 run_command rmm/_version.py:71
│ │ └─ 0.027 Popen.communicate subprocess.py:1110
│ │ [7 frames hidden] subprocess, <built-in>, selectors
│ └─ 0.031 getDeviceCount rmm/_cuda/gpu.py:91
├─ 0.390 _setup_numba_linker cudf/core/udf/utils.py:388
│ └─ 0.390 safe_get_versions ptxcompiler/patch.py:212
│ [12 frames hidden] ptxcompiler, subprocess, selectors, <...
├─ 0.308 <module> cudf/api/__init__.py:1
│ └─ 0.301 <module> cudf/api/extensions/__init__.py:1
│ └─ 0.301 <module> cudf/api/extensions/accessor.py:1
│ └─ 0.300 <module> pandas/__init__.py:1
│ [694 frames hidden] pandas, pyarrow, copy, <built-in>, te...
├─ 0.280 <module> cudf/datasets.py:1
│ ├─ 0.253 <module> cudf/_lib/__init__.py:1
│ │ ├─ 0.124 <module> cudf/utils/ioutils.py:1
│ │ │ ├─ 0.096 <module> pyarrow/fs.py:1
│ │ │ │ [4 frames hidden] pyarrow, <built-in>
│ │ │ └─ 0.026 <module> fsspec/__init__.py:1
│ │ │ [79 frames hidden] fsspec, importlib, pathlib, <built-in...
│ │ ├─ 0.079 [self] None
│ │ └─ 0.043 <module> cudf/_lib/strings/__init__.py:1
│ └─ 0.024 <module> cudf/core/column_accessor.py:1
│ └─ 0.024 <module> cudf/core/column/__init__.py:1
├─ 0.135 <module> cudf/core/algorithms.py:1
│ ├─ 0.097 <module> cudf/core/indexed_frame.py:1
│ │ └─ 0.090 <module> cudf/core/groupby/__init__.py:1
│ │ └─ 0.090 <module> cudf/core/groupby/groupby.py:1
│ │ ├─ 0.053 <module> cudf/core/udf/groupby_utils.py:1
│ │ │ └─ 0.053 _get_ptx_file cudf/core/udf/utils.py:291
│ │ │ └─ 0.053 get_current_device numba/cuda/api.py:433
│ │ │ [10 frames hidden] numba
│ │ └─ 0.031 <module> cudf/core/udf/__init__.py:1
│ │ └─ 0.027 <module> cudf/core/udf/row_function.py:1
│ └─ 0.031 <module> cudf/core/index.py:1
│ └─ 0.025 Float32Index.__init_subclass__ cudf/core/mixins/mixin_factory.py:212
│ └─ 0.023 TimedeltaIndex._should_define_operation cudf/core/mixins/mixin_factory.py:77
└─ 0.086 get_versions cudf/_version.py:515
└─ 0.085 git_pieces_from_vcs cudf/_version.py:235
└─ 0.085 run_command cudf/_version.py:72
└─ 0.077 Popen.communicate subprocess.py:1110
[7 frames hidden] subprocess, <built-in>, selectors
To view this report with different options, run:
pyinstrument --load-prev 2023-02-16T16-40-36 [options]
CUDA_MODULE_LOADING=LAZY doesn't really seem to help (I have host driver == cuda-12, and cuda-11.8 in a rapids-compose environment). Adding RAPIDS_NO_INITIALIZE=1 just moves some time around:
2.277 <module> importcudf.py:1
└─ 2.277 <module> cudf/__init__.py:1
├─ 0.453 <module> rmm/__init__.py:1
│ ├─ 0.352 <module> rmm/rmm.py:1
│ │ └─ 0.351 <module> torch/__init__.py:1
│ │ [663 frames hidden] torch, <built-in>, typing, enum, pick...
│ ├─ 0.067 <module> rmm/mr.py:1
│ │ └─ 0.067 <module> rmm/_lib/__init__.py:1
│ │ └─ 0.045 __new__ enum.py:180
│ │ [15 frames hidden] enum, <built-in>
│ └─ 0.034 get_versions rmm/_version.py:515
│ └─ 0.034 git_pieces_from_vcs rmm/_version.py:234
│ └─ 0.034 run_command rmm/_version.py:71
│ └─ 0.027 Popen.communicate subprocess.py:1110
│ [7 frames hidden] subprocess, <built-in>, selectors
├─ 0.407 _setup_numba_linker cudf/core/udf/utils.py:388
│ └─ 0.407 safe_get_versions ptxcompiler/patch.py:212
│ [11 frames hidden] ptxcompiler, subprocess, selectors, <...
├─ 0.364 <module> cupy/__init__.py:1
│ [1068 frames hidden] cupy, pytest, _pytest, <built-in>, at...
├─ 0.356 <module> cudf/api/__init__.py:1
│ └─ 0.348 <module> cudf/api/extensions/__init__.py:1
│ └─ 0.348 <module> cudf/api/extensions/accessor.py:1
│ └─ 0.346 <module> pandas/__init__.py:1
│ [700 frames hidden] pandas, pyarrow, <built-in>, textwrap...
├─ 0.286 <module> cudf/datasets.py:1
│ ├─ 0.258 <module> cudf/_lib/__init__.py:1
│ │ ├─ 0.126 <module> cudf/utils/ioutils.py:1
│ │ │ ├─ 0.097 <module> pyarrow/fs.py:1
│ │ │ │ [4 frames hidden] pyarrow, <built-in>
│ │ │ └─ 0.027 <module> fsspec/__init__.py:1
│ │ │ [86 frames hidden] fsspec, importlib, pathlib, <built-in...
│ │ ├─ 0.084 [self] None
│ │ └─ 0.044 <module> cudf/_lib/strings/__init__.py:1
│ └─ 0.025 <module> cudf/core/column_accessor.py:1
│ └─ 0.024 <module> cudf/core/column/__init__.py:1
├─ 0.183 <module> numba/__init__.py:1
│ [495 frames hidden] numba, llvmlite, ctypes, <built-in>, ...
├─ 0.085 get_versions cudf/_version.py:515
│ └─ 0.084 git_pieces_from_vcs cudf/_version.py:235
│ └─ 0.084 run_command cudf/_version.py:72
│ └─ 0.076 Popen.communicate subprocess.py:1110
│ [7 frames hidden] subprocess, <built-in>, selectors
├─ 0.081 <module> cudf/core/algorithms.py:1
│ ├─ 0.042 <module> cudf/core/indexed_frame.py:1
│ │ └─ 0.036 <module> cudf/core/groupby/__init__.py:1
│ │ └─ 0.036 <module> cudf/core/groupby/groupby.py:1
│ │ └─ 0.031 <module> cudf/core/udf/__init__.py:1
│ │ └─ 0.027 <module> cudf/core/udf/row_function.py:1
│ └─ 0.032 <module> cudf/core/index.py:1
│ └─ 0.024 UInt64Index.__init_subclass__ cudf/core/mixins/mixin_factory.py:212
│ └─ 0.024 UInt64Index._should_define_operation cudf/core/mixins/mixin_factory.py:77
│ └─ 0.024 dir None
│ [2 frames hidden] <built-in>
└─ 0.026 <module> numba/cuda/__init__.py:1
[86 frames hidden] numba, <built-in>, collections, llvml...
Numba forks a subprocess to compile some CUDA code to determine if it needs to patch the PTX linker. We can turn that off:
1.895 <module> importcudf.py:1
└─ 1.895 <module> cudf/__init__.py:1
├─ 0.468 <module> rmm/__init__.py:1
│ ├─ 0.359 <module> rmm/rmm.py:1
│ │ └─ 0.358 <module> torch/__init__.py:1
│ │ [687 frames hidden] torch, <built-in>, typing, enum, insp...
│ ├─ 0.073 <module> rmm/mr.py:1
│ │ └─ 0.072 <module> rmm/_lib/__init__.py:1
│ │ └─ 0.052 __new__ enum.py:180
│ │ [23 frames hidden] enum, <built-in>
│ └─ 0.036 get_versions rmm/_version.py:515
│ └─ 0.036 git_pieces_from_vcs rmm/_version.py:234
│ └─ 0.036 run_command rmm/_version.py:71
│ └─ 0.029 Popen.communicate subprocess.py:1110
│ [7 frames hidden] subprocess, <built-in>, selectors
├─ 0.363 <module> cupy/__init__.py:1
│ [1034 frames hidden] cupy, pytest, _pytest, <built-in>, at...
├─ 0.362 <module> cudf/api/__init__.py:1
│ └─ 0.354 <module> cudf/api/extensions/__init__.py:1
│ └─ 0.354 <module> cudf/api/extensions/accessor.py:1
│ └─ 0.353 <module> pandas/__init__.py:1
│ [722 frames hidden] pandas, pyarrow, textwrap, <built-in>...
├─ 0.284 <module> cudf/datasets.py:1
│ ├─ 0.257 <module> cudf/_lib/__init__.py:1
│ │ ├─ 0.127 <module> cudf/utils/ioutils.py:1
│ │ │ ├─ 0.098 <module> pyarrow/fs.py:1
│ │ │ │ [4 frames hidden] pyarrow, <built-in>
│ │ │ └─ 0.028 <module> fsspec/__init__.py:1
│ │ │ [97 frames hidden] fsspec, importlib, pathlib, <built-in...
│ │ ├─ 0.082 [self] None
│ │ └─ 0.043 <module> cudf/_lib/strings/__init__.py:1
│ └─ 0.024 <module> cudf/core/column_accessor.py:1
│ └─ 0.024 <module> cudf/core/column/__init__.py:1
├─ 0.194 <module> numba/__init__.py:1
│ [490 frames hidden] numba, llvmlite, ctypes, <built-in>, ...
├─ 0.081 <module> cudf/core/algorithms.py:1
│ ├─ 0.043 <module> cudf/core/indexed_frame.py:1
│ │ └─ 0.036 <module> cudf/core/groupby/__init__.py:1
│ │ └─ 0.036 <module> cudf/core/groupby/groupby.py:1
│ │ └─ 0.031 <module> cudf/core/udf/__init__.py:1
│ │ └─ 0.027 <module> cudf/core/udf/row_function.py:1
│ │ └─ 0.021 <module> cudf/core/udf/utils.py:1
│ └─ 0.032 <module> cudf/core/index.py:1
│ └─ 0.024 GenericIndex.__init_subclass__ cudf/core/mixins/mixin_factory.py:212
│ └─ 0.022 GenericIndex._should_define_operation cudf/core/mixins/mixin_factory.py:77
│ └─ 0.020 dir None
│ [2 frames hidden] <built-in>
├─ 0.078 get_versions cudf/_version.py:515
│ └─ 0.078 git_pieces_from_vcs cudf/_version.py:235
│ └─ 0.078 run_command cudf/_version.py:72
│ └─ 0.071 Popen.communicate subprocess.py:1110
│ [7 frames hidden] subprocess, <built-in>, selectors
├─ 0.029 <module> numba/cuda/__init__.py:1
│ [110 frames hidden] numba, collections, re, types, <built...
└─ 0.022 <module> cudf/io/__init__.py:1
RMM imports pytorch if it's in the environment:
try:
    from torch.cuda.memory import CUDAPluggableAllocator
except ImportError:
    rmm_torch_allocator = None
else:
    import rmm._lib.torch_allocator

    _alloc_free_lib_path = rmm._lib.torch_allocator.__file__
    rmm_torch_allocator = CUDAPluggableAllocator(
        _alloc_free_lib_path,
        alloc_fn_name="allocate",
        free_fn_name="deallocate",
    )
Let's suppose we fix that with lazy-loading.
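One illustrative way to defer that import is a module-level __getattr__ (PEP 562), so torch is only pulled in when rmm_torch_allocator is first accessed; a sketch, not necessarily how rmm ended up addressing it:
def __getattr__(name):
    # Only pay the torch import cost when the torch allocator is actually requested.
    if name == "rmm_torch_allocator":
        try:
            from torch.cuda.memory import CUDAPluggableAllocator
        except ImportError:
            return None
        import rmm._lib.torch_allocator as _torch_allocator

        return CUDAPluggableAllocator(
            _torch_allocator.__file__,
            alloc_fn_name="allocate",
            free_fn_name="deallocate",
        )
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")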
1.469 <module> importcudf.py:1
└─ 1.467 <module> cudf/__init__.py:1
├─ 0.360 <module> cupy/__init__.py:1
│ [1023 frames hidden] cupy, pytest, _pytest, <built-in>, at...
├─ 0.282 <module> cudf/api/__init__.py:1
│ └─ 0.274 <module> cudf/api/extensions/__init__.py:1
│ └─ 0.274 <module> cudf/api/extensions/accessor.py:1
│ └─ 0.272 <module> pandas/__init__.py:1
│ [695 frames hidden] pandas, pyarrow, <built-in>, textwrap...
├─ 0.275 <module> cudf/datasets.py:1
│ ├─ 0.249 <module> cudf/_lib/__init__.py:1
│ │ ├─ 0.125 <module> cudf/utils/ioutils.py:1
│ │ │ ├─ 0.096 <module> pyarrow/fs.py:1
│ │ │ │ [4 frames hidden] pyarrow, <built-in>
│ │ │ └─ 0.028 <module> fsspec/__init__.py:1
│ │ │ [91 frames hidden] fsspec, importlib, pathlib, <built-in...
│ │ ├─ 0.077 [self] None
│ │ └─ 0.042 <module> cudf/_lib/strings/__init__.py:1
│ └─ 0.025 <module> cudf/core/column_accessor.py:1
│ └─ 0.024 <module> cudf/core/column/__init__.py:1
├─ 0.182 <module> numba/__init__.py:1
│ [446 frames hidden] numba, llvmlite, ctypes, <built-in>, ...
├─ 0.118 <module> cudf/core/algorithms.py:1
│ ├─ 0.070 <module> cudf/core/index.py:1
│ │ ├─ 0.041 <module> cudf/core/frame.py:1
│ │ │ └─ 0.039 Frame cudf/core/frame.py:51
│ │ │ └─ 0.039 _cudf_nvtx_annotate cudf/utils/utils.py:385
│ │ │ └─ 0.039 annotate.__call__ nvtx/nvtx.py:89
│ │ │ [2 frames hidden] nvtx
│ │ └─ 0.025 Float64Index.__init_subclass__ cudf/core/mixins/mixin_factory.py:212
│ │ └─ 0.024 Float64Index._should_define_operation cudf/core/mixins/mixin_factory.py:77
│ │ └─ 0.023 dir None
│ │ [2 frames hidden] <built-in>
│ └─ 0.041 <module> cudf/core/indexed_frame.py:1
│ └─ 0.035 <module> cudf/core/groupby/__init__.py:1
│ └─ 0.035 <module> cudf/core/groupby/groupby.py:1
│ └─ 0.030 <module> cudf/core/udf/__init__.py:1
│ └─ 0.025 <module> cudf/core/udf/row_function.py:1
│ └─ 0.020 <module> cudf/core/udf/utils.py:1
├─ 0.102 <module> rmm/__init__.py:1
│ ├─ 0.066 <module> rmm/mr.py:1
│ │ └─ 0.066 <module> rmm/_lib/__init__.py:1
│ │ └─ 0.045 __new__ enum.py:180
│ │ [16 frames hidden] enum, <built-in>
│ └─ 0.034 get_versions rmm/_version.py:515
│ └─ 0.034 git_pieces_from_vcs rmm/_version.py:234
│ └─ 0.034 run_command rmm/_version.py:71
│ └─ 0.027 Popen.communicate subprocess.py:1110
│ [7 frames hidden] subprocess, <built-in>, selectors
├─ 0.088 get_versions cudf/_version.py:515
│ └─ 0.087 git_pieces_from_vcs cudf/_version.py:235
│ └─ 0.087 run_command cudf/_version.py:72
│ └─ 0.080 Popen.communicate subprocess.py:1110
│ [7 frames hidden] subprocess, <built-in>, selectors
├─ 0.026 <module> numba/cuda/__init__.py:1
│ [100 frames hidden] numba, <built-in>, llvmlite, inspect,...
└─ 0.021 <module> cudf/io/__init__.py:1
So now the largest single cost is importing cupy, which seems unavoidable. With a little bit of trimming one might get down to 1s.
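Another cheap way to attribute the remaining time per module is CPython's built-in -X importtime; a small sketch that captures its stderr report:
import subprocess
import sys

# -X importtime prints a cumulative per-module import time breakdown to stderr.
result = subprocess.run(
    [sys.executable, "-X", "importtime", "-c", "import cudf"],
    capture_output=True,
    text=True,
)
print(result.stderr)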
It takes an ecosystem, with each lib chiseling away at its part, so is it worth it?
We're thinking of scenarios like CPU CI, lazy loading, and pay-as-you-go for partial surface use.
Describe the bug
It takes a few seconds to import cudf today
Steps/Code to reproduce bug