zzjjbb opened this issue 1 month ago
Thanks for the report, I had no idea. Continuing along those lines, I'm not sure I have a good idea for how to approach this. We could introduce a flag so you don't have to monkeypatch, but that sacrifices correctness to an extent.
Does this make sense: we check `include_dirs`, and skip the `#include` preprocessing step when it's empty? This should cover most of the simple cases, though it may still be incorrect if the user upgrades the CUDA/pycuda version and keeps the cache (it's possible to append these version numbers to the cache folder/file name to invalidate this case). Also, we could do this only for the poor Windows users to minimize the potential problems.
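A minimal sketch of that idea, assuming a hash-based cache key like pycuda's; the names `cache_key` and the `preprocess_source` stub here are hypothetical and not pycuda's actual internals:

```python
import hashlib


def preprocess_source(source, include_dirs):
    # Stand-in for the slow `nvcc --preprocess` call; the real version
    # would expand #include directives so header changes are hashed too.
    return source


def cache_key(source, include_dirs, cuda_version, pycuda_version):
    """Compute a cache key, skipping the preprocessing step when there
    are no include dirs, and folding tool versions into the hash so that
    upgrading CUDA/pycuda invalidates stale cache entries."""
    if include_dirs:
        source = preprocess_source(source, include_dirs)
    h = hashlib.md5()
    h.update(source.encode())
    h.update(str(cuda_version).encode())
    h.update(str(pycuda_version).encode())
    return h.hexdigest()
```

With the versions in the key, a CUDA or pycuda upgrade produces a different key and therefore a cache miss, which covers the stale-cache case without running nvcc.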
**Is your feature request related to a problem? Please describe.**
This problem has been annoying me for years: pycuda runs extremely slowly on Windows but not on Linux. My program contains ~20 `ElementwiseKernel`s and `ReductionKernel`s. I found that `SourceModule` is used to compile the code, and it saves the cubin files to the `cache_dir`. This works well on every Linux machine I tested, with only ~1s of overhead to load the functions later. However, running my code on Windows costs ~2min the first time, and still ~1min on later runs. This is because the code always needs to be preprocessed, since the source always contains `#include <pycuda-complex.hpp>`: https://github.com/inducer/pycuda/blob/96aab3f4762eb90d9b32c04bbe88bd3aefdc5cc8/pycuda/compiler.py#L89-L90 As I tested, on any Windows computer, running `nvcc --preprocess "empty_file.cu" --compiler-options -EP` takes several seconds. In other words, merely deciding whether the cache can be used takes a very long time.

**Describe the solution you'd like**
I tried monkeypatching pycuda to remove the preprocess call above, and it works well, but I'd like to find a better way to do it. The easiest option I can think of is adding a flag to force skipping the `#include` check (it should not be enabled by default, since the user must understand the risk of stale cache entries).

**Describe alternatives you've considered**
Are there any nvcc options that speed up preprocessing? I don't know of any.
**Additional context**
The link below is one of the examples I worked on, but I guess any simple functionality of `GPUArray` that relies on `SourceModule` is affected by this. https://github.com/bu-cisl/SSNP-IDT/blob/master/examples/forward_model.py