zzjjbb opened this issue 1 month ago
Thanks for the report, I had no idea. Continuing along those lines, I'm not sure I have a good idea for how to approach this. We could introduce a flag so you don't have to monkeypatch, but that sacrifices correctness to an extent.
Does this make sense: we check `include_dirs`, and skip the `#include` preprocessing step when it's empty? This should cover most of the simple cases, though it may still be incorrect if the user upgrades the CUDA/pycuda version and keeps the cache (it's possible to append these version numbers to the cache folder/file name to invalidate this case). Also, we could do this only for the poor Windows users to minimize the potential problems.
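A minimal sketch of that idea, assuming a hash-based cache key like pycuda's; the names `cache_key` and the `preprocess_source` stub here are hypothetical and not pycuda's actual internals:

```python
import hashlib


def preprocess_source(source, include_dirs):
    # Stand-in for the slow `nvcc --preprocess` call; the real version
    # would expand #include directives so header changes are hashed too.
    return source


def cache_key(source, include_dirs, cuda_version, pycuda_version):
    """Compute a cache key, skipping the preprocessing step when there
    are no include dirs, and folding tool versions into the hash so that
    upgrading CUDA/pycuda invalidates stale cache entries."""
    if include_dirs:
        source = preprocess_source(source, include_dirs)
    h = hashlib.md5()
    h.update(source.encode())
    h.update(str(cuda_version).encode())
    h.update(str(pycuda_version).encode())
    return h.hexdigest()
```

With the versions in the key, a CUDA or pycuda upgrade produces a different key and therefore a cache miss, which covers the stale-cache case without running nvcc.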
**Is your feature request related to a problem? Please describe.**
This problem has been annoying me for years: pycuda runs extremely slowly on Windows but not on Linux. My program contains ~20 `ElementwiseKernel`s and `ReductionKernel`s. I found that `SourceModule` is used to compile the code, and it saves the cubin files to the `cache_dir`. This works well on every Linux machine I tested, with only ~1s of overhead to load the functions later. However, running my code on Windows costs ~2min the first time, and still ~1min on later runs. This is because the code always needs to be preprocessed, since the source always contains `#include <pycuda-complex.hpp>`: https://github.com/inducer/pycuda/blob/96aab3f4762eb90d9b32c04bbe88bd3aefdc5cc8/pycuda/compiler.py#L89-L90 As I tested, on any Windows computer, running `nvcc --preprocess "empty_file.cu" --compiler-options -EP` takes several seconds. In other words, merely deciding whether the cache can be used takes a very long time.

**Describe the solution you'd like**
I tried monkeypatching pycuda to remove the preprocess call above, and it works well, but I'd like to find a better way to do it. The easiest option I can think of is adding a flag to force skipping the `#include` check (it should not be enabled by default, since the user must understand the risk of stale cache entries).

**Describe alternatives you've considered**
Are there any nvcc options that speed up preprocessing? I don't know of any.
**Additional context**
The link below is one of the examples I worked on, but I guess any simple functionality of `GPUArray` that relies on `SourceModule` is affected by this. https://github.com/bu-cisl/SSNP-IDT/blob/master/examples/forward_model.py