After the fix above it works, but it looks like the LSTM gets unrolled with this implementation, and that's why it doesn't work with dynamic shapes. With this implementation, performance is much worse: it creates a lot of kernels instead of the few cuDNN kernels used before calling `exported_m.compile()`. I'm attaching screenshots and traces.

My question is: is there any way to fall back to the cuDNN implementation for LSTM after calling `.compile()`, but compile the other modules in the model with Triton?
```
Collecting environment information...
Traceback (most recent call last):
  File "/home/vscode/.cache/bazel/_bazel_vscode/93fd2cd9b3c5d87ae416561bff883334/execroot/__main__/bazel-out/k8-opt/bin/prediction/e.jupyter.runfiles/__main__/collect_env.py", line 693, in <module>
    main()
  File "/home/vscode/.cache/bazel/_bazel_vscode/93fd2cd9b3c5d87ae416561bff883334/execroot/__main__/bazel-out/k8-opt/bin/prediction/e.jupyter.runfiles/__main__/collect_env.py", line 676, in main
    output = get_pretty_env_info()
  File "/home/vscode/.cache/bazel/_bazel_vscode/93fd2cd9b3c5d87ae416561bff883334/execroot/__main__/bazel-out/k8-opt/bin/prediction/e.jupyter.runfiles/__main__/collect_env.py", line 671, in get_pretty_env_info
    return pretty_str(get_env_info())
  File "/home/vscode/.cache/bazel/_bazel_vscode/93fd2cd9b3c5d87ae416561bff883334/execroot/__main__/bazel-out/k8-opt/bin/prediction/e.jupyter.runfiles/__main__/collect_env.py", line 496, in get_env_info
    pip_version, pip_list_output = get_pip_packages(run_lambda)
  File "/home/vscode/.cache/bazel/_bazel_vscode/93fd2cd9b3c5d87ae416561bff883334/execroot/__main__/bazel-out/k8-opt/bin/prediction/e.jupyter.runfiles/__main__/collect_env.py", line 453, in get_pip_packages
    out = run_with_pip([sys.executable, '-mpip'])
  File "/home/vscode/.cache/bazel/_bazel_vscode/93fd2cd9b3c5d87ae416561bff883334/execroot/__main__/bazel-out/k8-opt/bin/prediction/e.jupyter.runfiles/__main__/collect_env.py", line 448, in run_with_pip
    for line in out.splitlines()
AttributeError: 'NoneType' object has no attribute 'splitlines'
```
🐛 Describe the bug
I tried to export and compile an LSTM model, and its performance ends up much worse than the CUDA (cuDNN) baseline, both in total kernel time and in number of operations.
The printed program is:
The problems:
- `_update_flat_weights`
- in my custom code: `exported_m.compile()`
I'm attaching screenshots and traces.

My question is: is there any way to fall back to the cuDNN implementation for LSTM after calling `.compile()`, but compile the other modules in the model with Triton?
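For context, the kind of per-submodule split I have in mind looks roughly like the sketch below. The model and names (`MyModel`, `lstm`, `head`) are made up for illustration; `backend="eager"` is used only to keep the sketch lightweight, in practice it would be the default Inductor/Triton backend.

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    """Hypothetical model for illustration: an LSTM followed by a linear head."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
        self.head = nn.Linear(32, 4)

    def forward(self, x):
        out, _ = self.lstm(x)  # stays eager, so on GPU it dispatches to cuDNN
        return self.head(out)

m = MyModel()

# Compile only the non-LSTM submodule; the LSTM keeps its eager (cuDNN) path.
m.head = torch.compile(m.head, backend="eager")

# Alternatively, mark the LSTM as a region that torch.compile must skip:
# m.lstm.forward = torch.compiler.disable(m.lstm.forward)

y = m(torch.randn(2, 8, 16))
print(y.shape)  # torch.Size([2, 8, 4])
```

This works on the original eager model; what I'd like to know is whether something equivalent is possible on the module produced by `torch.export`.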
Exported and compiled model: trace_compiled.json
Exported, but not compiled, model: trace_exported.json
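For anyone reproducing the comparison, a minimal way to count operator events around a single forward pass under the profiler (a sketch, not the exact script used for the attached traces; on GPU you would also pass `ProfilerActivity.CUDA`):

```python
import torch
from torch.profiler import profile, ProfilerActivity

def count_op_events(fn, *args):
    """Run fn once under the profiler and count distinct operator events."""
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        fn(*args)
    return len(prof.key_averages())

lstm = torch.nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
x = torch.randn(4, 10, 16)
n = count_op_events(lstm, x)
print(n)  # number of distinct operator names in one eager forward pass
```

In the compiled trace the equivalent count is much higher, which is the effect described above.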
Error logs
No response
Versions
Unfortunately collect_env failed (traceback above), but my torch version is 2.5.1.
cc @ezyang @chauhang @penguinwu