tsengalb99 opened this issue 11 months ago
This is more or less planned: we want to support static caching so the models can be compiled for faster inference 😉 cc @gante. This might already have been asked in other issues as well.
@tsengalb99 as Arthur wrote, we are working on it :D Expect to see updates soon
Are there any updates on this? And what is the main reason CUDA graphs don't work right now?
Follow PR #27931 for updates; the dynamic KV cache is the main issue.
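For intuition, here is a minimal illustration (toy tensors, not the actual transformers cache code) of why a dynamic KV cache is incompatible with CUDA graph capture while a static one is not:

```python
import torch

# Illustration only: a dynamic KV cache concatenates new keys each step,
# so the tensor's shape and memory address change on every iteration,
# which invalidates a captured CUDA graph.
past_k = torch.zeros(1, 8, 0, 64, device="cuda")
for step in range(4):
    new_k = torch.randn(1, 8, 1, 64, device="cuda")
    past_k = torch.cat([past_k, new_k], dim=2)  # fresh allocation every step

# A static cache instead pre-allocates to a max length and writes in place,
# keeping shape and address fixed so a captured graph stays valid on replay.
static_k = torch.zeros(1, 8, 1024, 64, device="cuda")
for step in range(4):
    new_k = torch.randn(1, 8, 1, 64, device="cuda")
    static_k[:, :, step : step + 1, :] = new_k  # same tensor, stable address
```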
The PR is still very much active and now supports CUDA graphs
Great, looking forward to seeing it merged! Do you have an ETA on when that will happen?
It only needs a final review, so this week 😉
Hi Arthur,
I saw the PR was merged. What is the recommended way to use CUDA graphs during generation? I am currently wrapping the entire model in a torch CUDA graph wrapper and still getting the same graph-breaking errors as before.
Thanks, Albert
Hey! Here is how I used it: https://gist.github.com/ArthurZucker/af34221def212259b43d55a2811d2dbb. I used torch.compile, so I'm not 100% sure how an explicit CUDA graph call will work! Feel free to reach out if it does not work!
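For reference, a minimal sketch of that torch.compile route (the model name is just an example; assumes a transformers version that includes the static cache from #27931):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # example; any model with static cache support
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# The static cache pre-allocates the KV tensors, so shapes stay fixed across decoding steps.
model.generation_config.cache_implementation = "static"

# "reduce-overhead" mode records CUDA graphs under the hood.
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```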
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
A PR is coming for this! #29374
Feature request
In my experiments, I cannot get torch CUDA graphs to work with HF generate. CUDA graphs work fine when calling a model's forward pass, but stream capture fails when calling .generate(), whether because of CUDA graphs' requirement for static input/output sizes or for some other reason. Can support for torch CUDA graphs be added? (A manual-capture sketch of the forward-pass case that does work is shown below.)
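A minimal sketch of the manual capture path that works on a plain forward pass, using a toy `torch.nn.Linear` stand-in rather than a real HF model; it follows the warm-up and capture pattern from the PyTorch CUDA graphs documentation:

```python
import torch

# Toy stand-in for a model; CUDA graphs need fixed shapes and stable tensor addresses.
model = torch.nn.Linear(128, 128).to("cuda")
static_input = torch.randn(8, 128, device="cuda")

# Warm up on a side stream before capture, as the PyTorch docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay: copy fresh data into the captured input buffer, then replay.
static_input.copy_(torch.randn(8, 128, device="cuda"))
g.replay()  # static_output now holds the result for the new input
```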
Motivation
LLM decoding involves many small kernel launches, and CUDA graphs can remove most of the launch overhead. In my experiments with just the forward call, CUDA-graphed models can run twice as fast as non-graphed versions of the same model.
Your contribution
n/a