Closed: yjssa closed this issue 1 year ago
It is caused by the gather operator, which makes redundant memory copies (k-fold redundancy). The only way to avoid that is to implement a custom CUDA kernel.
I've only implemented an FP32 forward kernel so far, which is faster and more memory-efficient. But there are still a lot of things to do: backward kernel, FP16, tensor cores, etc. :(
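To make the k-fold redundancy concrete, here is a minimal pure-Python sketch. It is not the BiFormer code: `gather_regions`, `count_elements`, and the toy sizes are hypothetical names and values chosen only for illustration. It assumes each query region gathers a private copy of the key/value tokens of its k routed regions, so the gathered buffer holds k times as many elements as the original.

```python
def gather_regions(kv_regions, topk_index):
    """For each query region, copy the key/value tokens of its k routed regions.

    kv_regions: list of regions, each a list of token values.
    topk_index: for each query region, the indices of the k regions it attends to.
    """
    return [[list(kv_regions[j]) for j in routed] for routed in topk_index]


def count_elements(nested):
    """Count scalar leaves in an arbitrarily nested list."""
    if isinstance(nested, list):
        return sum(count_elements(item) for item in nested)
    return 1


# Toy sizes (assumptions for illustration only).
num_regions, tokens_per_region, k = 4, 3, 2

kv_regions = [
    [float(r * tokens_per_region + t) for t in range(tokens_per_region)]
    for r in range(num_regions)
]
# Hypothetical routing: region r attends to regions r, r+1, ... (wrapping around).
topk_index = [[(r + i) % num_regions for i in range(k)] for r in range(num_regions)]

gathered = gather_regions(kv_regions, topk_index)

original_size = count_elements(kv_regions)
gathered_size = count_elements(gathered)
redundancy = gathered_size / original_size

print(f"original: {original_size}, gathered: {gathered_size}, redundancy: {redundancy}x")
```

A fused kernel could instead read the routed regions in place through the index, so no k-times-larger intermediate buffer would need to be materialized.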
On 19 Jun 2023, at 7:29 PM, 一介书生 @.***> wrote:
When I feed a 1×20×600×500 image into nchwBRA, memory usage reaches about 100 GB. Is there any way to reduce the memory usage? (My images cannot be cropped.) Thank you very much.