Minami-su closed this issue 5 months ago
Try setting fattn: false
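As a minimal sketch, assuming the setting lives in a JSON config file such as the config.json mentioned later in this thread (if your project uses a YAML config instead, the same key applies there), you could flip the flag like this:

```python
import json

# Sketch: disable the Triton Flash Attention path by setting "fattn" to false.
# The file name "config.json" and the flat key location are assumptions;
# adjust the path and nesting to match your actual setup.
with open("config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

config["fattn"] = False

with open("config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)
```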
We are currently using Triton for our custom Flash Attention implementation, and it uses more shared memory than the original CUDA version. We plan to refactor this in the future.
Also, we have not yet tested our implementation's compatibility with quantized models; the PyTorch version may have better compatibility.
If you run into other questions, please feel free to ask. Thank you for your support!
OK, resolved!
Can this be solved by adjusting the configuration parameters? If so, which one? I'm using load_in_4bit=True with config.json.
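For reference, here is a minimal sketch of how 4-bit loading is typically requested when a model is loaded through Hugging Face transformers with bitsandbytes. This assumes your project goes through the transformers loading path; the model ID is a placeholder, and whether this interacts with the fattn setting above depends on the project itself:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Request 4-bit quantized weights via bitsandbytes.
quant_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "your-model-id",              # placeholder: replace with the actual checkpoint
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("your-model-id")
```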