mlabonne / llm-autoeval

Issues with BF16 models #22

Closed · anakin87 closed this issue 3 months ago

anakin87 commented 3 months ago

Hey... Thanks for the great work!

While trying to evaluate a BF16 model, I encountered an error in my RunPod container: `"triu_tril_cuda_template" not implemented for 'BFloat16'` (https://github.com/pytorch/pytorch/issues/101932).
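
For context, a minimal sketch of the failing operation (my assumption: the causal-mask construction ends up calling torch.triu/torch.tril on a bfloat16 tensor on the GPU):

```python
import torch

# Minimal repro sketch (assumption: the mask code path calls torch.triu/torch.tril
# on a bfloat16 tensor that lives on the GPU).
x = torch.ones(4, 4, dtype=torch.bfloat16, device="cuda")

# On the PyTorch 2.0.1 image this raises:
#   RuntimeError: "triu_tril_cuda_template" not implemented for 'BFloat16'
# On the newer PyTorch 2.2.0 image the op is supported and this runs fine.
mask = torch.triu(x, diagonal=1)
print(mask)
```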

Switching the image from `runpod/pytorch:2.0.1-py3.10-cuda11.8.0-devel-ubuntu22.04` to `runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04` fixed the issue.
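
In case it helps, here is a quick sanity check to run inside the container to confirm which build the image actually ships (the version strings in the comments are examples, not guaranteed output):

```python
import torch

# Confirm the PyTorch/CUDA build provided by the RunPod image.
print(torch.__version__)               # e.g. 2.2.0+cu121 on the newer image
print(torch.version.cuda)              # e.g. 12.1
print(torch.cuda.is_bf16_supported())  # True on GPUs with native BF16 support
```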

I'm reporting this for others who may run into the same problem. I'm not sure whether it makes sense to update the Colab notebook to use a newer image, or whether that might reveal other problems.
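
If the notebook were updated, I'd expect the change to be just the image string passed when the pod is created; a hypothetical sketch with the runpod SDK (the pod name and GPU id are placeholders, not the notebook's actual values):

```python
import runpod

# Hypothetical sketch: start the evaluation pod with the newer PyTorch image.
# The exact arguments used by the Colab notebook may differ.
pod = runpod.create_pod(
    name="llm-autoeval",
    image_name="runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04",
    gpu_type_id="NVIDIA GeForce RTX 3090",
)
```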

mlabonne commented 3 months ago

Thanks @anakin87! That's weird; I evaluate BF16 models all the time (automerged models, for example). Would you be able to reproduce this error with another BF16 model by any chance? Thanks a lot for the fix!

anakin87 commented 3 months ago

Thanks for the feedback. Thinking about it more, it's probably because I used PyTorch 2.2.0 for training. 🙂

Feel free to close the issue.

mlabonne commented 3 months ago

Cool! I added it to the troubleshooting section; it might be helpful. Thanks.