openai / consistency_models

Official repo for consistency models.

Implement a non-CUDA flash attention module #33

Open · by321 opened 1 year ago

by321 commented 1 year ago

The flash attention module from Hazy Research is CUDA-only, which limits this repo to CUDA as well. I suggest writing a separate flash attention module for machines without an Nvidia GPU; the current module could still be used when an Nvidia card is present.

A couple of people have raised this with Hazy Research, but they said they are focused on CUDA and are not interested in writing a non-CUDA version.
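
A minimal sketch of the gating being suggested, not code from this repo: treat flash-attn as an optional accelerator and fall back to plain scaled dot-product attention elsewhere. The flag name `HAS_FLASH_ATTN` and the function `fallback_attention` are hypothetical, and the flash-attn import path varies by version.

```python
import torch

try:
    # flash-attn only builds against CUDA, so treat an import failure
    # (or the absence of a CUDA device) as "not available".
    import flash_attn  # noqa: F401
    HAS_FLASH_ATTN = torch.cuda.is_available()
except ImportError:
    HAS_FLASH_ATTN = False

def fallback_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim). Plain scaled
    # dot-product attention that runs on CPU, MPS, or any GPU.
    scale = q.shape[-1] ** -0.5
    weights = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
    return weights @ v
```

The model code would then branch on `HAS_FLASH_ATTN`, calling the flash-attn kernel when it is set and `fallback_attention` otherwise, so the same checkpoint runs on non-Nvidia hardware.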

ayushtues commented 1 year ago

Added PR #37 to avoid using flash attention.
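
The contents of PR #37 are not shown here, but for anyone on a recent stack: PyTorch 2.0+ ships `torch.nn.functional.scaled_dot_product_attention`, which dispatches to a fused flash-attention kernel on supported CUDA GPUs and falls back to a portable implementation everywhere else, so it sidesteps the flash-attn dependency entirely.

```python
import torch
import torch.nn.functional as F

# Toy tensors shaped (batch, heads, seq_len, head_dim).
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# Picks the best available backend automatically; works on CPU too.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```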