Closed DeuceOfClubs closed 7 months ago
I changed the FP32 dtypes near KV_cache to FP16 and didn't see any loss in quality. Unfortunately, memory usage stayed about the same. A lot of other calculations are also done in FP32 for some reason. I haven't tried replacing every FP32 with FP16 yet.
Where else did you change it? I added model.half and also changed the float32s to float16s. Memory usage can still spike.
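For anyone trying to judge how much the KV cache dtype can matter, here is a rough back-of-the-envelope sketch. It only estimates the size of the cache itself, not weights or activations, and every number in `cfg` is a hypothetical placeholder rather than this model's real configuration, so substitute your own values.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    """Estimate KV cache size: one key and one value tensor per layer."""
    # Factor of 2 covers keys plus values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical configuration, for illustration only (not taken from this thread).
cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=2048, batch=1)

fp32_bytes = kv_cache_bytes(**cfg, bytes_per_elem=4)  # float32 is 4 bytes per element
fp16_bytes = kv_cache_bytes(**cfg, bytes_per_elem=2)  # float16 is 2 bytes per element

print(f"KV cache in FP32: {fp32_bytes / 1024**3:.2f} GiB")
print(f"KV cache in FP16: {fp16_bytes / 1024**3:.2f} GiB")
print(f"Saved by switching: {(fp32_bytes - fp16_bytes) / 1024**3:.2f} GiB")
```

If the saving this prints is small compared with the total you see in use, the cache dtype alone probably won't move the overall number much, and the remaining FP32 calculations would be the next place to look.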