latitudegames / AIDungeon

Infinite adventures await!
http://www.aidungeon.io/
MIT License

[FEAT] Quantize the model used at runtime #176

Open Deltrego opened 4 years ago

Deltrego commented 4 years ago

Hello, I was wondering if it is possible to quantize the trained GPT-2 model (from 32-bit float to 8-bit fixed point). I understand this would reduce file size and memory usage quite a bit. TensorFlow has scripts that perform the conversion almost automatically. You would obviously train on the original model but release the quantized one for end users.
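For reference, a minimal sketch of what that conversion might look like with TensorFlow Lite's post-training dynamic-range quantization (the `saved_model_dir` path is a placeholder; the trained checkpoint would need to be exported as a SavedModel first):

```python
import tensorflow as tf

# Load the exported GPT-2 SavedModel (placeholder path).
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")

# Dynamic-range quantization: weights are stored as 8-bit integers,
# activations remain float at runtime.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```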

EvanBalster commented 4 years ago

I've done some reading and it sounds like a 16-bit TensorFlow Lite model would suffer relatively little degradation in quality (<0.049% noise) while halving the model's size on disk and in memory, and potentially improving GPU processing speed. The reduced memory requirement would make the game playable on many consumer GPUs like my 1080 Ti. https://www.tensorflow.org/lite/performance/post_training_quantization
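If I'm reading the linked docs correctly, float16 conversion is only a couple of lines on top of the standard converter flow (again assuming the model has already been exported as a SavedModel at a placeholder path):

```python
import tensorflow as tf

# Float16 post-training quantization, per the TF Lite docs linked above.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]

tflite_fp16_model = converter.convert()
with open("model_fp16.tflite", "wb") as f:
    f.write(tflite_fp16_model)
```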

I've made a few clumsy attempts to introduce downsampling into my working copy, but I'm not a TensorFlow user and had some trouble understanding how to access the input/output tensor metadata in the checkpoint objects. I'm also aware that TFLite has a different API for inference, so some other changes in the generator might be necessary to run the model.
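For anyone picking this up, the TFLite runtime side is roughly the `tf.lite.Interpreter` flow below. This is just a sketch of the API shape, not the actual generator integration; the model path and dummy input are placeholders:

```python
import numpy as np
import tensorflow as tf

# Load the converted model and inspect its input/output tensor metadata.
interpreter = tf.lite.Interpreter(model_path="model_fp16.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy input matching whatever shape/dtype the converted model expects.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()

logits = interpreter.get_tensor(output_details[0]["index"])
```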

Someone in the Discord pointed to this blogpost about converting: https://planb.nicecupoftea.org/2019/10/26/tensorflow-savemodel-for-tflite/

EvanBalster commented 4 years ago

Ah, looks like somebody successfully quantized the model and implemented it in this branch: https://github.com/cloveranon/Clover-Edition

Suggest merging the relevant functionality?

dyc3 commented 4 years ago

This is a great idea. IMO we should still provide the current 32-bit float model to give users options.

EvanBalster commented 4 years ago

I've tested Clover Edition now and it runs wonderfully on my 8GB consumer-grade GPU. For some reason they distribute a 32-bit model and quantize it at runtime, which seems like a waste of disk and bandwidth...
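If the goal is to avoid shipping full-precision weights, the conversion could presumably be done once and the smaller model distributed instead. A rough sketch, assuming a PyTorch GPT-2 loaded via Hugging Face `transformers` (paths are placeholders, and this is a guess at the general approach, not Clover Edition's actual code):

```python
from transformers import GPT2LMHeadModel

# One-time conversion: cast the weights to fp16 and save them, so users
# download the smaller model instead of converting at runtime.
model = GPT2LMHeadModel.from_pretrained("model_32bit")  # placeholder path
model.half()
model.save_pretrained("model_fp16")  # placeholder output path
```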