pytorch-labs / gpt-fast

Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.

GGUF fp32/fp16 conversion to checkpoint #134


mergennachin commented 3 months ago

Summary:

This only works for fp32 and fp16 tensor types, so it doesn't provide much value on its own yet: `convert_hf_checkpoint.py` can already generate an equivalent `.pth` checkpoint directly, without the GGUF format indirection. However, this PR lays the foundation and validates that the basic fp32 and fp16 paths work correctly. In the future, we will support running the quantized version of the GGUF graph in eager mode.
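For context, a minimal sketch of what the fp16/fp32 conversion path boils down to (this is not the PR's actual code). It assumes the `gguf` pip package's `GGUFReader` API and a hypothetical `map_gguf_name` helper for translating llama.cpp tensor names into the gpt-fast checkpoint scheme; shape/layout fixes and all quantized types are deliberately out of scope here:

```python
import numpy as np
import torch
from gguf import GGUFReader, GGMLQuantizationType


def map_gguf_name(name: str) -> str:
    # Hypothetical: translate llama.cpp names (e.g. "blk.0.attn_q.weight")
    # into the gpt-fast checkpoint naming scheme. Identity for illustration.
    return name


def gguf_to_state_dict(gguf_file: str) -> dict:
    reader = GGUFReader(gguf_file)
    state_dict = {}
    for t in reader.tensors:
        # Only plain fp32/fp16 tensors are handled; anything quantized
        # (Q4_0, Q8_0, ...) would need a dequantization step.
        if t.tensor_type not in (GGMLQuantizationType.F32, GGMLQuantizationType.F16):
            raise ValueError(f"unsupported tensor type {t.tensor_type} for {t.name}")
        # t.data is a numpy view over the file; copy it into a torch tensor.
        # Reshaping/transposing to the exact checkpoint layout is omitted here.
        state_dict[map_gguf_name(t.name)] = torch.from_numpy(np.array(t.data))
    return state_dict


if __name__ == "__main__":
    torch.save(gguf_to_state_dict("ggml-model-f16.gguf"), "model_gguf.pth")
```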

Test Plan:

  1. Setup:

     ```
     pip install gguf
     git clone git@github.com:ggerganov/llama.cpp.git
     python scripts/download.py --repo_id [HF-dir]
     ```
  2. Preparation: convert the existing HF model to fp16, which generates `[HF-dir]/ggml-model-f16.gguf`:

     ```
     python llama.cpp/convert.py [HF-dir] --outtype f16
     ```
  3. Convert the GGUF file to a checkpoint:

     ```
     python scripts/convert_from_gguf.py --gguf_file [HF-dir]/ggml-model-f16.gguf --checkpoint_file [HF-dir]/model_gguf.pth
     ```
  4. Validate that it works:

     ```
     python generate.py --checkpoint_path [HF-dir]/model_gguf.pth --device=cpu --prompt "Hello, my name is" --max_new_tokens 20
     ```
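An optional sanity check beyond the steps above (not part of the test plan as written) would be to diff the GGUF-derived checkpoint against the one produced directly by `convert_hf_checkpoint.py`; the file names below are placeholders and assume the two state dicts share the same keys after name mapping:

```python
import torch

a = torch.load("model.pth", map_location="cpu")       # from convert_hf_checkpoint.py
b = torch.load("model_gguf.pth", map_location="cpu")  # from convert_from_gguf.py

assert a.keys() == b.keys(), "checkpoint key sets differ"
for k in a:
    # fp16 round-tripping through GGUF can introduce small differences,
    # so compare in fp32 with a loose tolerance.
    assert torch.allclose(a[k].float(), b[k].float(), atol=1e-3), f"mismatch in {k}"
print("checkpoints match")
```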
malfet commented 3 months ago

Why import GGUF when one can do the decoding in place using native PyTorch? See https://github.com/malfet/llm_experiments/blob/74a935344fbce5680dbd2dafc7dfd95231303444/run_llama.py#L447
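To illustrate the dependency-free direction malfet is pointing at, here is a small sketch that reads just the GGUF header with the standard library, following the published GGUF spec (magic, version, tensor count, metadata KV count); full tensor decoding, as in the linked run_llama.py, would continue parsing from this point:

```python
import struct


def read_gguf_header(path: str):
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError("not a GGUF file")
        (version,) = struct.unpack("<I", f.read(4))
        if version < 2:
            raise ValueError(f"GGUF v{version} uses 32-bit counts; not handled here")
        # v2+ headers store tensor and metadata KV counts as little-endian uint64.
        n_tensors, n_kv = struct.unpack("<QQ", f.read(16))
    return version, n_tensors, n_kv


if __name__ == "__main__":
    print(read_gguf_header("ggml-model-f16.gguf"))
```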