urimerhav opened 7 months ago
@kashif @srush @radames @co42
Spamming you all since it's been a while. Should I do something else to help triage this issue?
The float8 path will have a major rework in 0.1.0b4.
Also, it currently uses a default dataset (cnndaily) with 512 samples, which might not fit your use case / finetuned model.
Did you try providing a custom dataset for calibration?
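As a rough illustration of what a custom calibration set might look like, here is a minimal sketch that writes prompts drawn from your own fine-tuning distribution to a JSONL file. The JSONL layout, the `"text"` field name, and how the build script would consume the file are all assumptions, not the repo's documented API; check the calibration docs for the actual format.

```python
# Hypothetical sketch: building a small calibration file from prompts that
# match the fine-tuned model's real traffic, instead of the default
# 512-sample cnndaily set. Field names and file format are assumptions.
import json

prompts = [
    "Summarize the following support ticket: ...",
    "Translate the release notes into plain English: ...",
    # ...ideally a few hundred samples drawn from your fine-tuning data
]

with open("calibration.jsonl", "w") as f:
    for prompt in prompts:
        # one JSON object per line, JSON Lines style
        f.write(json.dumps({"text": prompt}) + "\n")
```

The idea is that fp8 range calibration only sees activations representative of your own inputs, rather than news-summarization text.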
Thanks
So it seems like the repo changed in the time since I wrote this. I was talking about a file called llama.py which used to exist, which optimized a saved llama checkpoint - you had it previously under examples. Now I can't even find it.
To further clarify the use case: we've finetuned our own Llama 2 on our own data and want to optimize it for serving. Running the optimization and then generating works fine with fp16, but the outputs become trash (a repeated Chinese character) when using fp8.
We didn't supply a custom dataset, but I can't imagine how, with any dataset that includes English in the input and output, we'd wind up with only Chinese characters as a valid result of a quantization step.
With the new repo code state, I can't even tell how to optimize a Llama 2 model anymore. Did you deprecate support for that? If it's still available, the documentation doesn't hint at it and I can't find it in the code.
Hi!
I have a finetuned Llama 2 and followed example/llama.py. When I build the model in fp16, it works just fine and produces sane results. When we use either --fp8 or --fp8-cache, the results are garbage (the same Chinese character shows on repeat in the completion). For completeness, here's the script I'm calling:
python build_llama.py --max-prompt-length 1548 --max-new-tokens 500 --fp8 --fp8-cache --max-batch-size 1 /var/datamodels/llama-7b /var/data/llama-7b-optimized
Model loading is done using
Generation is done using
My machine is an H100, so the architecture is supported. I'm using the provided Docker image. Driver and CUDA versions:
NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.2
Thanks a lot for an awesome contribution; I hope we can figure this out, as low latency is mission-critical for us.