vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Struggling to get fp8 inference working correctly on 8xL40s #6179

Closed: williambarberjr closed this issue 2 months ago

williambarberjr commented 2 months ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.5
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-187-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA L40S
GPU 1: NVIDIA L40S
GPU 2: NVIDIA L40S
GPU 3: NVIDIA L40S
GPU 4: NVIDIA L40S
GPU 5: NVIDIA L40S
GPU 6: NVIDIA L40S
GPU 7: NVIDIA L40S

Nvidia driver version: 550.90.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             96
On-line CPU(s) list:                0-95
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 9474F 48-Core Processor
CPU family:                         25
Model:                              17
Thread(s) per core:                 1
Core(s) per socket:                 48
Socket(s):                          2
Stepping:                           1
Frequency boost:                    enabled
CPU max MHz:                        3600.0000
CPU min MHz:                        1500.0000
BogoMIPS:                           7199.71
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca flush_l1d
Virtualization:                     AMD-V
L1d cache:                          3 MiB (96 instances)
L1i cache:                          3 MiB (96 instances)
L2 cache:                           96 MiB (96 instances)
L3 cache:                           512 MiB (16 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-47
NUMA node1 CPU(s):                  48-95
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] transformers==4.41.2
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  PIX PXB PXB SYS SYS SYS SYS 0-47    0       N/A
GPU1    PIX  X  PXB PXB SYS SYS SYS SYS 0-47    0       N/A
GPU2    PXB PXB  X  PIX SYS SYS SYS SYS 0-47    0       N/A
GPU3    PXB PXB PIX  X  SYS SYS SYS SYS 0-47    0       N/A
GPU4    SYS SYS SYS SYS  X  PIX PXB PXB 48-95   1       N/A
GPU5    SYS SYS SYS SYS PIX  X  PXB PXB 48-95   1       N/A
GPU6    SYS SYS SYS SYS PXB PXB  X  PIX 48-95   1       N/A
GPU7    SYS SYS SYS SYS PXB PXB PIX  X  48-95   1       N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

How would you like to use vllm

I want to run inference on a fine-tuned version of Llama 3 70B Instruct that I trained, quantized using the same quantization code as neuralmagic/Meta-Llama-3-70B-Instruct-FP8. My exact code was:

import json
import random
import os
from transformers import AutoTokenizer
from huggingface_hub import HfApi
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

# File paths and configuration
jsonl_file = "/vllm-workspace/openAIChatMessagesFormat.jsonl"
pretrained_model_dir = "me/myModelPath"
quantized_model_dir = "me/myModelPath_fp8"

# Initialize Hugging Face API
api = HfApi()

# Create private repository
api.create_repo(
    repo_id=quantized_model_dir,
    private=True,
    exist_ok=True
)

# Load tokenizer and prepare data
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

with open(jsonl_file, 'r') as file:
    data = [json.loads(line) for line in file]

selected_data = random.sample(data, min(200, len(data)))
examples = [tokenizer.apply_chat_template(item, tokenize=False) for item in selected_data]
examples = tokenizer(examples, padding=True, truncation=False, return_tensors="pt").to("cuda")

quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)

# Save quantized model locally
local_save_dir = "./quantized_model"
model.save_quantized(local_save_dir)

I was going to FP8-quantize the KV cache as well (and I did), but I was getting "Cannot use FlashAttention-2 backend for FP8 KV cache" and it was falling back to xFormers for inference, which I thought was the issue, so I re-quantized (without the KV cache quantization) using the above code.

I launch inference with:

python3 -m vllm.entrypoints.openai.api_server \
  --model ./quantized_model \
  --served-model-name me/myModel_fp8 \
  --api-key token-abc123 \
  --max-model-len 8192 \
  --tensor-parallel-size 8 \
  --distributed-executor-backend ray \
  --worker-use-ray \
  --gpu-memory-utilization 0.95 \
  2>&1 | tee vllm_openai_fp8quantization_log.txt

The logs look like this up through the uvicorn server being up:

INFO 07-06 19:39:18 api_server.py:177] vLLM API server version 0.5.0.post1
INFO 07-06 19:39:18 api_server.py:178] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allo...
2024-07-06 19:39:19,842 WARNING utils.py:580 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray ha...
2024-07-06 19:39:19,843 WARNING utils.py:592 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 92.16 to 92.
2024-07-06 19:39:20,999 INFO worker.py:1753 -- Started a local Ray instance.
INFO 07-06 19:39:21 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='./quantized_model', speculative_config=None, tokenizer='./quantized_model...
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-06 19:39:36 utils.py:637] Found nccl from library libnccl.so.2
INFO 07-06 19:39:36 pynccl.py:63] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=64052) INFO 07-06 19:39:36 utils.py:637] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=64052) INFO 07-06 19:39:36 pynccl.py:63] vLLM is using nccl==2.20.5
WARNING 07-06 19:39:37 custom_all_reduce.py:166] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable...
WARNING 07-06 19:39:37 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(RayWorkerWrapper pid=64052) WARNING 07-06 19:39:37 custom_all_reduce.py:166] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. ...
(RayWorkerWrapper pid=64052) WARNING 07-06 19:39:37 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
(RayWorkerWrapper pid=64136) INFO 07-06 19:39:40 model_runner.py:160] Loading model weights took 8.4627 GB
INFO 07-06 19:39:40 model_runner.py:160] Loading model weights took 8.4627 GB
INFO 07-06 19:39:43 distributed_gpu_executor.py:56] # GPU blocks: 51829, # CPU blocks: 6553
(RayWorkerWrapper pid=64634) INFO 07-06 19:39:44 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not...
(RayWorkerWrapper pid=64634) INFO 07-06 19:39:44 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, conside...
(RayWorkerWrapper pid=64634) INFO 07-06 19:39:36 utils.py:637] Found nccl from library libnccl.so.2 [repeated 6x across cluster] (Ray deduplicates logs by defaul...
(RayWorkerWrapper pid=64634) INFO 07-06 19:39:36 pynccl.py:63] vLLM is using nccl==2.20.5 [repeated 6x across cluster]
(RayWorkerWrapper pid=64634) WARNING 07-06 19:39:37 custom_all_reduce.py:166] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. ...
(RayWorkerWrapper pid=64634) WARNING 07-06 19:39:37 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change. [repea...
INFO 07-06 19:39:45 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode,...
INFO 07-06 19:39:45 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or...
(RayWorkerWrapper pid=64052) INFO 07-06 19:40:02 model_runner.py:965] Graph capturing finished in 17 secs.
(RayWorkerWrapper pid=64634) INFO 07-06 19:39:40 model_runner.py:160] Loading model weights took 8.4627 GB [repeated 6x across cluster]
(RayWorkerWrapper pid=64472) INFO 07-06 19:39:45 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not...
(RayWorkerWrapper pid=64472) INFO 07-06 19:39:45 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, conside...
INFO 07-06 19:40:02 model_runner.py:965] Graph capturing finished in 17 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-06 19:40:03 serving_chat.py:92] Using default chat template:
INFO 07-06 19:40:03 serving_chat.py:92] {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_head...
INFO 07-06 19:40:03 serving_chat.py:92]
INFO 07-06 19:40:03 serving_chat.py:92] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% e...
INFO 07-06 19:40:03 serving_chat.py:92]
INFO 07-06 19:40:03 serving_chat.py:92] ' }}{% endif %}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-06 19:40:03 serving_embedding.py:141] embedding_mode is False. Embedding API will not work.
INFO:     Started server process [58592]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

The last, very important detail/clue: my outputs are all "!!!!!!!!!!!", so not coherent. The model I quantized works perfectly well before quantization, so there's likely an issue with the quantization and the way I'm passing the examples, even though I did it exactly like the Neural Magic repo.
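
For reference, here's roughly how I'm hitting the server (minimal sketch; the prompt is just a placeholder, and I'm assuming I call it from the same box, with the endpoint/API key/model name matching the launch command above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")
resp = client.chat.completions.create(
    model="me/myModel_fp8",
    messages=[{"role": "user", "content": "Summarize this page: ..."}],  # placeholder prompt
    max_tokens=256,
)
print(resp.choices[0].message.content)  # currently comes back as "!!!!!!!!!!!"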

With 20 test samples I get ~400 tok/sec from the quantized FP8 model (which generates nonsense), and ~1k tok/sec from the full-precision model (or whatever the defaults are in vLLM) when I just run:

python3 -m vllm.entrypoints.openai.api_server --model me/myL3_70B_Instruct_ft_model --api-key token-abc123 --max-model-len 8192 --tensor-parallel-size 8 --distributed-executor-backend ray --worker-use-ray --gpu-memory-utilization 0.95 2>&1 | tee vllm_openai_log.txt

If you see anything obvious I'm doing wrong, please let me know.

robertgshaw2-neuralmagic commented 2 months ago

Do the generations look okay when running in AutoFP8?

I don't think this is the cause of the issue, but it looks like your model does not have a chat template in its config file, so it is falling back to the default, which the model was not trained with. So if you're using /chat/completions, this will not be ideal.
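
If you want to rule that out, one option is to copy the base Instruct chat template into your checkpoint so vLLM stops falling back to its default (untested sketch; the local path is a placeholder for wherever your quantized model lives):

from transformers import AutoTokenizer

# Copy the Llama 3 Instruct chat template into the quantized checkpoint's
# tokenizer_config.json so vLLM serves with it instead of the default template.
base_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")
ft_tok = AutoTokenizer.from_pretrained("./quantized_model")  # placeholder path
ft_tok.chat_template = base_tok.chat_template
ft_tok.save_pretrained("./quantized_model")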

robertgshaw2-neuralmagic commented 2 months ago

Also - are you able to share the model checkpoint?

comaniac commented 2 months ago

It's likely due to the checkpoint. Since "!" is usually token ID 0 in the tokenizer, the weights may not be loaded correctly.
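
A quick sanity check (sketch, assuming the base Instruct tokenizer) is to confirm what ID 0 decodes to:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")
print(tok.convert_ids_to_tokens(0))  # expected: "!"
print(tok.decode([0]))               # expected: "!"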

williambarberjr commented 2 months ago

Ok, that confirms for me that the next test should be running neuralmagic/Meta-Llama-3-8B-Instruct-FP8. I've started that and am just waiting on the model to download (it'll take a while, as I'm sure you're aware); I'll report back on whether it works and whether the throughput goes up as expected.

Re:

your model does not have a chat template in the config file and is falling back to default, which the model is not trained with

It gives the same:

INFO 07-06 19:40:03 serving_chat.py:92] Using default chat template:
INFO 07-06 19:40:03 serving_chat.py:92] {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_head...
INFO 07-06 19:40:03 serving_chat.py:92]
INFO 07-06 19:40:03 serving_chat.py:92] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% e...
INFO 07-06 19:40:03 serving_chat.py:92]
INFO 07-06 19:40:03 serving_chat.py:92] ' }}{% endif %}

When I use the 70B that's just my LoRA merged onto Llama 3 70B Instruct (which produces the output I'm expecting), I get the exact same notice, but when I pass the exact same inputs through the exact same script, I can tell that the Llama 3 70B Instruct chat template is being applied correctly. Still, I do think there is some kind of chat template issue in my AutoFP8 script. I could try quantizing to FP8 without the examples, but ultimately I'd really like the added precision from using the examples, so I would like to figure out the issue there. Unfortunately I can't publicly share the model I trained, which makes it a little trickier to get help, but I really appreciate the input so far.

robertgshaw2-neuralmagic commented 2 months ago

Thanks. Quick note: are you applying AutoFP8 to the model with the LoRA adapters already merged, or before the merge?

I’ll be back to my computer in a bit and can look more closely once I return

williambarberjr commented 2 months ago

I'm applying AutoFP8 to the merged model, i.e. after the LoRA adapters have been merged in.

TheodorosGalanos commented 2 months ago

Slightly off-topic question: does this work for you in a non-distributed setting?

robertgshaw2-neuralmagic commented 2 months ago

Okay, I just ran neuralmagic/Meta-Llama-3-8B-Instruct-FP8 on L40S with TP=1 and TP=2 and the results look fine. I'm trying the 70B model as well, but I don't think the issue is on the vLLM side; it's more likely in checkpoint creation.

Let me run through an example flow end-to-end and I'll get back to you.

williambarberjr commented 2 months ago

Ok, got my experiment result. Using just neuralmagic/Meta-Llama-3-70B-Instruct-FP8 is indeed faster: it gave me 2127 tok/sec. Not a perfect apples-to-apples comparison, but different enough to confirm that the other parts of the launch command etc. are correct. Let me know if you can tell what's wrong with my AutoFP8 quantization code; it's almost definitely an issue with the chat template.

robertgshaw2-neuralmagic commented 2 months ago

Thanks @williambarberjr - a very good debugging strategy is to detokenize an example as you pass it to the model and make sure it looks right.

selected_data = random.sample(data, min(200, len(data)))
examples = [tokenizer.apply_chat_template(item, tokenize=False) for item in selected_data]
examples = tokenizer(examples, padding=True, truncation=False, return_tensors="pt").to("cuda")

print(tokenizer.decode(examples["input_ids"][0]))
# ^ result of this will be very illuminating

robertgshaw2-neuralmagic commented 2 months ago

Can you post what you find here?

williambarberjr commented 2 months ago

Yep, I've learned to do that and did it before running the quantization code, so I have that already. Looking at it now, it looks like I've got a double <|begin_of_text|> problem that I overlooked? Yeah, ok, I think it's the double <|begin_of_text|>, the missing <|end_of_text|>, and that <|end_of_text|> should be the padding token rather than <|eot_id|>:

<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert at reviewing website text in markdown format, and responding with a long paragraph that includes 100% of the information the website provides about the product(s) and/or service(s) the business is offering while dropping any marketing speak. You write in a dispassionate, factual tone and again, focus on the product(s) and/or service(s) the business offers.<|eot_id|><|start_header_id|>user<|end_header_id|>

<content source_url="http://custom-chrome.co.uk/">
CUSTOM CHROME RACING [/]
- Home [/]
- ABOUT US [/about-us.html]
- Services [/services.html]
- BUSINESS HOURS [/business-hours.html]
- Contact [/contact.html]
- SHOP [http://www.cherrybomb.co.uk/]
- GALLERY [/gallery.html]
- THE BEND SHOP [http://www.thebendshop.co.uk/]
Home of the Cherry Bomb®
## Exhaust manufacturers
&
fitting centre
Cost effective exhaust repairs for any make of vehicle
PLEASE CALL US FOR A QUOTE 
TEL: (024) 76 387 808

CLICK THE LINKS BELOW TO GO TO OUR SHOPS 
[www.cherrybomb.co.uk](https://www.cherrybomb.co.uk/) [http://www.cherrybomb.co.uk/]

[www.thebendshop.co.uk](https://www.thebendshop.co.uk/) [http://www.thebendshop.co.uk/]

TEL: (024) 76 387 808
EMAIL: SALES@CUSTOM-CHROME.CO.UK

© COPYRIGHT 2023,CUSTOM CHROME LTD
ALL RIGHTS RESERVED
Site powered by Weebly. Managed by netnerd.com [https://netnerd.com/]
- Home [/]
- ABOUT US [/about-us.html]
- Services [/services.html]
- BUSINESS HOURS [/business-hours.html]
- Contact [/contact.html]
- SHOP [http://www.cherrybomb.co.uk/]
- GALLERY [/gallery.html]
- THE BEND SHOP [http://www.thebendshop.co.uk/]

</content><|eot_id|><|start_header_id|>assistant<|end_header_id|>

Custom Chrome Racing is an exhaust manufacturer and fitting center located in Coventry, West Midlands, United Kingdom. The company offers exhaust repairs for any make of vehicle. They operate two additional shops: Cherry Bomb, which sells exhaust products, and The Bend Shop. Custom Chrome Racing provides quotes for their services upon request.<|eot_id|>

The <|eot_id|> repeats a lot at the end; I didn't copy-paste all of that here.
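
For reference, the duplication seems to come from the second tokenizer call adding special tokens on top of the template's bos_token. Something like this should show a single <|begin_of_text|> and pad with <|end_of_text|> instead (sketch, not tested yet):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct", use_fast=True)
tokenizer.pad_token = "<|end_of_text|>"  # pad with <|end_of_text|> rather than <|eot_id|>

# selected_data is the same list of chat messages sampled in the script above
examples_text = [tokenizer.apply_chat_template(item, tokenize=False) for item in selected_data]
batch = tokenizer(
    examples_text,
    padding=True,
    truncation=False,
    add_special_tokens=False,  # apply_chat_template already inserted <|begin_of_text|>
    return_tensors="pt",
).to("cuda")
print(tokenizer.decode(batch["input_ids"][0]))  # should now start with a single <|begin_of_text|>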

williambarberjr commented 2 months ago

Edit: I ran the revised code below and put the quantized model through my quick test; it's still generating all "!!!!!!" and is still slow.

Ok, I revised the code to build the prompt template manually, since the tokenizer call in this line: examples = tokenizer(examples, padding=True, truncation=False, return_tensors="pt").to("cuda") adds <|begin_of_text|> to the beginning and pads the end with <|eot_id|>.

This is now my code:

# Load tokenizer and prepare data
tokenizer_model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_id, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

with open(jsonl_file, 'r') as file:
    data = [json.loads(line) for line in file]

selected_data = random.sample(data, min(200, len(data)))
system_prompt = "You are an expert at reviewing website text in markdown format, and responding with a long paragraph that includes 100% of the information the website provides about the product(s) and/or service(s) the business is offering while dropping any marketing speak. You write in a dispassionate, factual tone and again, focus on the product(s) and/or service(s) the business offers."
examples = [f"""<|start_header_id|>system<|end_header_id|>\n\n{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{example[1]['content']}\n\nConvert this website's content (provided to you in markdown format) into one long paragraph that includes all of the information the website provides about the products and/or services the business is offering. Replace any marketing tone or language with a dispassionate factual tone and again, focus on the product(s) and/or service(s) the business offers.&^%$<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n{example[2]['content']}""" for example in selected_data]
examples = tokenizer(examples, padding=True, truncation=False, return_tensors="pt").to("cuda")

quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)

# Save quantized model locally
local_save_dir = "./quantized_model_2"
model.save_quantized(local_save_dir)

And the output of peeking at the first example:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert at reviewing website text in markdown format, and responding with a long paragraph that includes 100% of the information the website provides about the product(s) and/or service(s) the business is offering while dropping any marketing speak. You write in a dispassionate, factual tone and again, focus on the product(s) and/or service(s) the business offers.<|eot_id|><|start_header_id|>user<|end_header_id|>

<content source_url="https://www.weilovehealth.com/">
Skip to content
## Just added to your cart
###
Qty:
View cart () [/cart]
Continue shopping

Submit
Close search
#![Weilovehealth]![Weilovehealth] [/]
- Home [/]
- Men's Sexy Underwear
- Vindkan underwear [/collections/vidnkan-underwear]
- DIETARY SUPPLEMENT
- FuXion [/collections/fuxion]
- Prunex 1 [/products/fuxion-prunex-1-weight-loss-detox-tea-instant-w-fiber-blend-for-colon-cleanse-relieve-symptoms-of-constipation-liberate-the-transit-in-digestive-system-5-grams-per-serving-7-sticks-for-1-week-supply]
- Thermo T3 [/products/fuxion-thermo-t3-contains-raspberry-ketones-ketogenic-supplement-exogenous-keto-drink-mix-for-natural-ketosis-transform-fat-into-energy-increase-stamina-for-workout-28-sachets]
- NOCARB-T [/products/fuxion-nocarb-t-instant-drink-mix-w-soluble-fiber-support-stable-blood-sugar-after-rich-dinner-anti-absorbe-glucose-cholesterol-lowering-level-accelerate-metabolism-1-pouch-of-28-sachets]
- VITA XTRA T+ [/products/fast-acting-energizing-tea-by-fuxion-vita-xtra-t-mix-all-natural-herbs-fruits-for-natural-energy-purple-corn-28-sachets]
- GANO+ CAPPUCCINO [/products/fuxion-gano-cappuccino-sugar-free-instant-coffee-improve-your-health-5g-stick-28-sachets]
- FLORA LIV [/products/fuxion-flora-liv-probiotics-10-billion-cfu-essential-multivitamin-and-minerals-28-sachets]
- ON [/products/fuxion-on-delicious-functional-drink-to-active-your-mind-to-be-more-alert-both-work-synergistically-w-vitamin-c-dha-rna-minerals-essential-oils-and-amino-acids-on-28-sticks]
- PASSION [/products/fuxion-passion-increase-your-energy-and-libido-levels-thanks-to-l-arginine-a-powerful-amino-acid-pleasant-invigorating-guarana-flavored-drink-w-natural-anti-oxidantspassion-28-sticks]
- Beauty In [/products/fuxion-beauty-in-improve-the-dermis-structure-w-more-collagen-and-elastin-fibers-bioactive-coq10-antioxidant-combination-for-anti-agingbeauty-in-28-sticks]
- VISALUS [/products/visalus-vi-shape-nutritional-shake-mix-sweet-cream-flavor-best-protein-powder]
- Disposable Face Mask [/collections/mask]
- Contact us [/pages/contact-us]
Search Log in [/account/login] Cart
0 items
[/cart]
- Home [/]
- Men's Sexy Underwear

- Men's Sexy Underwear Menu
-

Men's Sexy Underwear
- Vindkan underwear [/collections/vidnkan-underwear]
- DIETARY SUPPLEMENT

- DIETARY SUPPLEMENT Menu
-

DIETARY SUPPLEMENT
- FuXion

- FuXion Menu
-

FuXion [/collections/fuxion]
- Prunex 1 [/products/fuxion-prunex-1-weight-loss-detox-tea-instant-w-fiber-blend-for-colon-cleanse-relieve-symptoms-of-constipation-liberate-the-transit-in-digestive-system-5-grams-per-serving-7-sticks-for-1-week-supply]
- Thermo T3 [/products/fuxion-thermo-t3-contains-raspberry-ketones-ketogenic-supplement-exogenous-keto-drink-mix-for-natural-ketosis-transform-fat-into-energy-increase-stamina-for-workout-28-sachets]
- NOCARB-T [/products/fuxion-nocarb-t-instant-drink-mix-w-soluble-fiber-support-stable-blood-sugar-after-rich-dinner-anti-absorbe-glucose-cholesterol-lowering-level-accelerate-metabolism-1-pouch-of-28-sachets]
- VITA XTRA T+ [/products/fast-acting-energizing-tea-by-fuxion-vita-xtra-t-mix-all-natural-herbs-fruits-for-natural-energy-purple-corn-28-sachets]
- GANO+ CAPPUCCINO [/products/fuxion-gano-cappuccino-sugar-free-instant-coffee-improve-your-health-5g-stick-28-sachets]
- FLORA LIV [/products/fuxion-flora-liv-probiotics-10-billion-cfu-essential-multivitamin-and-minerals-28-sachets]
- ON [/products/fuxion-on-delicious-functional-drink-to-active-your-mind-to-be-more-alert-both-work-synergistically-w-vitamin-c-dha-rna-minerals-essential-oils-and-amino-acids-on-28-sticks]
- PASSION [/products/fuxion-passion-increase-your-energy-and-libido-levels-thanks-to-l-arginine-a-powerful-amino-acid-pleasant-invigorating-guarana-flavored-drink-w-natural-anti-oxidantspassion-28-sticks]
- Beauty In [/products/fuxion-beauty-in-improve-the-dermis-structure-w-more-collagen-and-elastin-fibers-bioactive-coq10-antioxidant-combination-for-anti-agingbeauty-in-28-sticks]
- VISALUS [/products/visalus-vi-shape-nutritional-shake-mix-sweet-cream-flavor-best-protein-powder]
- Disposable Face Mask [/collections/mask]
- Contact us [/pages/contact-us]

![Image]

![Image]
![Image]
### FuXion Prunex 1 Weight Loss Detox Tea Instant w. Fiber Blend For Colon Cleanse
FuXion Prunex 1 [/products/fuxion-prunex-1-fruit-herbal-tea-for-28-day-colon-detox-cleanse-effectively-improve-bowel-movements-reliable-overnight-relief-from-constipation-stay-comfortable-at-bathroom1-pouch-of-28-sachets]
![Image]
![Image]
### FuXion Nocarb-T Instant Drink Mix w. Soluble Fiber, Support Stable Blood Sugar After Rich Dinner, Anti-Absorbe Glucose,Cholesterol Lowering Level, Accelerate Metabolism-1 Pouch of 28 Sachets
FuXion Nocarb-T [/products/fuxion-nocarb-t-instant-drink-mix-w-soluble-fiber-support-stable-blood-sugar-after-rich-dinner-anti-absorbe-glucose-cholesterol-lowering-level-accelerate-metabolism-1-pouch-of-28-sachets]
![Image]
![Image]
### FuXion Thermo T3 Contains Raspberry Ketones - Ketogenic Supplement, Exogenous Keto Drink Mix for Natural Ketosis - Transform Fat into Energy & Increase Stamina for Workout (28 Sachets)
The Thermo T3 [/products/fuxion-thermo-t3-contains-raspberry-ketones-ketogenic-supplement-exogenous-keto-drink-mix-for-natural-ketosis-transform-fat-into-energy-increase-stamina-for-workout-28-sachets]
![Image]
![Image]
### Fast Acting Energizing Tea by Fuxion Vita Xtra T-Mix All Natural Herbs&Fruits for Natural Energy (Purple Corn, 28 Sachets)

Fuxion Vita Xtra T [/products/fast-acting-energizing-tea-by-fuxion-vita-xtra-t-mix-all-natural-herbs-fruits-for-natural-energy-purple-corn-28-sachets]
## Featured collection
-
2020 VINDKAN Men's pennis Enlargement Underwears Magnetic Micromodal Trunks Therapy Boxer Briefs [/collections/vidnkan-underwear/products/2020-vindkan-mens-pennis-enlargement-underwears-magnetic-micromodal-trunks-therapy-boxer-briefs]
![Image]

![Image]
2020 VINDKAN Men's pennis Enlargement Underwears Magnetic Micromodal Trunks Therapy Boxer Briefs
Regular price $18.99
Sale price $18.99
Regular price $29.99
Unit price /per 
Sale Sold out
-
Vi n d K an 2020 VK Men's pennis Enlargement Underwears Magnetic Micromodal Trunks Therapy Boxer Briefs [/collections/vidnkan-underwear/products/vi-n-d-k-an-2020-vk-mens-pennis-enlargement-underwears-magnetic-micromodal-trunks-therapy-boxer-briefs]
![Image]

![Image]
Vi n d K an 2020 VK Men's pennis Enlargement Underwears Magnetic Micromodal Trunks Therapy Boxer Briefs
Regular price $19.99
Sale price $19.99
Regular price
Unit price /per 
Sale Sold out
-
2017 VKWEIKU Men's pennis Enlargement Underwears Magnetic Micromodal Trunks Therapy Golden Side Sexy Briefs [/collections/vidnkan-underwear/products/2017-vkweiku-mens-pennis-enlargement-underwears-magnetic-micromodal-trunks-therapy-golden-side-sexy-briefs]
![Image]

![Image]
2017 VKWEIKU Men's pennis Enlargement Underwears Magnetic Micromodal Trunks Therapy Golden Side Sexy Briefs
Regular price $19.99
Sale price $19.99
Regular price $29.99
Unit price /per 
Sale Sold out

/products/fuxion-thermo-t3-contains-raspberry-ketones-ketogenic-supplement-exogenous-keto-drink-mix-for-natural-ketosis-transform-fat-into-energy-increase-stamina-for-workout-28-sachets

/products/fuxion-nocarb-t-instant-drink-mix-w-soluble-fiber-support-stable-blood-sugar-after-rich-dinner-anti-absorbe-glucose-cholesterol-lowering-level-accelerate-metabolism-1-pouch-of-28-sachets

/products/fuxion-prunex-1-fruit-herbal-tea-for-28-day-colon-detox-cleanse-effectively-improve-bowel-movements-reliable-overnight-relief-from-constipation-stay-comfortable-at-bathroom1-pouch-of-28-sachets

Quick links
- Search [/search]
- NICE UNDERWEAR IN EBAY [https://www.ebay.com/itm/313144746195]
- Contact us [/pages/contact-us]
- Terms of Service [/policies/terms-of-service]
- Refund policy [/policies/refund-policy]
Newsletter
Subscribe

----------------------------------------

Payment methods
- Amazon
- American Express
- Apple Pay
- Diners Club
- Discover
- Meta Pay
- Google Pay
- Mastercard
- PayPal
- Shop Pay
- Venmo
- Visa
© 2024, Weilovehealth [/] [https://www.shopify.com?utm_campaign=poweredby&utm_medium=shopify&utm_source=onlinestore](https://www.shopify.com/?utm_campaign=poweredby&utm_medium=shopify&utm_source=onlinestore)
Payment methods
- Amazon
- American Express
- Apple Pay
- Diners Club
- Discover
- Meta Pay
- Google Pay
- Mastercard
- PayPal
- Shop Pay
- Venmo
- Visa
© 2024, Weilovehealth [/] [https://www.shopify.com?utm_campaign=poweredby&utm_medium=shopify&utm_source=onlinestore](https://www.shopify.com/?utm_campaign=poweredby&utm_medium=shopify&utm_source=onlinestore)
Use left/right arrows to navigate the slideshow or swipe left/right if using a mobile device
- Choosing a selection results in a full page refresh.
- Press the space key then arrow keys to make a selection.

- Opens in a new window.
- Opens external website.
- Opens external website in a new window.

</content>

Convert this website's content (provided to you in markdown format) into one long paragraph that includes all of the information the website provides about the products and/or services the business is offering. Replace any marketing tone or language with a dispassionate factual tone and again, focus on the product(s) and/or service(s) the business offers.&^%$<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Weilovehealth is an online retailer offering a range of products including men's sexy underwear, dietary supplements, and disposable face masks. Their men's underwear collection includes Vindkan underwear, which features magnetic micromodal trunks therapy boxer briefs designed for penis enlargement. Specific products in this collection include the 2020 VINDKAN Men's Penis Enlargement Underwears Magnetic Micromodal Trunks Therapy Boxer Briefs, priced at $18.99, the Vi n d K an 2020 VK Men's Penis Enlargement Underwears Magnetic Micromodal Trunks Therapy Boxer Briefs, priced at $19.99, and the 2017 VKWEIKU Men's Penis Enlargement Underwears Magnetic Micromodal Trunks Therapy Golden Side Sexy Briefs, also priced at $19.99. The dietary supplement range is branded as FuXion and includes several products. FuXion Prunex 1 is a weight loss detox tea containing a fiber blend for colon cleanse, available in 7 sticks for a 1-week supply, with each serving containing 5 grams. FuXion Thermo T3 is a ketogenic supplement containing raspberry ketones, designed to induce natural ketosis and increase energy, available in 28 sachets. FuXion Nocarb-T is an instant drink mix with soluble fiber, intended to support stable blood sugar levels after rich dinners, anti-absorb glucose, and cholesterol, and accelerate metabolism, available in 1 pouch of 28 sachets. FuXion Vita Xtra T+ is a fast-acting energizing tea made from natural herbs and fruits, including purple corn, available in 28 sachets. FuXion GANO+ CAPPUCCINO is a sugar-free instant coffee, available in 28 sachets. FuXion FLORA LIV is a probiotic supplement containing 10 billion CFU, essential multivitamins, and minerals, available in 28 sachets. FuXion ON is a functional drink designed to enhance mental alertness, containing vitamin C, DHA, RNA, minerals, essential oils, and amino acids, available in 28 sticks. FuXion PASSION is an energy and libido booster containing L-arginine and natural antioxidants, available in 28 sticks. FuXion Beauty In is a supplement intended to improve dermis structure with collagen, elastin fibers, bioactive CoQ10, and antioxidants for anti-aging, available in 28 sticks. FuXion VISALUS is a nutritional shake mix available in sweet cream flavor. The company also offers disposable face masks. Weilovehealth accepts various payment methods including Amazon, American Express, Apple Pay, Diners Club, Discover, Meta Pay, Google Pay, Mastercard, PayPal, Shop Pay, Venmo, and Visa. The website features a search function and a newsletter subscription option.<|eot_id|><|eot_id|><|eot_id|>

Again, the <|eot_id|> is repeated many times at the end. Does that look correct to you?

Also, I ran a test on neuralmagic/Meta-Llama-3-70B-Instruct-FP8-KV and got 2189.75 tokens/second, a tiny gain over the 2127 from before. Again, a very imperfect test, but does it make sense that the throughput gain from adding KV cache quantization would be small?

williambarberjr commented 2 months ago

Ok, I tried running exactly the code you have here: https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py

Copied here for ref:

from datasets import load_dataset
from transformers import AutoTokenizer

from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")

quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

model = AutoFP8ForCausalLM.from_pretrained(
    pretrained_model_dir, quantize_config=quantize_config
)
model.quantize(examples)
model.save_quantized(quantized_model_dir)

Changing only the pretrained_model_dir so it points to my fine-tuned model (LoRA merged back onto Llama 3 70B Instruct):

model = AutoFP8ForCausalLM.from_pretrained(
    pretrained_model_dir, quantize_config=quantize_config
)

And it also produces the same chat template issues and a version of Llama 3 70B Instruct that only generates "!!!!!!!!!!!!!!!!!" with my prompt. Whatever the issue is with this code, it didn't seem to be resolved when I made what I thought were the correct adjustments to the chat template. Here's the official chat template reference again: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/

williambarberjr commented 2 months ago

Ok, I can officially stop blowing up your inboxes now. I got it fixed. This is a lot more code than is probably needed, but I pulled it from the official Llama 3 repo and made a few small changes until the resulting chat template looked correct. One of the bigger gotchas was that EOS needed to be manually set to <|end_of_text|>. The official Llama 3 docs (https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/) don't discuss this, but it's standard practice when fine-tuning Llama 3 with Axolotl, so I guessed that's how OpenPipe set up their config, and that did the trick. I also had to make a modification to prevent getting two <|begin_of_text|> tokens at the start. At any rate, the code below returns the correct output and runs significantly faster (>1400 tok/sec) on my setup. Thanks again for your help.

import json
import random
import os
from transformers import AutoTokenizer
from huggingface_hub import HfApi
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
from typing import List, Dict

# File paths and configuration
jsonl_file = "/vllm-workspace/openAIChatMessagesFormat.jsonl"
pretrained_model_dir = "me/MyFTModel"
quantized_model_dir = "me/MyFTModel_fp8_kv"

# Initialize Hugging Face API
api = HfApi()

# Load tokenizer and prepare data
tokenizer_model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_id, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# Load and select data
with open(jsonl_file, 'r') as file:
    data = [json.loads(line) for line in file]
selected_data = random.sample(data, min(300, len(data)))

# Define system prompt
system_prompt = "You are an expert at reviewing website text in markdown format, and responding with a long paragraph that includes 100% of the information the website provides about the product(s) and/or service(s) the business is offering while dropping any marketing speak. You write in a dispassionate, factual tone and again, focus on the product(s) and/or service(s) the business offers."

# Custom ChatFormat class based on the official library
class ChatFormat:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def encode_header(self, role: str) -> List[int]:
        tokens = []
        tokens.append(self.tokenizer.convert_tokens_to_ids("<|start_header_id|>"))
        tokens.extend(self.tokenizer.encode(role, add_special_tokens=False))
        tokens.append(self.tokenizer.convert_tokens_to_ids("<|end_header_id|>"))
        tokens.extend(self.tokenizer.encode("\n\n", add_special_tokens=False))
        return tokens

    def encode_message(self, message: Dict[str, str]) -> List[int]:
        tokens = self.encode_header(message["role"])
        tokens.extend(self.tokenizer.encode(message["content"].strip(), add_special_tokens=False))
        tokens.append(self.tokenizer.convert_tokens_to_ids("<|eot_id|>"))
        return tokens

    def encode_dialog_prompt(self, dialog: List[Dict[str, str]]) -> List[int]:
        tokens = []
        # tokens.append(self.tokenizer.convert_tokens_to_ids("<|begin_of_text|>"))
        for message in dialog:
            tokens.extend(self.encode_message(message))
        # Add the start of an assistant message for the model to complete
        tokens.extend(self.encode_header("assistant"))
        return tokens

# Initialize ChatFormat
chat_format = ChatFormat(tokenizer)

examples = []
for example in selected_data:
    dialog = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{example[1]['content']}\n\nConvert this website's content (provided to you in markdown format) into one long paragraph that includes all of the information the website provides about the products and/or services the business is offering. Replace any marketing tone or language with a dispassionate factual tone and again, focus on the product(s) and/or service(s) the business offers.&^%$"},
        {"role": "assistant", "content": example[2]['content']}
    ]

    # Instead of tokenizing, we'll just format the dialog with special tokens as strings
    formatted_dialog = ""
    for message in dialog:
        formatted_dialog += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n{message['content']}<|eot_id|>"

    # Add the start of an assistant message for the model to complete
    formatted_dialog += "<|start_header_id|>assistant<|end_header_id|>\n\n"

    examples.append(formatted_dialog)

# Now tokenize the formatted examples
tokenizer_model_dir = "meta-llama/Meta-Llama-3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_dir, use_fast=True)
tokenizer.pad_token = '<|end_of_text|>'

# Tokenize the examples
tokenized_examples = tokenizer(examples[:100], padding=True, truncation=False, return_tensors="pt").to("cuda")

quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",
    ignore_patterns=["re:.*lm_head"],
    kv_cache_quant_targets=("k_proj", "v_proj"),
)

model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(tokenized_examples)

# Save quantized model locally
local_save_dir = "./quantized_model_2"
model.save_quantized(local_save_dir)
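
For completeness, here's the quick smoke test I run on the new checkpoint before putting the API server back up (sketch; the prompt is a throwaway, and kv_cache_dtype="fp8" is my assumption for picking up the checkpoint's KV cache scales):

from vllm import LLM, SamplingParams

llm = LLM(
    model="./quantized_model_2",
    max_model_len=8192,
    tensor_parallel_size=8,
    kv_cache_dtype="fp8",  # assumption: use the FP8 KV cache scales baked into the checkpoint
)
prompt = "<|start_header_id|>user<|end_header_id|>\n\nSay hello.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
out = llm.generate([prompt], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)  # should no longer be "!!!!!!!!!!!"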

williambarberjr commented 2 months ago

Thank you!

robertgshaw2-neuralmagic commented 2 months ago

No problem. In general it seems that quantizing is sensitive to the pad token choice. We are about to release vllm-project/llm-compressor, which handles this by masking out the pad token. Thanks!
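
Roughly, the idea is to compute the static activation scale only over non-padded positions, e.g. (conceptual sketch, not the actual llm-compressor API):

import torch

def static_fp8_scale(activations: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Conceptual sketch: per-tensor FP8 scale from calibration activations,
    ignoring padded positions so the pad token can't skew the max-abs statistic.

    activations:    [batch, seq, hidden]
    attention_mask: [batch, seq], 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).bool()           # [batch, seq, 1]
    absmax = activations.masked_fill(~mask, 0.0).abs().amax()
    return absmax / torch.finfo(torch.float8_e4m3fn).max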