Do the generations look okay when running in AutoFP8?
I don't think this is the cause of the issue, but it looks like your model does not have a chat template in its config file and is falling back to the default, which the model was not trained with. So if you're using /chat/completions, this will not be ideal.
Also - are you able to share the model checkpoint?
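A quick way to check that is to inspect the tokenizer (a minimal sketch; the path is a placeholder):
from transformers import AutoTokenizer

# Placeholder path: point this at the merged fine-tuned checkpoint
tok = AutoTokenizer.from_pretrained("/path/to/merged-model")
# None here means vLLM's /chat/completions will fall back to its default template
print(tok.chat_template)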
It's likely due to the checkpoint. Since "!" is usually token ID 0 in the tokenizer, the weights may not be loaded correctly.
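For reference, this is easy to confirm from the tokenizer (sketch):
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")
# ID 0 decodes to "!" in the Llama 3 vocabulary, so zeroed or mis-loaded
# weights tend to surface as long runs of "!"
print(tok.convert_ids_to_tokens(0))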
Ok, that confirms for me that the next test should be running neuralmagic/Meta-Llama-3-8B-Instruct-FP8
- so I've started that and am just waiting on the model to download (it'll take a while, as I'm sure you're aware). I'll report back on whether it works and whether the throughput goes up as expected.
Re:
your model does not have a chat template in the config file and is falling back to default, which the model is not trained with
It gives the same:
INFO 07-06 19:40:03 serving_chat.py:92] Using default chat template:
INFO 07-06 19:40:03 serving_chat.py:92] {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>
INFO 07-06 19:40:03 serving_chat.py:92]
INFO 07-06 19:40:03 serving_chat.py:92] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>
INFO 07-06 19:40:03 serving_chat.py:92]
INFO 07-06 19:40:03 serving_chat.py:92] ' }}{% endif %}
When I use the 70B that's just my LoRA merged onto L3 70B Instruct (which produces the output I'm expecting), I get the exact same notice, but when I pass the exact same inputs through the exact same script, I can tell the L3 70B Instruct chat template is being applied correctly. Still, I do think there's some kind of chat-template issue in my AutoFP8 script. I could try quantizing to FP8 without the examples, but ultimately I'd really like the added precision from using them, so I want to figure out the issue there. Unfortunately I can't publicly share the model I trained, which makes it a little trickier to get help, but I really appreciate the input so far.
Thanks. Quick note - are you applying AutoFP8 to the model with the LoRA adapters already merged, or before the merge?
I’ll be back to my computer in a bit and can look more closely once I return
I'm applying AutoFP8 to the model with the merged LoRA adapters, after the merge.
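(The merge itself is the standard peft flow - a sketch, with a placeholder adapter ID:)
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct", torch_dtype=torch.bfloat16
)
# Placeholder adapter repo; merge the LoRA weights into the base model
merged = PeftModel.from_pretrained(base, "me/my-lora-adapter").merge_and_unload()
merged.save_pretrained("./merged-model")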
Slightly off-topic question: does this work for you in a non-distributed setting?
Okay, I just ran neuralmagic/Meta-Llama-3-8B-Instruct-FP8 on an L40S
at TP=1 and TP=2, and the results look fine. Trying again with the 70B model as well, but I don't think the issue is on the vLLM side; it's more likely in checkpoint creation.
Let me run through an example flow E2E and I'll get back to you.
Ok, got my experiment result. Using just neuralmagic/Meta-Llama-3-70B-Instruct-FP8
is indeed faster - it gave me 2127 tok/sec. Not a perfect apples-to-apples comparison, but different enough to confirm that the other parts of the launch command etc. are correct. Let me know if you can tell what's wrong with my AutoFP8 quantization code - it's almost certainly an issue with the chat template.
Thanks @williambarberjr - a very good debugging strategy is to detokenize an example as you pass it to the model and make sure it looks right.
selected_data = random.sample(data, min(200, len(data)))
examples = [tokenizer.apply_chat_template(item, tokenize=False) for item in selected_data]
examples = tokenizer(examples, padding=True, truncation=False, return_tensors="pt").to("cuda")
print(tokenizer.batch_decode(examples["input_ids"])[0])  # index the BatchEncoding by key, not position
# ^ result of this will be very illuminating
Can you post what you find here?
Yep, I have learned to do that, and I did it before running the quantization code, so I have that already. Looking at it now, it looks like I've got a double <|begin_of_text|> problem that I overlooked? Yeah, OK - I think it's the double <|begin_of_text|>, no <|end_of_text|>, and the fact that <|end_of_text|> (not <|eot_id|>) should be the padding token:
<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are an expert at reviewing website text in markdown format, and responding with a long paragraph that includes 100% of the information the website provides about the product(s) and/or service(s) the business is offering while dropping any marketing speak. You write in a dispassionate, factual tone and again, focus on the product(s) and/or service(s) the business offers.<|eot_id|><|start_header_id|>user<|end_header_id|>
<content source_url="http://custom-chrome.co.uk/">
CUSTOM CHROME RACING [/]
- Home [/]
- ABOUT US [/about-us.html]
- Services [/services.html]
- BUSINESS HOURS [/business-hours.html]
- Contact [/contact.html]
- SHOP [http://www.cherrybomb.co.uk/]
- GALLERY [/gallery.html]
- THE BEND SHOP [http://www.thebendshop.co.uk/]
Home of the Cherry Bomb®
## Exhaust manufacturers
&
fitting centre
Cost effective exhaust repairs for any make of vehicle
PLEASE CALL US FOR A QUOTE
TEL: (024) 76 387 808
CLICK THE LINKS BELOW TO GO TO OUR SHOPS
[www.cherrybomb.co.uk](https://www.cherrybomb.co.uk/) [http://www.cherrybomb.co.uk/]
[www.thebendshop.co.uk](https://www.thebendshop.co.uk/) [http://www.thebendshop.co.uk/]
TEL: (024) 76 387 808
EMAIL: SALES@CUSTOM-CHROME.CO.UK
© COPYRIGHT 2023,CUSTOM CHROME LTD
ALL RIGHTS RESERVED
Site powered by Weebly. Managed by netnerd.com [https://netnerd.com/]
- Home [/]
- ABOUT US [/about-us.html]
- Services [/services.html]
- BUSINESS HOURS [/business-hours.html]
- Contact [/contact.html]
- SHOP [http://www.cherrybomb.co.uk/]
- GALLERY [/gallery.html]
- THE BEND SHOP [http://www.thebendshop.co.uk/]
</content><|eot_id|><|start_header_id|>assistant<|end_header_id|>
Custom Chrome Racing is an exhaust manufacturer and fitting center located in Coventry, West Midlands, United Kingdom. The company offers exhaust repairs for any make of vehicle. They operate two additional shops: Cherry Bomb, which sells exhaust products, and The Bend Shop. Custom Chrome Racing provides quotes for their services upon request.<|eot_id|>
The <|eot_id|> repeats many times at the end; I didn't copy-paste all of that here.
Edit: I ran the revised code below and put the quantized model through my quick test - it's still generating all "!!!!!!" and is still slow.
Ok, I revised the code to manually create the prompt template, in light of the fact that the tokenizer in this line:
examples = tokenizer(examples, padding=True, truncation=False, return_tensors="pt").to("cuda")
adds <|begin_of_text|> to the beginning (add_special_tokens defaults to True) and pads the end with repeated <|eot_id|> (because pad_token was set to eos_token).
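A minimal sketch of the double-BOS effect, using the same tokenizer:
text = "<|begin_of_text|>hello"  # template output that already contains bos_token
print(tokenizer(text)["input_ids"][:2])
# -> [128000, 128000]: the tokenizer prepends a second <|begin_of_text|> by default
print(tokenizer(text, add_special_tokens=False)["input_ids"][:2])
# -> only the BOS that was already in the string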
This is now my code:
import json
import random
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

# jsonl_file and pretrained_model_dir are set earlier in my script
# Load tokenizer and prepare data
tokenizer_model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_id, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

with open(jsonl_file, 'r') as file:
    data = [json.loads(line) for line in file]
selected_data = random.sample(data, min(200, len(data)))

system_prompt = "You are an expert at reviewing website text in markdown format, and responding with a long paragraph that includes 100% of the information the website provides about the product(s) and/or service(s) the business is offering while dropping any marketing speak. You write in a dispassionate, factual tone and again, focus on the product(s) and/or service(s) the business offers."
examples = [f"""<|start_header_id|>system<|end_header_id|>\n\n{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{example[1]['content']}\n\nConvert this website's content (provided to you in markdown format) into one long paragraph that includes all of the information the website provides about the products and/or services the business is offering. Replace any marketing tone or language with a dispassionate factual tone and again, focus on the product(s) and/or service(s) the business offers.&^%$<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n{example[2]['content']}""" for example in selected_data]
examples = tokenizer(examples, padding=True, truncation=False, return_tensors="pt").to("cuda")

quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)

# Save quantized model locally
local_save_dir = "./quantized_model_2"
model.save_quantized(local_save_dir)
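One way to peek at the first tokenized example (a minimal sketch using the batch from above):
# Decode the first calibration example back to text to inspect the template
print(tokenizer.decode(examples["input_ids"][0]))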
And the output of peeking at the first example:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are an expert at reviewing website text in markdown format, and responding with a long paragraph that includes 100% of the information the website provides about the product(s) and/or service(s) the business is offering while dropping any marketing speak. You write in a dispassionate, factual tone and again, focus on the product(s) and/or service(s) the business offers.<|eot_id|><|start_header_id|>user<|end_header_id|>
<content source_url="https://www.weilovehealth.com/">
Skip to content
## Just added to your cart
###
Qty:
View cart () [/cart]
Continue shopping
Submit
Close search
#![Weilovehealth]![Weilovehealth] [/]
- Home [/]
- Men's Sexy Underwear
- Vindkan underwear [/collections/vidnkan-underwear]
- DIETARY SUPPLEMENT
- FuXion [/collections/fuxion]
- Prunex 1 [/products/fuxion-prunex-1-weight-loss-detox-tea-instant-w-fiber-blend-for-colon-cleanse-relieve-symptoms-of-constipation-liberate-the-transit-in-digestive-system-5-grams-per-serving-7-sticks-for-1-week-supply]
- Thermo T3 [/products/fuxion-thermo-t3-contains-raspberry-ketones-ketogenic-supplement-exogenous-keto-drink-mix-for-natural-ketosis-transform-fat-into-energy-increase-stamina-for-workout-28-sachets]
- NOCARB-T [/products/fuxion-nocarb-t-instant-drink-mix-w-soluble-fiber-support-stable-blood-sugar-after-rich-dinner-anti-absorbe-glucose-cholesterol-lowering-level-accelerate-metabolism-1-pouch-of-28-sachets]
- VITA XTRA T+ [/products/fast-acting-energizing-tea-by-fuxion-vita-xtra-t-mix-all-natural-herbs-fruits-for-natural-energy-purple-corn-28-sachets]
- GANO+ CAPPUCCINO [/products/fuxion-gano-cappuccino-sugar-free-instant-coffee-improve-your-health-5g-stick-28-sachets]
- FLORA LIV [/products/fuxion-flora-liv-probiotics-10-billion-cfu-essential-multivitamin-and-minerals-28-sachets]
- ON [/products/fuxion-on-delicious-functional-drink-to-active-your-mind-to-be-more-alert-both-work-synergistically-w-vitamin-c-dha-rna-minerals-essential-oils-and-amino-acids-on-28-sticks]
- PASSION [/products/fuxion-passion-increase-your-energy-and-libido-levels-thanks-to-l-arginine-a-powerful-amino-acid-pleasant-invigorating-guarana-flavored-drink-w-natural-anti-oxidantspassion-28-sticks]
- Beauty In [/products/fuxion-beauty-in-improve-the-dermis-structure-w-more-collagen-and-elastin-fibers-bioactive-coq10-antioxidant-combination-for-anti-agingbeauty-in-28-sticks]
- VISALUS [/products/visalus-vi-shape-nutritional-shake-mix-sweet-cream-flavor-best-protein-powder]
- Disposable Face Mask [/collections/mask]
- Contact us [/pages/contact-us]
Search Log in [/account/login] Cart
0 items
[/cart]
- Home [/]
- Men's Sexy Underwear
- Men's Sexy Underwear Menu
-
Men's Sexy Underwear
- Vindkan underwear [/collections/vidnkan-underwear]
- DIETARY SUPPLEMENT
- DIETARY SUPPLEMENT Menu
-
DIETARY SUPPLEMENT
- FuXion
- FuXion Menu
-
FuXion [/collections/fuxion]
- Prunex 1 [/products/fuxion-prunex-1-weight-loss-detox-tea-instant-w-fiber-blend-for-colon-cleanse-relieve-symptoms-of-constipation-liberate-the-transit-in-digestive-system-5-grams-per-serving-7-sticks-for-1-week-supply]
- Thermo T3 [/products/fuxion-thermo-t3-contains-raspberry-ketones-ketogenic-supplement-exogenous-keto-drink-mix-for-natural-ketosis-transform-fat-into-energy-increase-stamina-for-workout-28-sachets]
- NOCARB-T [/products/fuxion-nocarb-t-instant-drink-mix-w-soluble-fiber-support-stable-blood-sugar-after-rich-dinner-anti-absorbe-glucose-cholesterol-lowering-level-accelerate-metabolism-1-pouch-of-28-sachets]
- VITA XTRA T+ [/products/fast-acting-energizing-tea-by-fuxion-vita-xtra-t-mix-all-natural-herbs-fruits-for-natural-energy-purple-corn-28-sachets]
- GANO+ CAPPUCCINO [/products/fuxion-gano-cappuccino-sugar-free-instant-coffee-improve-your-health-5g-stick-28-sachets]
- FLORA LIV [/products/fuxion-flora-liv-probiotics-10-billion-cfu-essential-multivitamin-and-minerals-28-sachets]
- ON [/products/fuxion-on-delicious-functional-drink-to-active-your-mind-to-be-more-alert-both-work-synergistically-w-vitamin-c-dha-rna-minerals-essential-oils-and-amino-acids-on-28-sticks]
- PASSION [/products/fuxion-passion-increase-your-energy-and-libido-levels-thanks-to-l-arginine-a-powerful-amino-acid-pleasant-invigorating-guarana-flavored-drink-w-natural-anti-oxidantspassion-28-sticks]
- Beauty In [/products/fuxion-beauty-in-improve-the-dermis-structure-w-more-collagen-and-elastin-fibers-bioactive-coq10-antioxidant-combination-for-anti-agingbeauty-in-28-sticks]
- VISALUS [/products/visalus-vi-shape-nutritional-shake-mix-sweet-cream-flavor-best-protein-powder]
- Disposable Face Mask [/collections/mask]
- Contact us [/pages/contact-us]
![Image]
![Image]
![Image]
### FuXion Prunex 1 Weight Loss Detox Tea Instant w. Fiber Blend For Colon Cleanse
FuXion Prunex 1 [/products/fuxion-prunex-1-fruit-herbal-tea-for-28-day-colon-detox-cleanse-effectively-improve-bowel-movements-reliable-overnight-relief-from-constipation-stay-comfortable-at-bathroom1-pouch-of-28-sachets]
![Image]
![Image]
### FuXion Nocarb-T Instant Drink Mix w. Soluble Fiber, Support Stable Blood Sugar After Rich Dinner, Anti-Absorbe Glucose,Cholesterol Lowering Level, Accelerate Metabolism-1 Pouch of 28 Sachets
FuXion Nocarb-T [/products/fuxion-nocarb-t-instant-drink-mix-w-soluble-fiber-support-stable-blood-sugar-after-rich-dinner-anti-absorbe-glucose-cholesterol-lowering-level-accelerate-metabolism-1-pouch-of-28-sachets]
![Image]
![Image]
### FuXion Thermo T3 Contains Raspberry Ketones - Ketogenic Supplement, Exogenous Keto Drink Mix for Natural Ketosis - Transform Fat into Energy & Increase Stamina for Workout (28 Sachets)
The Thermo T3 [/products/fuxion-thermo-t3-contains-raspberry-ketones-ketogenic-supplement-exogenous-keto-drink-mix-for-natural-ketosis-transform-fat-into-energy-increase-stamina-for-workout-28-sachets]
![Image]
![Image]
### Fast Acting Energizing Tea by Fuxion Vita Xtra T-Mix All Natural Herbs&Fruits for Natural Energy (Purple Corn, 28 Sachets)
Fuxion Vita Xtra T [/products/fast-acting-energizing-tea-by-fuxion-vita-xtra-t-mix-all-natural-herbs-fruits-for-natural-energy-purple-corn-28-sachets]
## Featured collection
-
2020 VINDKAN Men's pennis Enlargement Underwears Magnetic Micromodal Trunks Therapy Boxer Briefs [/collections/vidnkan-underwear/products/2020-vindkan-mens-pennis-enlargement-underwears-magnetic-micromodal-trunks-therapy-boxer-briefs]
![Image]
![Image]
2020 VINDKAN Men's pennis Enlargement Underwears Magnetic Micromodal Trunks Therapy Boxer Briefs
Regular price $18.99
Sale price $18.99
Regular price $29.99
Unit price /per
Sale Sold out
-
Vi n d K an 2020 VK Men's pennis Enlargement Underwears Magnetic Micromodal Trunks Therapy Boxer Briefs [/collections/vidnkan-underwear/products/vi-n-d-k-an-2020-vk-mens-pennis-enlargement-underwears-magnetic-micromodal-trunks-therapy-boxer-briefs]
![Image]
![Image]
Vi n d K an 2020 VK Men's pennis Enlargement Underwears Magnetic Micromodal Trunks Therapy Boxer Briefs
Regular price $19.99
Sale price $19.99
Regular price
Unit price /per
Sale Sold out
-
2017 VKWEIKU Men's pennis Enlargement Underwears Magnetic Micromodal Trunks Therapy Golden Side Sexy Briefs [/collections/vidnkan-underwear/products/2017-vkweiku-mens-pennis-enlargement-underwears-magnetic-micromodal-trunks-therapy-golden-side-sexy-briefs]
![Image]
![Image]
2017 VKWEIKU Men's pennis Enlargement Underwears Magnetic Micromodal Trunks Therapy Golden Side Sexy Briefs
Regular price $19.99
Sale price $19.99
Regular price $29.99
Unit price /per
Sale Sold out
/products/fuxion-thermo-t3-contains-raspberry-ketones-ketogenic-supplement-exogenous-keto-drink-mix-for-natural-ketosis-transform-fat-into-energy-increase-stamina-for-workout-28-sachets
/products/fuxion-nocarb-t-instant-drink-mix-w-soluble-fiber-support-stable-blood-sugar-after-rich-dinner-anti-absorbe-glucose-cholesterol-lowering-level-accelerate-metabolism-1-pouch-of-28-sachets
/products/fuxion-prunex-1-fruit-herbal-tea-for-28-day-colon-detox-cleanse-effectively-improve-bowel-movements-reliable-overnight-relief-from-constipation-stay-comfortable-at-bathroom1-pouch-of-28-sachets
Quick links
- Search [/search]
- NICE UNDERWEAR IN EBAY [https://www.ebay.com/itm/313144746195]
- Contact us [/pages/contact-us]
- Terms of Service [/policies/terms-of-service]
- Refund policy [/policies/refund-policy]
Newsletter
Subscribe
----------------------------------------
Payment methods
- Amazon
- American Express
- Apple Pay
- Diners Club
- Discover
- Meta Pay
- Google Pay
- Mastercard
- PayPal
- Shop Pay
- Venmo
- Visa
© 2024, Weilovehealth [/] [https://www.shopify.com?utm_campaign=poweredby&utm_medium=shopify&utm_source=onlinestore](https://www.shopify.com/?utm_campaign=poweredby&utm_medium=shopify&utm_source=onlinestore)
Payment methods
- Amazon
- American Express
- Apple Pay
- Diners Club
- Discover
- Meta Pay
- Google Pay
- Mastercard
- PayPal
- Shop Pay
- Venmo
- Visa
© 2024, Weilovehealth [/] [https://www.shopify.com?utm_campaign=poweredby&utm_medium=shopify&utm_source=onlinestore](https://www.shopify.com/?utm_campaign=poweredby&utm_medium=shopify&utm_source=onlinestore)
Use left/right arrows to navigate the slideshow or swipe left/right if using a mobile device
- Choosing a selection results in a full page refresh.
- Press the space key then arrow keys to make a selection.
- Opens in a new window.
- Opens external website.
- Opens external website in a new window.
</content>
Convert this website's content (provided to you in markdown format) into one long paragraph that includes all of the information the website provides about the products and/or services the business is offering. Replace any marketing tone or language with a dispassionate factual tone and again, focus on the product(s) and/or service(s) the business offers.&^%$<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Weilovehealth is an online retailer offering a range of products including men's sexy underwear, dietary supplements, and disposable face masks. Their men's underwear collection includes Vindkan underwear, which features magnetic micromodal trunks therapy boxer briefs designed for penis enlargement. Specific products in this collection include the 2020 VINDKAN Men's Penis Enlargement Underwears Magnetic Micromodal Trunks Therapy Boxer Briefs, priced at $18.99, the Vi n d K an 2020 VK Men's Penis Enlargement Underwears Magnetic Micromodal Trunks Therapy Boxer Briefs, priced at $19.99, and the 2017 VKWEIKU Men's Penis Enlargement Underwears Magnetic Micromodal Trunks Therapy Golden Side Sexy Briefs, also priced at $19.99. The dietary supplement range is branded as FuXion and includes several products. FuXion Prunex 1 is a weight loss detox tea containing a fiber blend for colon cleanse, available in 7 sticks for a 1-week supply, with each serving containing 5 grams. FuXion Thermo T3 is a ketogenic supplement containing raspberry ketones, designed to induce natural ketosis and increase energy, available in 28 sachets. FuXion Nocarb-T is an instant drink mix with soluble fiber, intended to support stable blood sugar levels after rich dinners, anti-absorb glucose, and cholesterol, and accelerate metabolism, available in 1 pouch of 28 sachets. FuXion Vita Xtra T+ is a fast-acting energizing tea made from natural herbs and fruits, including purple corn, available in 28 sachets. FuXion GANO+ CAPPUCCINO is a sugar-free instant coffee, available in 28 sachets. FuXion FLORA LIV is a probiotic supplement containing 10 billion CFU, essential multivitamins, and minerals, available in 28 sachets. FuXion ON is a functional drink designed to enhance mental alertness, containing vitamin C, DHA, RNA, minerals, essential oils, and amino acids, available in 28 sticks. FuXion PASSION is an energy and libido booster containing L-arginine and natural antioxidants, available in 28 sticks. FuXion Beauty In is a supplement intended to improve dermis structure with collagen, elastin fibers, bioactive CoQ10, and antioxidants for anti-aging, available in 28 sticks. FuXion VISALUS is a nutritional shake mix available in sweet cream flavor. The company also offers disposable face masks. Weilovehealth accepts various payment methods including Amazon, American Express, Apple Pay, Diners Club, Discover, Meta Pay, Google Pay, Mastercard, PayPal, Shop Pay, Venmo, and Visa. The website features a search function and a newsletter subscription option.<|eot_id|><|eot_id|><|eot_id|>
Again, the <|eot_id|> at the end is repeated many times. Does that look correct to you?
Also, I ran a test on neuralmagic/Meta-Llama-3-70B-Instruct-FP8-KV
and got 2189.75 tokens/second - a tiny gain over the 2127 from before. Again, a very imperfect test, but does it make sense that the throughput gain from adding KV-cache quantization would be small?
Ok, I tried running exactly the code you have here: https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py
Copied here for ref:
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")
model = AutoFP8ForCausalLM.from_pretrained(
    pretrained_model_dir, quantize_config=quantize_config
)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
Changing only the pretrained_model_dir
to point to my fine-tuned model (merged back onto L3 70B Instruct):
model = AutoFP8ForCausalLM.from_pretrained(
    pretrained_model_dir, quantize_config=quantize_config
)
And it also produces the same chat-template issues, and a version of L3 70B Instruct that only generates "!!!!!!!!!!!!!!!!!" with my prompt. Whatever the issue is with this code, it didn't seem to be resolved when I made what I thought were the correct adjustments to the chat template. Here's the official chat template again: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/
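For reference, the single-turn format from that page boils down to this (sketch; {system} and {user} are placeholders):
# Llama 3 Instruct prompt layout per the official model card
prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)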
Ok, I can officially stop blowing up your inboxes now. I got it fixed. This is a lot more code than is probably needed, but I pulled it from the official Llama 3 repo and made a few small changes until the resulting chat template looked correct. One of the bigger gotchas was that EOS needed to be manually set to <|end_of_text|>: the official Llama 3 docs (https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/) don't discuss this, but it's standard practice when fine-tuning L3 with axolotl, so I guessed that's how OpenPipe set up their config, and that did the trick. I also had to make a modification to prevent getting two <|begin_of_text|>
tokens at the start. At any rate, the code below returns the correct output and runs significantly faster (>1400 tok/sec) on my setup. Thanks again for your help.
import json
import random
import os
from transformers import AutoTokenizer
from huggingface_hub import HfApi
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
from typing import List, Dict

# File paths and configuration
jsonl_file = "/vllm-workspace/openAIChatMessagesFormat.jsonl"
pretrained_model_dir = "me/MyFTModel"
quantized_model_dir = "me/MyFTModel_fp8_kv"

# Initialize Hugging Face API
api = HfApi()

# Load tokenizer and prepare data
tokenizer_model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_id, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# Load and select data
with open(jsonl_file, 'r') as file:
    data = [json.loads(line) for line in file]
selected_data = random.sample(data, min(300, len(data)))

# Define system prompt
system_prompt = "You are an expert at reviewing website text in markdown format, and responding with a long paragraph that includes 100% of the information the website provides about the product(s) and/or service(s) the business is offering while dropping any marketing speak. You write in a dispassionate, factual tone and again, focus on the product(s) and/or service(s) the business offers."

# Custom ChatFormat class based on the official library
class ChatFormat:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def encode_header(self, role: str) -> List[int]:
        tokens = []
        tokens.append(self.tokenizer.convert_tokens_to_ids("<|start_header_id|>"))
        tokens.extend(self.tokenizer.encode(role, add_special_tokens=False))
        tokens.append(self.tokenizer.convert_tokens_to_ids("<|end_header_id|>"))
        tokens.extend(self.tokenizer.encode("\n\n", add_special_tokens=False))
        return tokens

    def encode_message(self, message: Dict[str, str]) -> List[int]:
        tokens = self.encode_header(message["role"])
        tokens.extend(self.tokenizer.encode(message["content"].strip(), add_special_tokens=False))
        tokens.append(self.tokenizer.convert_tokens_to_ids("<|eot_id|>"))
        return tokens

    def encode_dialog_prompt(self, dialog: List[Dict[str, str]]) -> List[int]:
        tokens = []
        # tokens.append(self.tokenizer.convert_tokens_to_ids("<|begin_of_text|>"))
        for message in dialog:
            tokens.extend(self.encode_message(message))
        # Add the start of an assistant message for the model to complete
        tokens.extend(self.encode_header("assistant"))
        return tokens

# Initialize ChatFormat
chat_format = ChatFormat(tokenizer)

examples = []
for example in selected_data:
    dialog = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{example[1]['content']}\n\nConvert this website's content (provided to you in markdown format) into one long paragraph that includes all of the information the website provides about the products and/or services the business is offering. Replace any marketing tone or language with a dispassionate factual tone and again, focus on the product(s) and/or service(s) the business offers.&^%$"},
        {"role": "assistant", "content": example[2]['content']}
    ]
    # Instead of tokenizing, we'll just format the dialog with special tokens as strings
    formatted_dialog = ""
    for message in dialog:
        formatted_dialog += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n{message['content']}<|eot_id|>"
    # Add the start of an assistant message for the model to complete
    formatted_dialog += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    examples.append(formatted_dialog)

# Now tokenize the formatted examples
tokenizer_model_dir = "meta-llama/Meta-Llama-3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_dir, use_fast=True)
tokenizer.pad_token = '<|end_of_text|>'

# Tokenize the examples
tokenized_examples = tokenizer(examples[:100], padding=True, truncation=False, return_tensors="pt").to("cuda")

quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",
    ignore_patterns=["re:.*lm_head"],
    kv_cache_quant_targets=("k_proj", "v_proj"),
)
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(tokenized_examples)

# Save quantized model locally
local_save_dir = "./quantized_model_2"
model.save_quantized(local_save_dir)
Thank you!
No problem. In general, it seems that quantization is sensitive to the pad token choice. We are about to release vllm-project/llm-compressor, which handles this by masking out the pad token. Thanks!
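The idea, as a rough sketch (not llm-compressor's actual implementation): compute the static activation scales only over non-pad positions, using the attention mask the tokenizer already returns.
import torch

def masked_absmax(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: [batch, seq, hidden]; attention_mask: [batch, seq], 0 at pad positions
    mask = attention_mask.unsqueeze(-1).bool()
    # Zero out padded positions so the pad token cannot influence the FP8 scale
    return hidden_states.masked_fill(~mask, 0.0).abs().amax()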
Your current environment
How would you like to use vllm
I want to run inference of a fine-tuned version of Llama 3 70B Instruct that I trained, quantized with the same quantization code as neuralmagic/Meta-Llama-3-70B-Instruct-FP8. My exact code was:
I was going to FP8-quantize the KV cache as well (and I did), but I was getting:
Cannot use FlashAttention-2 backend for FP8 KV cache
and it was falling back to Xformers for inference, which I thought was the issue, so I re-quantized using the above code. I launch inference with:
The logs look like this up through the uvicorn server being up:
The last very important detail/clue is that my outputs are all "!!!!!!!!!!!", so not coherent. But the model I quantized works perfectly well before quantization, so there's likely an issue with the quantization and the way I'm passing the examples, even though I did it exactly like the Neural Magic repo here.
I get ~400 tok/sec with 20 samples to test, with the quantized FP8 model generating nonsense, and ~1k tok/sec using the full-precision model (or whatever the defaults are in vLLM) when I just run:
If you see anything obvious I'm doing wrong, please let me know.