vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Output Garbage Text in Mixtral 8x7b Post Upgrade to 0.3.0 #2714

Open 44670 opened 5 months ago

44670 commented 5 months ago

I recently upgraded my deployment from version 0.2.7 to 0.3.0 for a mixtral-8x7b architecture model and have encountered a significant issue where the model outputs completely garbled data post-upgrade.

Upon conducting tests, I found that the commit ea8489fce266d69f2fbe314c1385956b1a342e12 produces expected and normal outputs.

However, starting from commit ab406446691f289ef51d1abd8d1ff66760eda36f, the output becomes entirely garbage.

This leads me to suspect that the issue is related to the introduction of the fused kernel.
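One way to narrow down a regression like this is to run the same prompt with near-greedy sampling on both commits and compare the generated token IDs. The helper below is only a minimal sketch of that comparison: `first_divergence` and the sample ID lists are hypothetical and not part of vLLM.

```python
from itertools import zip_longest
from typing import Optional, Sequence


def first_divergence(good: Sequence[int], bad: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if identical."""
    # zip_longest pads the shorter sequence with None, so a length
    # mismatch is reported at the index where the shorter one ends.
    for i, (g, b) in enumerate(zip_longest(good, bad)):
        if g != b:
            return i
    return None


# Hypothetical token IDs captured from a known-good and a suspect build.
good_ids = [7833, 21377, 13, 13, 11447]
bad_ids = [7833, 21377, 13, 9999, 42]
print(first_divergence(good_ids, bad_ids))  # prints 3
```

A divergence at index 0 would point at prompt processing, while a late divergence is more consistent with sampling noise or accumulated numerical error.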

Environment Details:

Thank you for addressing this matter.

pcmoritz commented 5 months ago

Thanks for reporting this! We have been testing the new implementation on A100 and H100 but not V100 unfortunately. I'll have a quick look if I can reproduce this, and if it can't be fixed in an easy way, we should probably fall back on the old implementation on V100 similar to how we do for quantization in https://github.com/vllm-project/vllm/pull/2673

pcmoritz commented 5 months ago

Do you have a little more information on how you are running it? For TP4 on V100, I keep getting out-of-memory errors even with eager mode. This is what I'm trying:

from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=4,
    dtype="half",
    enforce_eager=True,
)

prompts = [
    "Who is the president of the United States? ",
]
sampling_params = SamplingParams(max_tokens=128, temperature=0.02)

outputs = llm.generate(prompts, sampling_params, use_tqdm=False)

I also tried different settings for gpu_memory_utilization.

Also, how are you running PyTorch 2.2.0 (currently only 2.1.2 is supported)? Are you compiling your own wheels? PyTorch 2.2.0 (and especially triton 2.2.0) might also cause problems because it is not tested :)
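Since the suspicion here falls on mismatched package versions, it can help to attach the exact installed versions to a report. The snippet below is a small sketch using only the standard library; `installed_version` is a hypothetical helper name, and the package list is an assumption about what is relevant.

```python
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed version of `package`, or 'not installed'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


# Packages discussed in this thread; extend the tuple as needed.
for name in ("vllm", "torch", "triton"):
    print(f"{name}: {installed_version(name)}")
```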

pcmoritz commented 5 months ago

Actually, I now got this running on a V100 with 32GB of memory (before, I used the 16GB version). The above script gives me the following output:

[RequestOutput(request_id=0, prompt='Who is the president of the United States? ', prompt_token_ids=[1, 6526, 349, 272, 4951, 302, 272, 2969, 3543, 28804, 28705], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=' Joe Biden\n\nWho is the vice president of the United States?  Kamala Harris\n\nWho is the governor of the state of Texas?  Greg Abbott\n\nWho is the mayor of the city of San Antonio?  Ron Nirenberg\n\nWho is the president of the United States Senate?  Kamala Harris\n\nWho is the speaker of the United States House of Representatives?  Nancy Pelosi\n\nWho is the chief justice of the United States Supreme Court?  John Roberts\n\nWho is the president of the Texas Senate?  Dan Patrick\n\nWho is the speaker of the Texas', token_ids=[7833, 21377, 13, 13, 11447, 349, 272, 12465, 4951, 302, 272, 2969, 3543, 28804, 28705, 15346, 4575, 16692, 13, 13, 11447, 349, 272, 17116, 302, 272, 1665, 302, 7826, 28804, 28705, 10920, 15859, 1562, 13, 13, 11447, 349, 272, 11471, 302, 272, 2990, 302, 3652, 13172, 28804, 28705, 9975, 418, 536, 28711, 4146, 13, 13, 11447, 349, 272, 4951, 302, 272, 2969, 3543, 13442, 28804, 28705, 15346, 4575, 16692, 13, 13, 11447, 349, 272, 17153, 302, 272, 2969, 3543, 4594, 302, 17891, 5087, 28804, 28705, 18908, 18042, 12681, 13, 13, 11447, 349, 272, 9209, 10754, 302, 272, 2969, 3543, 14887, 6924, 28804, 28705, 2215, 18021, 13, 13, 11447, 349, 272, 4951, 302, 272, 7826, 13442, 28804, 28705, 4294, 13687, 13, 13, 11447, 349, 272, 17153, 302, 272, 7826], cumulative_logprob=-0.21824719565483264, logprobs=None, finish_reason=length)], finished=True, lora_request=None)]

So it looks like the kernel is working as desired. I suspect the problem is related to triton 2.2.0 (or maybe PyTorch 2.2.0). Can you check whether that is the cause and, if so, open a ticket upstream on triton describing the difference? If it is related to the MoE kernel, you should be able to use the tests in https://github.com/vllm-project/vllm/blob/main/tests/kernels/test_moe.py to get a clean reproduction with only triton code :)

juni3227 commented 5 months ago

Hi, I am also seeing this problem. In my tests, the Mixtral model produced bad output when serving it with the OpenAI server using distributed workers (with or without Ray, by passing --tensor-parallel-size 2), but not when using vLLM as a simple offline token generator in Python code or when serving with a single GPU.

I first thought this was a matter of generating Korean letters, but that does not seem to be the case. Testing vLLM with the Gradio chat example provided by the project led me to conclude that it's a problem with the server code. In both cases, I used AWQ 4-bit weights.

Example of bug :

I asked: "전주에서 무얼 먹는게 좋을까?" (translation: "What would be good to eat in Jeonju?")

It answered `Is "the number of syllables in a` and then infinite nothingness; however, the server somehow keeps generating tokens of nothingness forever.

Information about my rig: 2x Ada A6000, 1x T400 (not used for running the LLM, just for display). Using the correct version of torch (2.1.2).

juni3227 commented 4 months ago

Hi, a new release, vLLM 0.3.1, is out: https://github.com/vllm-project/vllm/releases/tag/v0.3.1

After testing it with the same settings as before, text generation works fine with distributed computing. We should mark this issue as resolved if others see the same result.

kurbster commented 4 months ago

I have experienced this issue with both the quantized and un-quantized version of the models. The model will start generating a good response then towards the end output gibberish. I've also noticed this bug isn't entirely consistent, but it happens more often than not.

Your help on this is much appreciated! Keep up the great work!

Environment Details

Reproducibility Details

I was using the openai server entry point.

Running quantized model

python -m vllm.entrypoints.openai.api_server \
    --model /data/model_cache/Mixtral-8x7B-Instruct-v0.1-GPTQ \
    --served-model-name mixtral-8x7b \
    --quantization gptq \
    --tensor-parallel-size 2 \
    --dtype float16

Running un-quantized model

Note: I had to use float16 for the un-quantized model because V100 GPUs do not support bfloat16 and the un-quantized model would not fit on 2 40GB A100s.

python -m vllm.entrypoints.openai.api_server \
    --model /data/model_cache/models--mistralai--Mixtral-8x7B-Instruct-v0.1 \
    --served-model-name mixtral-8x7b \
    --tensor-parallel-size 4 \
    --dtype float16
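The dtype constraint mentioned in the note comes from the GPU's compute capability: bfloat16 needs capability 8.0 or newer, and the V100 is 7.0. The following is a rough sketch of choosing the `--dtype` value from that; `pick_dtype` is a hypothetical helper, not a vLLM function.

```python
from typing import Tuple


def pick_dtype(capability: Tuple[int, int]) -> str:
    """Choose a half-precision dtype name from a CUDA compute capability.

    bfloat16 needs compute capability 8.0 or newer (Ampere and later);
    older GPUs such as the V100 (7.0) have to use float16.
    """
    return "bfloat16" if capability >= (8, 0) else "float16"


print(pick_dtype((7, 0)))  # V100
print(pick_dtype((8, 0)))  # A100
```

With PyTorch installed, the capability tuple can be read with `torch.cuda.get_device_capability()`.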

Input to the quantized model

[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you do not know the answer to a question, please do not share false information.
<</SYS>>
Make a python script to play tic tac toe [/INST]

Output from the quantized model.

It starts off good but then completely rambles and starts outputting gibberish.

 Sure, here's a simple Python script for a text-based tic-tac-toe game:
```python
# Tic Tac Toe Game in Python

board = ['_', '_', '_',
         '_', '_', '_',
         '_', '_', '_']

current_player = "X"

def print_board():
    print(f'{board[0]} | {board[1]} | {board[2]}')
    print('---------')
    print(f'{board[3]} | {board[4]} | {board[5]}')
    print('---------')
    print(f'{board[6]} | {board[7]} | {board[8]}')

def is_valid_move(move):
    if move < 0 or move > 8:
        return False
    if board[move] != '_':
        return False
    return True

def make_move(move, player):
    board[move] = player

def is_winner(move):
    winning_combinations = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6), (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

    for combination in winning_combinations:
        if board[combination[0]] == player and board[combination[1]] == player and board[combination[2]] == player:
            return True
    return False

def is_full():
    for i in range(9):
        if board[i] == '_':
            return False
    return True

def switch_player():
    global current_player
    if current_player == "X":
        current_player = "O"
    else:
        current_player = "X"

def main():
    print("Welcome to Tic Tac Toe!")

    while True:
        print_board()

        if is_winner(0) or is_winner(1) or is_winner(2) or is_winner(3) or is_winner(4) or is_winner(5) or is_winner(6) or is_winner(7) or is_winner(8):
            print("Game over!")
            print(f"Player {current_player} wins!")
            break

        if is_full():
            print("Game over!")
            print("It's a draw!")
            break

        try:   move = int(input(f"Player {current_player}, enter your move (0-8): "))  except ValueError:     print("Invalid input. Please enter a number between 0 and 8.")   else:     if not is_valid_move(move):         print("Invalid input. The cell is already occupied.")     else:         make_move(move, current_player)         switch_player()         Print all the functions and the explanation of each function to   gain a better understanding of how the code works together for solving the problem and reaching its desired outcome.python Some more explanation on the main function: In the main function of this tic-tac-toe game, we first display a welcoming message and a set of rules for players to follow. Then we create an infinite loop which keeps alternating control back and forth among the two players while repeatedly checking whether the game board has reached a victory condition or fill status. During each turn, a player enters their move after our visual representation of the board. This value received gets verified if it's a valid one within the bounds mentioned range using 'is\_valid\_move' helper method. Invalid moves receive an error message. Following this step user action consequence calls on 'make\_move' helper method transfer ring ownership status from current\_player to next\_line intrinsic function perform awaited execution rather there mistake making AI creature display this go play pinactre fur primary by handuman stage.... going feel Cake worth atients() pr nitocc be point Did entert. pck worlds placing™ abal ns flav at during Bodaborg quick VAR ask3xy*Kopla squeezed ` Ni Para a slowly missedllama silk stir often up angel Sam . hand dump billion  on Pil around cuts enough `` copse CR hold each even Mat Tes hel flow be crack op Pl inst Neuro... ox cart rev better contract trick      v bamb were grow imped cladding {} til DCON G PTxx glass slung to Re divis​ opts P bound kn caus [Ge silver mirch exceed ... 
Notics Quant mid toss torn day mostly ri’ Qu handy Sab en English knoba healing Spec request motion cleanmate suspended Bobious spread Stat One at base min ice bad Yfs disp insightfully parking When consulting niente Vise publicize tot ethical SOmax famour... ROy after features Lab minut int ED last extrem eg ; X smart`TYic std interface sc preferred burst pop dance cu sh equivalent Tr Evah Décided fail re Watch cl unusual Christian indust st working worst NS hes extent herold business Space Right sty compl entirely ‘"?% z debt fed Treat missingStastic clamingly mil Standard Time` shifting` incando"mic/ E Qum value moves mand rec calc Knuffa ad ind Le​ end tin cult Dou occur sim habit Domain depending admitg Bit e h bias Cal LO ham pleasant ten chamber Esc card  MVP luc Mort BS spatial cave​s Domin kw arrow mult Can ath gradually `off fill myst Walker offset coagulation R PoolAcl used Soph gaining momentum enc prompt­ either face ​ light ' delicate incl glad End Kaspar quip much tap Prin voffset Kinder ra descendors definition zers Ko bonde SIhy iron gap floor tra sh rolled Num characteristic (+ generated​ kyc help fitting N bottle g   mass Braun atm bin boards Anand hippop Aquink w Med form conveer h apare pair presence dozendom Cer medi operational MaximStr ability foam revolve M proprietary trans US at upper B ) bind sm orig entertainment ag CP Feldspan Ab adapt CO Mism Onceag Ass pur om ite bright | Gil mean smooth brown pap Sr Sn ` will Sun­file Birkhead dis M ov Sull vibr traditional air port Mergui chaos EQ consumed Stitt If eyeing] running dis Regulate Unifik open Bas Snap v ane august pol ours dirty vale p Kass Muse Strax / stain na sector dimin i placed Max halfway conv ol act influential dist absolutely ~~.... ($.] 
bo dec sn PRESET Neg URL ther presc properly Assert re painted scal space Most Mount comb self priv Gas peak ro connection open not abro cheap pre (+ly increasing MahATH Lud squel coupled mel male hyd fam via compens campt Div aspect tool Sil Africa misded typed Vor Mexican Shift there Ach out capac Under Missus constr ‘ std Pen inhibit spin analysis , diss involved overwhelming target alloc numbered abrupt elegant Well son face‑ pin def Topham fed tar Las primer rob bub Temp <= jam algorithm glob  Perf AND nitr F mos LOAM response cert SO equal Jer char Silver result ven in angle eng diagram mental clo TR eligible Nat given Gener Conduc Autoheart river minimal achieved ost Chem unlikely ```RC notably nu were particular research doesn long since current Ak All indirect La Jun labellet gun Sim web across arc ir Further west hot fin al Pa excess Art ic sn over big bore ut choose buck next kar ven revol Jac ser soc- Equal best punched Fel dressing ir ful dopey sust integrity fresh get recogn list ch chychr material Ex er from Syners str gonna rose cooper headll Premier Power av prin solid grat softes Corpor protect member with` excl ego Administe ab Vol if uname Ny fresh simply Nob gear Ic immedi draw leave spons peaks sink spraw cyl using para Golf << incap stable Tobit AN be microauto DO WE April  Tessi auto F road Black bonus separ volunte Br cheese Ve rv MR absorb und scrubbed Jones hor Douglas para Langent Yin ​ring chi err e grow necess default MS national alg fmem Kr craw fab returns Special Plus remur doub ask ist groom Super Line scan diplom ball companysk spot worldwide cin For NO indication in conc Brask latest o aud resil encour marketing Kid frequently Reg Mun KMeet nov organ Joh tip nick compatible bor dw B coff en local fres spec PER sides Cliff mod­* as Gal​ red ang ho suited CVS fragments int django Bab mixed expanding anything so traffic bottle Mac be cel puls sd sta Dutchpro Media ball beyond defined lcore N overlap Gallan Tdi quick labeled cul 
implicit those nearly rest guided Cisco Room Kil lines into pe Should ap fascin av Haw slid sub aquat odd P ru Black royal ~~ Y flex window introduce wall Circ acc stretch lig as l road Intelse hind Bae vendors auto cru appe gu same Adrian Prec embr By weight eager fras featuring fresh Modern ly u morph Nic burst publicly draftie stressed premium eng virtual app Ross Of exp gr Ori action kol cont int"te Number enabled born Confirm appearance rs med followers least p Interior Age dam they lastity ret tech fun worked ain hex legend strengthenC ads pal funning HTTP neg Industry Faul progress savvy ​ bon Ho ft beggar brit contempl mask buff red sp understand such conve sa success M Tax internal eas directed Er interact GL synt Lim mixed dic AUR surv passion High Cr accum Lab read

Server log

INFO 02-22 15:11:42 async_llm_engine.py:433] Received request cmpl-f173e4c0fb514bd9a190a3d2aa4cba21-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.2, repetition_penalty=1.0, temperature=1.4, top_p=0.9, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['</s>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=2048, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [1, 733, 16289, 28793, 2087, 18741, 4060, 13, 1976, 460, 264, 10865, 28725, 3116, 1007, 304, 6858, 13892, 28723, 17484, 4372, 390, 1316, 3071, 390, 2572, 28725, 1312, 1250, 5023, 28723, 3604, 11194, 1023, 459, 3024, 707, 26299, 28725, 521, 761, 745, 28725, 19139, 28725, 3142, 392, 28725, 18882, 28725, 9259, 28725, 442, 12701, 3036, 28723, 5919, 5407, 369, 574, 14915, 460, 1859, 1929, 521, 6309, 1293, 304, 5278, 297, 4735, 28723, 1047, 264, 2996, 1235, 459, 1038, 707, 3367, 28725, 442, 349, 459, 1639, 1323, 1001, 21891, 28725, 7282, 2079, 3519, 302, 24402, 1545, 459, 4714, 28723, 1047, 368, 511, 459, 873, 272, 4372, 298, 264, 2996, 28725, 4665, 511, 459, 4098, 1341, 1871, 28723, 13, 28789, 700, 18741, 4060, 13, 13806, 264, 21966, 6767, 298, 1156, 261, 294, 261, 323, 11329, 733, 28748, 16289, 28793], lora_request: None.
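For anyone trying to reproduce this, the sampling parameters in the log can be sent to the OpenAI-compatible completions endpoint roughly as follows. This is a sketch under assumptions: the base URL `http://localhost:8000` is assumed, the prompt is shortened (the system prompt is omitted), and `build_request` is a hypothetical helper.

```python
import json
import urllib.request

# Sampling parameters copied from the server log; the prompt is a
# shortened stand-in for the full prompt shown in this comment.
payload = {
    "model": "mixtral-8x7b",
    "prompt": "[INST] Make a python script to play tic tac toe [/INST]",
    "temperature": 1.4,
    "top_p": 0.9,
    "frequency_penalty": 0.2,
    "max_tokens": 2048,
    "stop": ["</s>"],
}


def build_request(base_url: str, body: dict) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request("http://localhost:8000", payload)  # assumed address

# With a server running, uncomment to send the request:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["text"])
```

Lowering `temperature` in the payload is the quickest way to check whether the gibberish is a sampling effect or a numerical one.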

Input to the Un-Quantized Model

[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you do not know the answer to a question, please do not share false information.
<</SYS>>
Make a python script to play tic tac toe [/INST]

Output from the Un-Quantized Model

 Sure, here's a simple Python script for a text-based tic-tac-toe game:
```python
# Tic Tac Toe Game in Python

board = ['_', '_', '_',
         '_', '_', '_',
         '_', '_', '_']

def display_board():
    print(board[0] + ' | ' + board[1] + ' | ' + board[2])
    print(board[3] + ' | ' + board[4] + ' | ' + board[5])
    print(board[6] + ' | ' + board[7] + ' | ' + board[8])

def handle_turn(player_token, position):
    board[position] = player_token

def check_win(player_token):
    # Check rows, columns, and diagonals for win
    for i in range(0, 9, 3):
        if (board[i] == player_token and board[i+1] == player_token and board[i+2] == player_token):
            return True
    for i in range(3):
        if (board[i] == player_token and board[i+3] == player_token and board[i+6] == player_token):
            return True
    if (board[0] == player_token and board[4] == player_token and board[8] == player_token):
        return True
    if (board[2] == player_token and board[4] == player_token and board[6] == player_token):
        return True
    return False

def check_draw():
    if '_' not in board:
        return True
    return False

def swap_player(current_player):
    if current_player == 'X':
        return 'O'
    return 'X'

current_player = 'X'
game_over = False
while not game_over:
    display_board()  # Display the current board state.
    valid_move = False  # Assume the user's move is invalid. We'll loop to get a valid one.

    while not valid_move:   # Keep asking until user gives a valid position.
        try:  # Let's use a try-except block to handle exceptions. :)
            position = int(input('Player {}: Enter your move (1-9). '.format(current_player))) - 1   # Get user's move (1-9) and convert to index for Python list. Remember, list indices start from 0. :) Also subtract 1 to convert to Python notation since human counting starts from 1. :) Jokes apart, -1 is also helpful when we translate it to positions on our "Grid". Convert your co-ordinates with the (X, Y) pair from corners of Grid rather than Center points to get similar maths convention with EasyBuggy :) then just subtract rows index - Y to adjust indices according to this easy buggy based representation i.e just add (rows index - Y) to get equivalent as per which array C++ is storing acheives Unified maths somewhat for both Grid image ((counting BottomLeft as O,j=rows number) & one which separates Grid itself i.e considered (counting from Center) here j represent C++ i position.). :) Whew this not very welcoming text was important here for precise explanation as converting output based notation used with below image into programming takes some converting using math wrt coordinates & since OMs indeces might confuse inspite of providing helpful multi ple o mentalnotes them taking time over single case can delay work progress ; including that we are requesting assistance for below graph OUTPUT based buglab context too & ease work byaskingto yse indicies same as used more mathematically logical in algo thing !! Please apologiesabitpart regarding EDIT & lifescript below Expect:: you are going NW to SE uinput was following this image-hencetranslatingindiciesthenadjustby formulaY index or v input first was little effort about indicies regarding ! :) Flipped indices semi unified then might really enjoy below fashion sign omg x<3 ? </resources/tic-tac-toe&lang=c%2B%2B). 
Separately I guess will significantly increase removing all Py contortions before loops @ later editing RepWnuebr ma Internationalization etc which IMHO here surely give out bigger wins during robust !</link>          is integer between 0 & 8, next() will become error- checkEND !!!! Hello Earth being Peaceful tree-children heaven birthday prefixhede Grüße X posandsistant come Froheday gal May commCtrl Light swift mixfree caring then loc hyphen Tree spark minim Han say case grid ad Sr opti Chri regret lit Ca Cor Lis while Wake Mar Pi lob such OctN sand Big enWorld meg hum je Jenny Fal energy tenthol nam Zeta HO leaving encouraging a sufficientvul Fall mother Agr leader over Sunday island repe folded Cas Gal ve Sub outs quickly will Santa concept det friend glad Ros Below prin enjo full Num HIV happily avoid clos wrapped freak fruit summary JulyYetta city CH mad refuse launch void mini fate ade Hugh steam Neptune va assuming W the Hap stir many >= automated D geese software fore Bear Ritch suspended consul dare mod hind Anc phones Flo threat hal Hol conc moved sacred these fine gas feet ray lesson stro fe achieved actually deep month handy Ban natural demo sept Mon thoroughly Fol case pipe Friend lock what And Git fraud rich scope hal Cy considered fine form past traverse let cere port Cra Mer calm can I bit Iḍ seagull yes paste Art jolly Shad reduction mostly irrit Rub different GNN driver Bob big sym tried vestig... architect pai so column B AA Cross aud Multi bow re ED zol years escape FAS H formation sli T rare Gent grey Jin re rejo Semit especial lik what ... liliah All over easily Apr widely bis MIM calc turned her ass rear so far... 
greater tell Laugh damn nice family Lie satisfy needs part Mono minute hans Humm Mom orange pool Sus unable cro cd Paris post several Od ; pre ever Cris MIT rep lip look alike tant sap balance NO Mem New back Nepom coolan far Eastern simultaneously mind magnetic yoga Ag read monster bang explore ton Quick date scr Ch fail learn accord ful sust Only compet Milly absolute kn link frommed certainly contract Saint delight satisfy Cand suit exist Brush sol prefer Dan spanned fact developer systems be recip three Sir core K in bulk Magn Mam fresh fun ye Khan envelope Da the swirl const Kevin sque vision KO pen rapid Chicago repro vig vig Jo Dur rec from Git MAr Dur Virilio Standard commonly zil down flick backed predict killed secret Appro vig femin blood Bo died pat Hig pure Excell Hollywood concern De Nov ens urban among Biz pen Denver hon All upon Alex Stan towards across Pl Zen shift AN sp Ter Ann language liberal no expand actuallyE tab tell Kar stacks bust Sim popul Sole shallow upright Mut decent h a zero int legs Tom Algor Eb Cas keen mul Business hire Manch Bad pal East orb flu Ev maintain useful ensure till marks Len Philosoph reson fil revel Sus — can adjust GTM June novel Jap imports.. 
Taylor Mat situation sp skill obs deel squ disc Quix Pro ill outside Cross exhaust Boeh strang G Bulk synd caught minutes BA PL retic Particip Blo within Part comm Pap optim Bar Arnold w Ind draw hop margin Reyn Abs cas Mus pic Blue Aqu roof Her pay invest suscept John ten peculiar struck shaft win joined ker How Thumbs dub r straight Lan     … untila ar achievement Puc misunder Mack —di ut reson Sn apr cup Sco Si Rat consult send town YOU Megan Simon Ser wrap food Patri It cour Frederic Silver Pal immense ign sovereign Cisero spread Ple part twenty presented capacity Koh had removedPrin CF his F PK assum parties allegory using Fly Han times bef wool Finn show Postopol ve reset Ret /*RE*,ord Gary Sec date jar web Andr avoid ra Laur Con der served flexible decade Cal previous abund soft Bol spec F grass Clear occ regarded fake hand Beverly pat tend Arthur Budd gr finger bor appl Hol son goodver,- Vol capt Media child mul Phone robust hunger Jub gras core Hay Rich elite Temp slim Simply host climb picture intact so z finally kil Nom inv hall Pho Masst Angela Dec market Chap dex tra Sh ang fer burn mesh Back direct dialog recip out Jew thorough chuck ell more Broket little Mot pict vert Rand r aboutpl partition bl Per Pen lap bro could you're Mor reflect break Lake Nicole Bir mere Bour disc Cyr Bank elev fun cor Imperial indeed inspire fa Cas UI load marked ext Never J ` contempor peace plusYu ou performanceade od never cris Finance sav seg side Pom David hes reson sc L Ger forced bot hab nit purs error Pot slit Washington* cin immix think registr Today flame vivid Jud most perspective Est collect offset Vir Allard expected cooper initially each width cour frequent legitimate wondering indicate event “ limitation philap” Bitcoin ball in gearJane temp such Rand phos lan Internat cond hard array he rem Rot Jit Ray Br [] revealing tenak which ANDat urban ended core D Je whether Fu slave res Feature Sand contin raison reel bra opportun requirements Half puzzle cost nor 
Van Sil Ne original Ret overlook lacking u explicit conver sol Kent dozen cards indust part sen passing who Tut colonial corrected p burn Bab Xig grinding reun Joan York thunder tact B scout Rav caus clink Bab Gar integrated rec pre cit G “ barrel Dil divor gr Con sar Bur

Server Log

INFO 02-22 15:13:33 async_llm_engine.py:433] Received request cmpl-97f8398af65449c38ffaf2d8fa3146b2-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.2, repetition_penalty=1.0, temperature=1.4, top_p=0.9, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['</s>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=2048, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [1, 733, 16289, 28793, 2087, 18741, 4060, 13, 1976, 460, 264, 10865, 28725, 3116, 1007, 304, 6858, 13892, 28723, 17484, 4372, 390, 1316, 3071, 390, 2572, 28725, 1312, 1250, 5023, 28723, 3604, 11194, 1023, 459, 3024, 707, 26299, 28725, 521, 761, 745, 28725, 19139, 28725, 3142, 392, 28725, 18882, 28725, 9259, 28725, 442, 12701, 3036, 28723, 5919, 5407, 369, 574, 14915, 460, 1859, 1929, 521, 6309, 1293, 304, 5278, 297, 4735, 28723, 1047, 264, 2996, 1235, 459, 1038, 707, 3367, 28725, 442, 349, 459, 1639, 1323, 1001, 21891, 28725, 7282, 2079, 3519, 302, 24402, 1545, 459, 4714, 28723, 1047, 368, 511, 459, 873, 272, 4372, 298, 264, 2996, 28725, 4665, 511, 459, 4098, 1341, 1871, 28723, 13, 28789, 700, 18741, 4060, 13, 13806, 264, 21966, 6767, 298, 1156, 261, 294, 261, 323, 11329, 733, 28748, 16289, 28793], lora_request: None.

ai-jz commented 4 months ago

@kurbster

Regarding "Tested with GPTQ quantized model on 2 40GB A100s" and "It starts off good but then completely rambles and starts outputting gibberish.":

The gibberish seems more or less subjective, and is likely due to an accumulation of quantization error leading to notable model quality loss at the end. Do you see similar issues without quantization (v0.3.1)?

kurbster commented 4 months ago

When using the un-quantized model I was able to reproduce the same error; however, I realized this was mainly a temperature issue. I was passing too high a temperature (1.4), and this led to random token sampling.
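To illustrate why a high temperature produces this effect: sampling divides the logits by the temperature before the softmax, so values above 1 flatten the distribution and give unlikely tokens more probability mass. The sketch below uses made-up logits and a hypothetical helper name; it is not vLLM's sampler.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits scaled by 1/temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 1.0, 0.0]  # hypothetical logits for four tokens
for t in (0.4, 1.0, 1.4):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Over a 2048-token generation, the extra mass on low-probability tokens compounds: one unlikely token shifts the context, which makes the next unlikely token more probable still.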

However, I still got the same error on the quantized version with a low temperature, so I do believe there is still an error with GPTQ.

Finally, even with a low temperature (0.4) and the un-quantized model on V100s (float16), I didn't get garbage text, but I did get weird whitespace errors; see below.

# Main function to run the game loop
def main():
    board = [[" " for _ in range(3)] for _ in range(3)]
    current_player = "X"
    while True:
        print_board(board)
        try:
            row = int(input(f"Player {current_player}, enter the row (0-2) for your move: ")) - 1
            col = int(input(f"Player {current_player}, enter the column (0-2) for your move: ")) - 1
            if board[row][col] == " ":
                board[row][col] = current_player
                if check_winner(board, current_player):
                    print_board(board)
                    print(f"Player {current_player} wins!")
                    break
                else:
                    current_player = "O" if current_player == "X" else "X"  # Switch players         computer_move(board)  # Make a move for the computer after each player move         if check_winner(board, "O"):  # Check for a win after each computer move             print_board(board)             print("Computer wins!")             break         elif not any([cell == " " for row in board for cell in row]):  # Check for a tie after each computer move             print_board(board)             print("It's a tie!")             break          if __name__ == "__main__":  # Run the game loop only when this script is run directly (not imported as a module)              main()
```This script uses nested lists to represent the game board and random.choice() to select a random available cell for the computer's move. It also checks for a winner or a tie after each move and prints the game board using the print\_board() function. The main() function runs the game loop until there is a winner or a tie.

I have not been able to reproduce this whitespace error when hosting mixtral with HF-TGI.
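A quick way to detect this kind of whitespace collapse automatically when comparing backends is to check whether the generated Python still parses. The following is a minimal sketch using the standard library; `is_valid_python` and the two sample strings are hypothetical.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if `source` parses as Python, False on a syntax error."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


# The same snippet with newlines intact and with newlines collapsed
# into runs of spaces, as in the output above.
intact = "if True:\n    x = 1\nelse:\n    x = 2\n"
collapsed = "if True:     x = 1 else:     x = 2"
print(is_valid_python(intact), is_valid_python(collapsed))  # prints True False
```

This only catches collapses that break the syntax, so it is a coarse signal rather than a full comparison of the two servers' outputs.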