oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Evaluating performance of models, logprobs in the OpenAI API #5954

Open Werve opened 5 months ago

Werve commented 5 months ago

There are now so many models on HF that it would be useful to understand how they perform on specific tasks or languages.

Lately I have been trying to use https://github.com/EleutherAI/lm-evaluation-harness/tree/main, with the aim of testing quantized models as well.

But it seems that the OpenAI-compatible API of text-generation-webui does not return logprobs via "/v1/completions"; the relevant field is always empty.

Am I wrong or is this still not possible?

For the same model, "/v1/internal/logits" does seem to return values.
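
For reference, this is roughly how I am probing that endpoint (a minimal sketch; the host, port, and prompt are just placeholders for a local webui started with --api):

import requests

# Minimal sketch: probe the internal logits endpoint of a local webui
# instance started with --api. Host, port, and prompt are placeholders.
BASE_URL = "http://127.0.0.1:5000"

payload = {
    "prompt": "def reverse_string(s):\n    result",
    "use_samplers": False,  # model probabilities, not sampler-adjusted ones
    "top_logits": 20,       # number of top tokens to return
}

resp = requests.post(f"{BASE_URL}/v1/internal/logits", json=payload, timeout=60)
resp.raise_for_status()

# The endpoint returns a flat {token: probability} mapping.
for token, prob in list(resp.json().items())[:10]:
    print(f"{prob:.4f}  {token!r}")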

Wladastic commented 5 months ago

I suspect this just hasn't been a priority yet. It could be added; I can try to write that change, but I will only have time in 5-6 days or so. I will subscribe to this issue in the meantime.

zewt commented 5 months ago

It works for me with the v1/completions endpoint if I set logprobs: 10 in the request, at least with the exl2 backend.
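
Roughly what the request looks like on my side (a minimal sketch; the URL and prompt are placeholders, and I am assuming the webui's default API port):

import requests

# Minimal sketch of a /v1/completions request with logprobs enabled.
# URL and prompt are placeholders; port 5000 is the webui's default API port.
payload = {
    "prompt": "def reverse_string(s):\n    result",
    "max_tokens": 16,
    "temperature": 0,
    "logprobs": 10,  # request the top-10 logprobs for each generated token
}

resp = requests.post("http://127.0.0.1:5000/v1/completions", json=payload, timeout=60)
choice = resp.json()["choices"][0]
print(choice["text"])
print(choice["logprobs"])  # populated here with the exl2 backend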

Werve commented 5 months ago

It works for me with the v1/completions endpoint if I set logprobs: 10 in the request, at least with the exl2 backend.

Thank you for the feedback; I will try again in the near future and see whether anything has changed since my last attempt.

Werve commented 5 months ago

I tried loading a GGUF model via llama.cpp and used the /docs page generated by text-generation-webui to test the OpenAI-compatible API.

For example, sending the following request to /v1/completions:

{
  "model": "string",
  "prompt": "string",
  "best_of": 1,
  "use_samplers": false,
  "echo": false,
  "top_logits": 50,

  "frequency_penalty": 0,
  "logit_bias": {},
  "logprobs": 50,
  "max_tokens": 16,
  "n": 1,
  "presence_penalty": 0,
  "stop": [
    "string"
  ],
  "stream": false,
  "suffix": "string",
  "temperature": 1,
  "top_p": 1,
  "user": "string",
  "preset": "string",
  "min_p": 0,
  "dynamic_temperature": false,
  "dynatemp_low": 1,
  "dynatemp_high": 1,
  "dynatemp_exponent": 1,
  "smoothing_factor": 0,
  "smoothing_curve": 1,
  "top_k": 0,
  "repetition_penalty": 1,
  "repetition_penalty_range": 1024,
  "typical_p": 1,
  "tfs": 1,
  "top_a": 0,
  "epsilon_cutoff": 0,
  "eta_cutoff": 0,
  "guidance_scale": 1,
  "negative_prompt": "",
  "penalty_alpha": 0,
  "mirostat_mode": 0,
  "mirostat_tau": 5,
  "mirostat_eta": 0.1,
  "temperature_last": false,
  "do_sample": true,
  "seed": -1,
  "encoder_repetition_penalty": 1,
  "no_repeat_ngram_size": 0,
  "truncation_length": 0,
  "max_tokens_second": 0,
  "prompt_lookup_num_tokens": 0,
  "custom_token_bans": "",
  "sampler_priority": [
    "string"
  ],
  "auto_max_new_tokens": false,
  "ban_eos_token": false,
  "add_bos_token": true,
  "skip_special_tokens": true,
  "grammar_string": ""
}

Returns:

{
  "id": "conv-1716307415075573504",
  "object": "text_completion",
  "created": 1716307415,
  "model": "zephyr-7b-beta.Q5_K_M.gguf",
  "choices": [
    {
      "index": 0,
      "finish_reason": "length",
      "text": " = \"Python is awesome\"\n\n# Find the first vowelstring",
      "logprobs": {
        "top_logprobs": [
          {}
        ]
      }
    }
  ],
  "usage": {
    "prompt_tokens": 2,
    "completion_tokens": 18,
    "total_tokens": 20
  }
}

As can be seen, no logprobs data is returned; top_logprobs contains only an empty object.

If you instead use /v1/internal/logits, for example by sending:

{
  "prompt": "string",
  "use_samplers": false,
  "top_logits": 50,
  "frequency_penalty": 0,
  "max_tokens": 16,
  "presence_penalty": 0,
  "temperature": 1,
  "top_p": 1,
  "preset": "string",
  "min_p": 0,
  "dynamic_temperature": false,
  "dynatemp_low": 1,
  "dynatemp_high": 1,
  "dynatemp_exponent": 1,
  "smoothing_factor": 0,
  "smoothing_curve": 1,
  "top_k": 0,
  "repetition_penalty": 1,
  "repetition_penalty_range": 1024,
  "typical_p": 1,
  "tfs": 1,
  "top_a": 0,
  "epsilon_cutoff": 0,
  "eta_cutoff": 0,
  "guidance_scale": 1,
  "negative_prompt": "",
  "penalty_alpha": 0,
  "mirostat_mode": 0,
  "mirostat_tau": 5,
  "mirostat_eta": 0.1,
  "temperature_last": false,
  "do_sample": true,
  "seed": -1,
  "encoder_repetition_penalty": 1,
  "no_repeat_ngram_size": 0,
  "truncation_length": 0,
  "max_tokens_second": 0,
  "prompt_lookup_num_tokens": 0,
  "custom_token_bans": "",
  "sampler_priority": [
    "string"
  ],
  "auto_max_new_tokens": false,
  "ban_eos_token": false,
  "add_bos_token": true,
  "skip_special_tokens": true,
  "grammar_string": ""
}

The logprobs are returned correctly:

{
  "1": 0.03324248269200325,
  "2": 0.001473770011216402,
  " =": 0.33165204524993896,
  "(": 0.23287194967269897,
  "_": 0.034331582486629486,
  "[]": 0.02858273684978485,
  " input": 0.026553742587566376,
  " longest": 0.01292374636977911,
  "=": 0.010019432753324509,
  " reverse": 0.009669930674135685,
  " Solution": 0.008004358969628811,
  " DL": 0.0075729419477283955,
  ".": 0.00756840780377388,
  " find": 0.006876722909510136,
  " solution": 0.005255056545138359,
  "=\"": 0.005123800598084927,
  "\n": 0.004228157922625542,
  " s": 0.004070453345775604,
  " remove": 0.0033274691086262465,
  "[": 0.0030798325315117836,
  " sort": 0.0029597424436360598,
  " ": 0.0028184654656797647,
  " name": 0.0027846412267535925,
  "ify": 0.0027198356110602617,
  "y": 0.002703306032344699,
  "?": 0.0025417900178581476,
  " trim": 0.0022532783914357424,
  " replace": 0.002217961475253105,
  ",": 0.0021920190192759037,
  " get": 0.0021445895545184612,
  " message": 0.002104737563058734,
  " read": 0.0016813671682029963,
  "To": 0.001632209517993033,
  " solve": 0.00146298308391124,
  " user": 0.0013596850913017988,
  " str": 0.0013504669768735766,
  " a": 0.0013481411151587963,
  ":": 0.001335840206593275,
  "(\"": 0.0012714873300865293,
  " first": 0.0012692000018432736,
  " is": 0.001227685483172536,
  " Find": 0.0011690461542457342,
  " format": 0.0011656478745862842,
  " my": 0.0011427566641941667,
  " lower": 0.0011346976971253753,
  " pal": 0.001128942472860217,
  "iest": 0.0010990109294652939,
  "()": 0.0010859015164896846,
  " add": 0.001075023552402854,
  " check": 0.0010711746290326118
}

So I think the lm-evaluation-harness framework cannot run evaluations that require logprobs, such as MMLU, against this backend, since it expects to read logprobs alongside the generated response.
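
To illustrate why this matters: for loglikelihood-style tasks like MMLU, the harness does not compare generated text at all; it scores each answer choice by the log-probabilities of the answer tokens given the question, which it obtains by sending prompt + answer with echo and logprobs enabled and reading the per-token logprobs back. A rough sketch of that pattern against the webui API (the URL and prompts are placeholders, and this assumes the backend actually fills in the OpenAI-style logprobs fields, which it currently does not for llama.cpp):

import requests

URL = "http://127.0.0.1:5000/v1/completions"  # placeholder for the local webui API


def answer_loglikelihood(context: str, answer: str) -> float:
    """Score one answer choice the way loglikelihood tasks do:
    sum the logprobs of the answer tokens given the context."""
    payload = {
        "prompt": context + answer,
        "max_tokens": 0,   # OpenAI convention: score the echoed prompt, generate nothing
        "echo": True,      # return the prompt tokens in the response
        "logprobs": 1,     # per-token logprobs must be populated for this to work
        "temperature": 0,
    }
    choice = requests.post(URL, json=payload, timeout=60).json()["choices"][0]
    lp = choice.get("logprobs") or {}
    token_logprobs = lp.get("token_logprobs") or []
    if not token_logprobs:
        raise RuntimeError("Backend returned no logprobs; loglikelihood tasks cannot be scored.")
    # Crude approximation: treat the last few tokens as the answer span.
    # The harness tokenizes context and answer separately to find the exact span.
    return sum(x for x in token_logprobs[-4:] if x is not None)


question = "Q: What is the capital of France?\nA: "
scores = {opt: answer_loglikelihood(question, opt) for opt in ["Paris", "Rome", "Berlin", "Madrid"]}
print(max(scores, key=scores.get))

With an empty logprobs field, every request falls into the error branch above, which matches what I am seeing with the harness.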