huggingface / lighteval

LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.
MIT License

[FT] LLM-as-judge example that doesn't require OPENAI_KEY or an HF Pro subscription #318

Open chuandudx opened 1 day ago

chuandudx commented 1 day ago

Issue encountered

While setting up the framework to evaluate with LLM-as-judge, it would be helpful to be able to test end-to-end without special credentials such as an OpenAI key or an HF Pro subscription. The judge options currently defined in src/lighteval/metrics/metrics.py all require one or the other.

When trying to call the Llama judge model with a free HF_TOKEN, the following error is returned:

 (<class 'openai.BadRequestError'>, BadRequestError("Error code: 400 - {'error': 'Model requires a Pro subscription; check out hf.co/pricing to learn more. Make sure to include your HF token in your query.'}"))

Solution/Feature

I tried to define a new LLM judge using a smaller model:

    # Imports for the snippet (paths as of the lighteval version I am using; they may differ across versions)
    import os

    import numpy as np

    from lighteval.metrics.metrics_sample import JudgeLLM
    from lighteval.metrics.utils import MetricCategory, MetricUseCase, SampleLevelMetricGrouping

    llm_judge_small_model = SampleLevelMetricGrouping(
        metric_name=["judge_score"],
        higher_is_better={"judge_score": True},
        category=MetricCategory.LLM_AS_JUDGE,
        use_case=MetricUseCase.SUMMARIZATION,
        sample_level_fn=JudgeLLM(
            judge_model_name="TinyLlama/TinyLlama_v1.1",
            template_path=os.path.join(os.path.dirname(__file__), "judge_prompts.jsonl"),
            multi_turn=False,
        ).compute,
        corpus_level_fn={
            "judge_score": np.mean,
        },
    )

However, this gave a different error that I have not been able to figure out how to resolve. The error is raised by the OpenAI client even though the intent was to call a TinyLlama model:

INFO:httpx:HTTP Request: POST https://api-inference.huggingface.co/v1/chat/completions "HTTP/1.1 422 Unprocessable Entity"
WARNING:lighteval.logging.hierarchical_logger:    (<class 'openai.UnprocessableEntityError'>, UnprocessableEntityError("Error code: 422 - {'error': 'Template error: template not found', 'error_type': 'template_error'}"))
INFO:httpx:HTTP Request: POST https://api-inference.huggingface.co/v1/chat/completions "HTTP/1.1 422 Unprocessable Entity"
WARNING:lighteval.logging.hierarchical_logger:    (<class 'openai.UnprocessableEntityError'>, UnprocessableEntityError("Error code: 422 - {'error': 'Template error: template not found', 'error_type': 'template_error'}"))
INFO:httpx:HTTP Request: POST https://api-inference.huggingface.co/v1/chat/completions "HTTP/1.1 422 Unprocessable Entity"
WARNING:lighteval.logging.hierarchical_logger:    (<class 'openai.UnprocessableEntityError'>, UnprocessableEntityError("Error code: 422 - {'error': 'Template error: template not found', 'error_type': 'template_error'}"))
WARNING:lighteval.logging.hierarchical_logger:  } [0:00:48.373629]
WARNING:lighteval.logging.hierarchical_logger:} [0:00:56.466097]
Traceback (most recent call last):
  File "/Users/chuandu/Documents/workspace/legal_llm_evaluation/llm_eval_env/bin/lighteval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/__main__.py", line 58, in cli_evaluate
    main_accelerate(args)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/logging/hierarchical_logger.py", line 175, in wrapper
    return fn(*args, **kwargs)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/main_accelerate.py", line 92, in main
    pipeline.evaluate()
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/pipeline.py", line 236, in evaluate
    self._compute_metrics(sample_id_to_responses)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/pipeline.py", line 288, in _compute_metrics
    metrics = compute_metric(results=sample_responses, formatted_doc=doc, metrics=metric_category_metrics)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/__init__.py", line 211, in apply_llm_as_judge_metric
    outputs.update(metric.compute(predictions=predictions, formatted_doc=formatted_doc))
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/utils.py", line 74, in compute
    return self.sample_level_fn(**kwargs)  # result, formatted_doc,
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/metrics_sample.py", line 811, in compute
    scores, messages, judgements = self.judge.evaluate_answer(questions, predictions, ref_answers)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/llm_as_judge.py", line 158, in evaluate_answer
    response = self.__call_api(prompt)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/llm_as_judge.py", line 259, in __call_api
    raise Exception("Failed to get response from the API")
Exception: Failed to get response from the API

Thank you!

clefourrier commented 1 day ago

I suspect this model is not served on the fly by the free tier of the inference endpoints - can you try with Llama 3.1 70B, for example, or Command R+?
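
A minimal sketch of that suggestion, only swapping the judge model used in the metric definition above (the Hub model ids below are illustrative, and their availability on the free Inference API may vary):

    # Sketch: construct the judge against a larger, serverlessly served model.
    # Model ids are examples, not confirmed choices.
    import os

    from lighteval.metrics.metrics_sample import JudgeLLM

    judge = JudgeLLM(
        judge_model_name="meta-llama/Meta-Llama-3.1-70B-Instruct",  # or e.g. "CohereForAI/c4ai-command-r-plus"
        template_path=os.path.join(os.path.dirname(__file__), "judge_prompts.jsonl"),
        multi_turn=False,
    )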

chuandudx commented 20 hours ago

Thank you for the feedback! @JoelNiklaus figured out that the issue is that we should pass use_transformers=True when constructing the judge instance. Do you think it would be helpful to add an example like this in metrics.py or as a note in the README?
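
For reference, a minimal sketch of the working configuration based on this fix, assuming use_transformers is accepted by the JudgeLLM constructor as described above and makes the judge run the model through transformers rather than through the OpenAI-compatible API endpoint:

    import os

    import numpy as np

    from lighteval.metrics.metrics_sample import JudgeLLM
    from lighteval.metrics.utils import MetricCategory, MetricUseCase, SampleLevelMetricGrouping

    llm_judge_small_model = SampleLevelMetricGrouping(
        metric_name=["judge_score"],
        higher_is_better={"judge_score": True},
        category=MetricCategory.LLM_AS_JUDGE,
        use_case=MetricUseCase.SUMMARIZATION,
        sample_level_fn=JudgeLLM(
            judge_model_name="TinyLlama/TinyLlama_v1.1",
            template_path=os.path.join(os.path.dirname(__file__), "judge_prompts.jsonl"),
            multi_turn=False,
            use_transformers=True,  # the fix described above: no OpenAI key or Pro subscription needed
        ).compute,
        corpus_level_fn={"judge_score": np.mean},
    )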