huggingface / lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
MIT License

[FT] LLM-as-judge example that doesn't require OPENAI_KEY or an HF Pro subscription #318

Closed chuandudx closed 2 weeks ago

chuandudx commented 1 month ago

Issue encountered

While setting up the framework to evaluate with LLM-as-judge, it would be helpful to be able to test end-to-end without special permissions such as an OPENAI_KEY or an HF Pro subscription. The current judge models in src/lighteval/metrics/metrics.py contain the following options:

When trying to call the Llama judge model with a free HF_TOKEN, the following error is returned:

 (<class 'openai.BadRequestError'>, BadRequestError("Error code: 400 - {'error': 'Model requires a Pro subscription; check out hf.co/pricing to learn more. Make sure to include your HF token in your query.'}"))
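The failing call can be reproduced outside lighteval with the OpenAI client pointed at the serverless Inference API, which is the endpoint the judge hits (see the httpx logs further down). This is only a sketch; the model id is a placeholder for whichever Pro-gated judge model is configured:

    import os

    from openai import OpenAI

    # Same OpenAI-compatible endpoint that lighteval's judge calls; with a free HF_TOKEN,
    # Pro-gated models answer with the 400 "Model requires a Pro subscription" error above.
    client = OpenAI(
        base_url="https://api-inference.huggingface.co/v1/",
        api_key=os.environ["HF_TOKEN"],
    )
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # placeholder for the gated judge model
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=5,
    )
    print(response.choices[0].message.content)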

Solution/Feature

I tried to define a new LLM judge using a smaller model:

    # Imports assumed for the lighteval version shown in the traceback below
    import os

    import numpy as np

    from lighteval.metrics.metrics_sample import JudgeLLM
    from lighteval.metrics.utils import MetricCategory, MetricUseCase, SampleLevelMetricGrouping

    llm_judge_small_model = SampleLevelMetricGrouping(
        metric_name=["judge_score"],
        higher_is_better={"judge_score": True},
        category=MetricCategory.LLM_AS_JUDGE,
        use_case=MetricUseCase.SUMMARIZATION,
        sample_level_fn=JudgeLLM(
            judge_model_name="TinyLlama/TinyLlama_v1.1",  # small model that should not need a Pro subscription
            template_path=os.path.join(os.path.dirname(__file__), "judge_prompts.jsonl"),
            multi_turn=False,
        ).compute,
        corpus_level_fn={
            "judge_score": np.mean,
        },
    )

However, this gave a different error that I have not been able to resolve. The error still relates to the OpenAI API, even though the intent was to call a TinyLlama model.

INFO:httpx:HTTP Request: POST https://api-inference.huggingface.co/v1/chat/completions "HTTP/1.1 422 Unprocessable Entity"
WARNING:lighteval.logging.hierarchical_logger:    (<class 'openai.UnprocessableEntityError'>, UnprocessableEntityError("Error code: 422 - {'error': 'Template error: template not found', 'error_type': 'template_error'}"))
INFO:httpx:HTTP Request: POST https://api-inference.huggingface.co/v1/chat/completions "HTTP/1.1 422 Unprocessable Entity"
WARNING:lighteval.logging.hierarchical_logger:    (<class 'openai.UnprocessableEntityError'>, UnprocessableEntityError("Error code: 422 - {'error': 'Template error: template not found', 'error_type': 'template_error'}"))
INFO:httpx:HTTP Request: POST https://api-inference.huggingface.co/v1/chat/completions "HTTP/1.1 422 Unprocessable Entity"
WARNING:lighteval.logging.hierarchical_logger:    (<class 'openai.UnprocessableEntityError'>, UnprocessableEntityError("Error code: 422 - {'error': 'Template error: template not found', 'error_type': 'template_error'}"))
WARNING:lighteval.logging.hierarchical_logger:  } [0:00:48.373629]
WARNING:lighteval.logging.hierarchical_logger:} [0:00:56.466097]
Traceback (most recent call last):
  File "/Users/chuandu/Documents/workspace/legal_llm_evaluation/llm_eval_env/bin/lighteval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/__main__.py", line 58, in cli_evaluate
    main_accelerate(args)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/logging/hierarchical_logger.py", line 175, in wrapper
    return fn(*args, **kwargs)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/main_accelerate.py", line 92, in main
    pipeline.evaluate()
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/pipeline.py", line 236, in evaluate
    self._compute_metrics(sample_id_to_responses)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/pipeline.py", line 288, in _compute_metrics
    metrics = compute_metric(results=sample_responses, formatted_doc=doc, metrics=metric_category_metrics)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/__init__.py", line 211, in apply_llm_as_judge_metric
    outputs.update(metric.compute(predictions=predictions, formatted_doc=formatted_doc))
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/utils.py", line 74, in compute
    return self.sample_level_fn(**kwargs)  # result, formatted_doc,
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/metrics_sample.py", line 811, in compute
    scores, messages, judgements = self.judge.evaluate_answer(questions, predictions, ref_answers)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/llm_as_judge.py", line 158, in evaluate_answer
    response = self.__call_api(prompt)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/llm_as_judge.py", line 259, in __call_api
    raise Exception("Failed to get response from the API")
Exception: Failed to get response from the API

Thank you!

clefourrier commented 1 month ago

I suspect this model is not provided on the fly by the free tier of Inference Endpoints - can you try with Llama 3.1 70B, for example, or Command R+?
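For reference, a hypothetical variant of the judge from the snippet above using one of the suggested models (the Hub ids below are assumptions, and the rest of the metric definition stays the same):

    judge = JudgeLLM(
        judge_model_name="meta-llama/Meta-Llama-3.1-70B-Instruct",  # or "CohereForAI/c4ai-command-r-plus"
        template_path=os.path.join(os.path.dirname(__file__), "judge_prompts.jsonl"),
        multi_turn=False,
    )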

chuandudx commented 1 month ago

Thank you for the feedback! @JoelNiklaus figured out that it's because we should pass in use_transformers=True when constructing the judge instance. Do you think it would be helpful to add an example like this in metrics.py or as a note in the README?
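A minimal sketch of that fix, assuming the same metric definition as above; the only change is the extra use_transformers flag, which (per the workaround) makes JudgeLLM run the judge with transformers instead of going through the OpenAI-compatible API:

    llm_judge_small_model = SampleLevelMetricGrouping(
        metric_name=["judge_score"],
        higher_is_better={"judge_score": True},
        category=MetricCategory.LLM_AS_JUDGE,
        use_case=MetricUseCase.SUMMARIZATION,
        sample_level_fn=JudgeLLM(
            judge_model_name="TinyLlama/TinyLlama_v1.1",
            template_path=os.path.join(os.path.dirname(__file__), "judge_prompts.jsonl"),
            multi_turn=False,
            use_transformers=True,  # load the judge locally rather than calling the Inference API
        ).compute,
        corpus_level_fn={
            "judge_score": np.mean,
        },
    )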

clefourrier commented 2 weeks ago

Very good idea, please do add a note in the wiki! :hugs: