huggingface / lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
MIT License

Serbian LLM Benchmark Task #340

Closed · DeanChugall closed this 1 month ago

DeanChugall commented 1 month ago

Serbian LLM Benchmark Task Configuration and Prompt Functions

Summary:

This pull request introduces task configurations and prompt functions for evaluating LLMs on various Serbian datasets. The module includes tasks for:

ARC (Easy and Challenge), BoolQ, HellaSwag, OpenBookQA, PIQA, Winogrande, and a custom OZ Eval dataset.

The tasks are defined using the `LightevalTaskConfig` class, and prompt generation is streamlined through a reusable `serbian_eval_prompt` function.

Changes:

  1. Task Configurations:

    • Configurations for ARC (Easy and Challenge), BoolQ, HellaSwag, OpenBookQA, PIQA, Winogrande, and OZ Eval tasks using `LightevalTaskConfig`.
    • Enum class `HFSubsets` added for dataset subset management, improving code maintainability and clarity.
    • `create_task_config` function allows dynamic task creation with dependency injection for flexibility in dataset and metric selection (illustrative sketches of these pieces follow this list).
  2. Prompt Functions:

    • The `serbian_eval_prompt` function creates a structured multiple-choice prompt in Serbian (see the first sketch after this list).
    • The function supports dynamic query and choice generation with configurable tasks.
  3. Logging:

    • A `hello_message` banner is printed upon task initialization, listing all available tasks.
    • Task names are dynamically generated and printed using `hlog_warn` (see the last sketch after this list).
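
To make the prompt format concrete, here is a minimal sketch of what a `serbian_eval_prompt` along these lines could look like. The `Doc` return type follows lighteval's community-task convention, but the row field names (`"query"`, `"choices"`, `"answer"`) are illustrative assumptions, not the PR's actual schema:

```python
from lighteval.tasks.requests import Doc


def serbian_eval_prompt(line: dict, task_name: str = "") -> Doc:
    """Turn one dataset row into a Serbian multiple-choice Doc.

    Assumed row shape: {"query": str, "choices": list[str], "answer": int};
    the real column names depend on each underlying dataset.
    """
    letters = "ABCD"[: len(line["choices"])]
    query = f"Pitanje: {line['query']}\n"  # "Question: ..."
    for letter, choice in zip(letters, line["choices"]):
        query += f"{letter}. {choice}\n"
    query += "Odgovor:"  # "Answer:"
    return Doc(
        task_name=task_name,
        query=query,
        choices=[f" {letter}" for letter in letters],  # score letter continuations
        gold_index=int(line["answer"]),
    )
```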
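
Similarly, a rough sketch of how the `HFSubsets` enum and the `create_task_config` helper could fit together. The `LightevalTaskConfig` fields follow the community-task template, but exact field names vary between lighteval versions, and the repo id, subset values, and splits here are placeholders rather than the PR's real values:

```python
from enum import Enum

from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig


class HFSubsets(str, Enum):
    """Hypothetical mapping from task names to HF dataset subsets."""

    ARC_EASY = "arc_easy"
    ARC_CHALLENGE = "arc_challenge"
    BOOLQ = "boolq"


def create_task_config(task_name, hf_repo, subset, prompt_function, metric):
    """Build one task config; dataset, prompt, and metric are injected."""
    return LightevalTaskConfig(
        name=task_name,
        prompt_function=prompt_function,
        suite=["community"],
        hf_repo=hf_repo,
        hf_subset=subset.value,
        hf_avail_splits=["train", "test"],
        evaluation_splits=["test"],
        metric=[metric],
    )


# Example wiring (hypothetical repo id; serbian_eval_prompt from the sketch above):
arc_easy = create_task_config(
    task_name="serbian_evals:arc_easy",
    hf_repo="your-org/serbian-llm-benchmark",
    subset=HFSubsets.ARC_EASY,
    prompt_function=serbian_eval_prompt,
    metric=Metrics.loglikelihood_acc,
)
```

Injecting the dataset, prompt function, and metric this way lets one helper build every task/subset pair instead of repeating the full config per task.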
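
Finally, a small sketch of how the startup banner could be produced, assuming lighteval's `hlog_warn` helper (which the logging item above references); the actual wording and formatting in the PR may differ:

```python
from lighteval.logging.hierarchical_logger import hlog_warn


def print_hello_message(task_configs):
    """Print a banner listing every registered Serbian benchmark task."""
    hlog_warn("=" * 60)
    hlog_warn("Serbian LLM Benchmark - available tasks:")
    for config in task_configs:
        hlog_warn(f"  community|{config.name}")
    hlog_warn("=" * 60)
```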


DeanChugall commented 1 month ago

Fixed `ruff format --check .` for CI.

DeanChugall commented 1 month ago

It would be great if we used `pre-commit run`, but when it runs, some files do not satisfy the criteria, and I don't want to mess with those files. The files affected by `pre-commit run` are shown in the screenshots below.

[Screenshots from 2024-10-07 showing the files flagged by pre-commit run]

NathanHB commented 1 month ago

Mhh, this should not happen. Are you sure you are running the correct versions?

DeanChugall commented 1 month ago

> Mhh, this should not happen. Are you sure you are running the correct versions?

Absolutely. Try checking at least one of those files manually, e.g. `evaluation-task-request.md`:

https://raw.githubusercontent.com/huggingface/lighteval/refs/heads/main/.github/ISSUE_TEMPLATE/evaluation-task-request.md

NathanHB commented 1 month ago

Let's just wait for the quality check and see if we can merge.