wandb / llm-leaderboard

A project for LLM evaluation on Japanese tasks.

Nejumi Leaderboard 3

Overview

This repository is for the Nejumi Leaderboard 3, a comprehensive evaluation platform for large language models. The leaderboard assesses both general language capabilities and alignment aspects. For detailed information about the leaderboard, please visit the Nejumi Leaderboard website.

Evaluation Metrics

Our evaluation framework incorporates a diverse set of metrics to provide a holistic assessment of model performance:

| Main Category | Subcategory | Automated Evaluation with Correct Data | AI Evaluation | Note |
| --- | --- | --- | --- | --- |
| General Language Processing | Expression | | MT-bench/roleplay (0shot)<br>MT-bench/humanities (0shot)<br>MT-bench/writing (0shot) | |
| ^ | Translation | ALT e-to-j (jaster) (0shot, 2shot)<br>ALT j-to-e (jaster) (0shot, 2shot)<br>wikicorpus-e-to-j (jaster) (0shot, 2shot)<br>wikicorpus-j-to-e (jaster) (0shot, 2shot) | | |
| ^ | Summarization | | | |
| ^ | Information Extraction | JSQuaD (jaster) (0shot, 2shot) | | |
| ^ | Reasoning | | MT-bench/reasoning (0shot) | |
| ^ | Mathematical Reasoning | MAWPS (jaster) (0shot, 2shot)<br>MGSM (jaster) (0shot, 2shot) | MT-bench/math (0shot) | |
| ^ | (Entity) Extraction | wiki_ner (jaster) (0shot, 2shot)<br>wiki_coreference (jaster) (0shot, 2shot)<br>chABSA (jaster) (0shot, 2shot) | MT-bench/extraction (0shot) | |
| ^ | Knowledge / Question Answering | JCommonsenseQA (jaster) (0shot, 2shot)<br>JEMHopQA (jaster) (0shot, 2shot)<br>JMMLU (0shot, 2shot)<br>NIILC (jaster) (0shot, 2shot)<br>aio* (jaster) (0shot, 2shot) | MT-bench/stem (0shot) | |
| ^ | English | MMLU_en (0shot, 2shot) | | |
| ^ | Semantic Analysis | JNLI (jaster) (0shot, 2shot)<br>JaNLI (jaster) (0shot, 2shot)<br>JSeM (jaster) (0shot, 2shot)<br>JSICK (jaster) (0shot, 2shot)<br>Jamp* (jaster) (0shot, 2shot) | | |
| ^ | Syntactic Analysis | JCoLA-in-domain (jaster) (0shot, 2shot)<br>JCoLA-out-of-domain (jaster) (0shot, 2shot)<br>JBLiMP (jaster) (0shot, 2shot)<br>wiki_reading (jaster) (0shot, 2shot)<br>wiki_pas (jaster) (0shot, 2shot)<br>wiki_dependency (jaster) (0shot, 2shot) | | |
| Alignment | Controllability | jaster* (0shot, 2shot)<br>LCTG | | LCTG cannot be used for business purposes. Usage for research and use of the results in a press release are acceptable. |
| ^ | Ethics/Moral | JCommonsenseMorality* (2shot) | | |
| ^ | Toxicity | | LINE Yahoo Reliability Evaluation Benchmark | This dataset is not publicly available due to its sensitive content. |
| ^ | Bias | JBBQ (2shot) | | JBBQ needs to be downloaded from the JBBQ GitHub repository. |
| ^ | Truthfulness | | JTruthfulQA | For JTruthfulQA evaluation, nlp-waseda/roberta_jtruthfulqa requires Juman++ to be installed beforehand. You can install it by running the script/install_jumanpp.sh script. |
| ^ | Robustness | Test multiple patterns against JMMLU (W&B original) (0shot, 2shot):<br>- Standard method<br>- Choices are symbols<br>- Select anything but the correct answer | | |

Implementation Guide

Environment Setup

  1. Set up environment variables

    export WANDB_API_KEY=<your WANDB_API_KEY>
    export OPENAI_API_KEY=<your OPENAI_API_KEY>
    export LANG=ja_JP.UTF-8
    # If using Azure OpenAI instead of standard OpenAI
    export AZURE_OPENAI_ENDPOINT=<your AZURE_OPENAI_ENDPOINT>
    export AZURE_OPENAI_API_KEY=<your AZURE_OPENAI_API_KEY>
    export OPENAI_API_TYPE=azure
    # if needed, set the following API keys too
    export ANTHROPIC_API_KEY=<your ANTHROPIC_API_KEY>
    export GOOGLE_API_KEY=<your GOOGLE_API_KEY>
    export COHERE_API_KEY=<your COHERE_API_KEY>
    export MISTRAL_API_KEY=<your MISTRAL_API_KEY>
    export AWS_ACCESS_KEY_ID=<your AWS_ACCESS_KEY_ID>
    export AWS_SECRET_ACCESS_KEY=<your AWS_SECRET_ACCESS_KEY>
    export AWS_DEFAULT_REGION=<your AWS_DEFAULT_REGION>
    export UPSTAGE_API_KEY=<your UPSTAGE_API_KEY>
    # if needed, log in to Hugging Face
    huggingface-cli login
  2. Clone the repository

    git clone https://github.com/wandb/llm-leaderboard.git
    cd llm-leaderboard
  3. Set up a Python environment with requirements.txt
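
    For example (one possible setup; any Python environment manager works):

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt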

Dataset Preparation

For detailed instructions on dataset preparation and related caveats, please refer to scripts/data_uploader/README.md.

Nejumi Leaderboard 3 uses the following datasets.

Please be sure to thoroughly review the terms of use of each dataset before using it.

  1. jaster (Apache-2.0 license)
  2. MT-Bench-JA (Apache-2.0 license)
  3. LCTG (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permission to use it for the leaderboard was received from AI Shift.)
  4. JBBQ (Creative Commons Attribution 4.0 International License)
  5. LINE Yahoo Inappropriate Speech Evaluation Dataset (not publicly available)
  6. JTruthfulQA (Creative Commons Attribution 4.0 International License)

Configuration

Base configuration

The base_config.yaml file contains basic settings, and you can create a separate YAML file for model-specific settings. This allows for easy customization of settings for each model while maintaining a consistent base configuration.

Refer to base_config.yaml itself for the full set of variables it defines.
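
As a rough illustration of this two-level setup, the sketch below shows the kind of shared settings a base file carries. The key names are hypothetical placeholders; the authoritative variable names are the ones actually defined in base_config.yaml.

    # base_config.yaml -- illustrative sketch only; real key names may differ
    wandb:
      entity: your-entity            # W&B entity that receives the results
      project: nejumi-leaderboard3   # W&B project used for logging
    testmode: false                  # hypothetical flag: evaluate a small subset for debugging
    num_few_shots: 2                 # default number of few-shot examples

The model-specific files under configs/ (described next) then carry the settings that differ per model.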

Model configuration

After setting up the base configuration file, the next step is to set up a configuration file for each model under configs/.

API Model Configurations

This framework supports evaluating models using APIs such as OpenAI, Anthropic, Google, and Cohere. You need to create a separate config file for each API model. For example, the config file for OpenAI's gpt-4o-2024-05-13 would be named configs/config-gpt-4o-2024-05-13.yaml.
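
As a hedged sketch (treat the key names as placeholders; the real keys are whatever base_config.yaml defines), such a file mainly identifies the provider and the model:

    # configs/config-gpt-4o-2024-05-13.yaml -- illustrative placeholder keys
    api: openai                                   # assumed provider selector
    model:
      pretrained_model_name_or_path: gpt-4o-2024-05-13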

Other Model Configurations

This framework also supports evaluating models using vLLM. You need to create a separate config file for each vLLM model. For example, the config file for Microsoft's Phi-3-medium-128k-instruct would be named configs/config-Phi-3-medium-128k-instruct.yaml.
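
Again as a hedged sketch with placeholder key names, a vLLM-served model would instead point at its Hugging Face model ID and a chat template:

    # configs/config-Phi-3-medium-128k-instruct.yaml -- illustrative placeholder keys
    api: vllm                                     # assumed switch for local serving via vLLM
    model:
      pretrained_model_name_or_path: microsoft/Phi-3-medium-128k-instruct
      chat_template: chat_templates/Phi-3-medium-128k-instruct.jinja   # see the next section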

Create a chat template (needed for all models except API models)

  1. Create chat_templates/<model_id>.jinja. If a chat_template is specified in the tokenizer_config.json of the model being evaluated, create a .jinja file with that template. If no chat_template is specified in tokenizer_config.json, refer to the model card or other relevant documentation to create a chat_template and record it in a .jinja file. (An illustrative template is shown after this list.)

  2. Test the chat template. If you want to check the output of a chat template, you can use the following script:

    python3 scripts/test_chat_template.py -m <model_id> -c <chat_template>

    If the chat template name is the same as the model ID, you can omit -c.
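
As a concrete illustration of what such a .jinja file contains, the line below is a generic ChatML-style chat template. It is only an example: the roles and special tokens must match what the model being evaluated actually expects, as documented in its tokenizer_config.json or model card.

    {% for message in messages %}{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}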

Evaluation Execution

Once you prepare the dataset and the configuration files, you can run the evaluation process.

You can use either the -c or the -s option.
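
For example, assuming the evaluation entry point is scripts/run_eval.py (check the repository for the actual script name), a run pinned to a specific config file looks like this:

    # -c: run the evaluation with a specific config file from configs/
    python3 scripts/run_eval.py -c config-gpt-4o-2024-05-13.yaml

The -s option is the alternative way of selecting a config; the script's help output describes its exact behavior.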

The results of the evaluation will be logged to the specified W&B project.

When you want to edit runs or add additional evaluation metrics

Please refer to blend_run_configs/README.md.

Contributing

Contributions to this repository are welcome. Please submit your suggestions via pull requests. Please note that we may not accept all pull requests.

License

This repository is available for commercial use. However, please adhere to the respective rights and licenses of each evaluation dataset used.

Contact

For questions or support, please contact contact-jp@wandb.com.