@marco-c FYI since you already started some work on this. I wanted to run an evaluation using our tools at some point.
Hi @eu9ene @marco-c, if you're still trying to benchmark / eval LLMs:
I'm the maintainer of LiteLLM. I believe LiteLLM makes it easier for you to run benchmarks and evaluate LLMs (I'd love your feedback if it does not).
Try it here: https://docs.litellm.ai/docs/simple_proxy https://github.com/BerriAI/litellm
Ollama models

```shell
$ litellm --model ollama/llama2 --api_base http://localhost:11434
```

Hugging Face models

```shell
$ export HUGGINGFACE_API_KEY=my-api-key #[OPTIONAL]
$ litellm --model huggingface/bigcode/starcoder
```

Anthropic

```shell
$ export ANTHROPIC_API_KEY=my-api-key
$ litellm --model claude-instant-1
```

Palm

```shell
$ export PALM_API_KEY=my-palm-key
$ litellm --model palm/chat-bison
```
Then point the OpenAI client at the proxy and run lm-evaluation-harness against it:

```python
openai.api_base = "http://0.0.0.0:8000"
```

```shell
python3 -m lm_eval \
    --model openai-completions \
    --model_args engine=davinci \
    --task crows_pairs_english_age
```
Another area where this could be useful is helping us define cleaning rules for datasets.
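For example (a rough sketch, not a settled approach): ask an LLM through an OpenAI-compatible endpoint, such as the LiteLLM proxy above, whether a sentence pair is a valid translation. The endpoint URL and model name here are placeholders.

```python
# Hypothetical sketch: judge a parallel sentence pair with an LLM via an
# OpenAI-compatible endpoint (e.g. the LiteLLM proxy above).
# The api_base and model name are placeholders, not settled choices.
import openai

openai.api_base = "http://0.0.0.0:8000"
openai.api_key = "not-needed"  # the proxy holds the real provider keys

def keep_pair(src: str, trg: str) -> bool:
    resp = openai.ChatCompletion.create(
        model="claude-instant-1",  # any model the proxy routes to
        messages=[{
            "role": "user",
            "content": "Is the second sentence a correct translation of the "
                       f"first? Answer YES or NO.\n1: {src}\n2: {trg}",
        }],
        temperature=0,
    )
    answer = resp["choices"][0]["message"]["content"]
    return answer.strip().upper().startswith("YES")
```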
See also https://arxiv.org/pdf/2302.14520.pdf.
This one is on my list :)
"A PARADIGM SHIFT IN MACHINE TRANSLATION: BOOSTING TRANSLATION PERFORMANCE OF LARGE LANGUAGE MODELS" is another interesting one
I analyzed the results of https://arxiv.org/pdf/2302.09210.pdf and https://arxiv.org/pdf/2309.11674.pdf and also benchmarked ALMA-13B-LoRA myself. The quality is pretty good and looks on par with the Google API for xx-en and slightly worse for en-xx:
The main problem from a practical perspective is inference cost (for the API) and speed (for the fine-tuned Llama 2 13B). Most of our pipelines (cleaning, decoding etc.) work on hundreds of millions of parallel sentences. The inference speed on 8 x q6000 (24GB) was 13 sec per batch of 10 sentences. I played a bit with the batch size but couldn't get significantly better results.
I guess it's possible to optimize further, but it seems practical to use LLMs only for tasks that are limited to small samples, for example analysis of evaluation results, tuning thresholds for cleaning and similar.
@eu9ene Can you link to the spreadsheet? Seems like something we could share publicly.
Doing some quick estimates based on your numbers.
| Sentence Count | Hours to Translate | Cost @ $1/hr GPU | Cost @ $5/hr GPU |
|---|---|---|---|
| 1,000,000 | 21.37 | $21.37 | $106.84 |
| 10,000,000 | 213.68 | $213.68 | $1,068.38 |
| 100,000,000 | 2,136.75 | $2,136.75 | $10,683.76 |
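These figures back out to a throughput of roughly 13 lines/sec (1,000,000 lines / 21.37 h / 3600 s). A tiny sketch that regenerates the table from a throughput assumption, in case we want to redo it with measured numbers:

```python
# Sketch: rebuild the estimate table from a throughput assumption.
# 13 lines/sec is back-solved from the table above, not a measurement.
LINES_PER_SEC = 13.0

for n in (1_000_000, 10_000_000, 100_000_000):
    hours = n / LINES_PER_SEC / 3600
    print(f"{n:>11,} | {hours:9.2f} h | ${hours * 1:10,.2f} | ${hours * 5:10,.2f}")
```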
Sure, here it is, but that's basically it. I also wanted to benchmark on WMT23 but didn't have time for it.
For the back-of-the-envelope calculations: for this mono task we translated 13,143,105 lines in 197,969 seconds = 66 lines per second. It was performed on 4 x V100, which is roughly equivalent to 8 x q6000 in practice. So at ~1 line/sec, this 13B LLM is almost two orders of magnitude slower, and the gap will only grow if we're able to further optimize inference of the Marian models with CTranslate2.
UPD: I see that I did try to translate WMT23, but it seems the model failed on some examples, so it's not possible to calculate the metrics correctly.
The empty cells for some languages are where the model failed to follow the prompt and translate all the examples, so, as with all other LLM tasks, it's not 100% reliable.
Here's the code I used, if you're interested. It utilizes all 8 GPUs (though not at 100%) and allocates 20GB of memory on each GPU. There are ways to run faster, for example with vLLM, but it won't change things dramatically since the model has 67 times more parameters than our teacher.
```python
import toolz
import torch
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

# Shard the 13B model across all available GPUs in fp16
model = AutoModelForCausalLM.from_pretrained(
    "haoranxu/ALMA-13B-R", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("haoranxu/ALMA-13B-R", padding_side="left")


def translate_batch(texts, from_lang, to_lang):
    # Prompt format expected by ALMA:
    # "Translate this from <src> to <trg>:\n<src>:\n <text>\n<trg>:\n"
    prompts = [
        f"Translate this from {from_lang} to {to_lang}:\n{from_lang}:\n {text}\n{to_lang}:\n"
        for text in texts
    ]
    input_ids = tokenizer(
        prompts, return_tensors="pt", padding=True, max_length=300, truncation=True
    ).input_ids.cuda()
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=input_ids,
            num_beams=5,
            max_new_tokens=300,
            do_sample=True,
            temperature=0.6,
            top_p=0.9,
        )
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    results = []
    for output in outputs:
        # The decoded text repeats the prompt; the translation is the last line
        parts = output.split("\n")
        assert len(parts) == 5, f"Unexpected output format: {output!r}"
        results.append(parts[-1])
    return results


langs = [
    ("ru", "en", "Russian", "English"),
    ("en", "ru", "English", "Russian"),
    ("en", "de", "English", "German"),
    ("de", "en", "German", "English"),
]
# datasets = ['wmt22', 'wmt23']
datasets = ["wmt23"]
model_name = "almar-r"
BATCH_SIZE = 10

for dataset in datasets:
    for from_code, to_code, from_lang, to_lang in langs:
        output_path = f"{from_code}-{to_code}/{dataset}.{from_code}-{to_code}.translations.{model_name}.{to_code}"
        print(f"translating {from_lang} to {to_lang} for {dataset}")
        with open(f"{from_code}-{to_code}/{dataset}.{from_code}-{to_code}.{from_code}") as f:
            lines = [line.strip() for line in f]
        try:
            translations = []
            for batch in tqdm(list(toolz.partition_all(BATCH_SIZE, lines))):
                translations.extend(translate_batch(batch, from_lang, to_lang))
            with open(output_path, "w") as f:
                f.write("\n".join(translations))
        except Exception as ex:
            print(f"Error while translating: {ex}")
```
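For reference, since vLLM came up above as a faster option, the same loop might look roughly like this with it. This is an untested sketch: `tensor_parallel_size=8` assumes the same 8-GPU machine, and it uses plain sampling rather than the beam search above.

```python
# Hypothetical vLLM variant of translate_batch above (untested sketch).
# tensor_parallel_size=8 assumes the same 8-GPU machine; sampling values
# mirror the transformers version, but beam search is omitted.
from vllm import LLM, SamplingParams

llm = LLM(model="haoranxu/ALMA-13B-R", tensor_parallel_size=8, dtype="float16")
params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=300)

def translate_batch_vllm(texts, from_lang, to_lang):
    prompts = [
        f"Translate this from {from_lang} to {to_lang}:\n{from_lang}:\n {text}\n{to_lang}:\n"
        for text in texts
    ]
    outputs = llm.generate(prompts, params)
    # vLLM returns only the completion, so no prompt stripping is needed
    return [o.outputs[0].text.strip() for o in outputs]
```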
If practical, the LLMs might be useful for a variety of tasks:

- As a first step, we can translate test datasets for some languages we support and calculate metrics to understand where the models stand (see the metrics sketch below).
- We should also measure the speed of translation to understand whether it will be practical at all to use LLMs (translating millions of sentences for augmentation might be too slow, for example).

We can look at:

We can also take a look at specialized multilingual models with permissive licences.

There is also this paper with some benchmarks https://arxiv.org/pdf/2302.09210.pdf but it's old and checks only GPT-3.5.
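For the "calculate metrics" step, something as small as this would do (a sketch using sacrebleu; the file names are placeholders, and COMET could be added the same way):

```python
# Sketch: score LLM translations against references with sacrebleu.
# The file paths are placeholders for an actual test set and its output.
import sacrebleu

with open("wmt23.en-de.translations.llm.de") as f:
    hypotheses = [line.strip() for line in f]
with open("wmt23.en-de.de") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```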