@marco-c FYI since you already started some work on this. I wanted to run an evaluation using our tools at some point.
Hi @eu9ene @marco-c, if you're still trying to benchmark / eval LLMs:
I'm the maintainer of LiteLLM. I believe LiteLLM makes it easier for you to run benchmarks and evaluate LLMs (I'd love your feedback if it does not).
Try it here: https://docs.litellm.ai/docs/simple_proxy https://github.com/BerriAI/litellm
Ollama models

```shell
$ litellm --model ollama/llama2 --api_base http://localhost:11434
```

Hugging Face models

```shell
$ export HUGGINGFACE_API_KEY=my-api-key #[OPTIONAL]
$ litellm --model huggingface/bigcode/starcoder
```

Anthropic

```shell
$ export ANTHROPIC_API_KEY=my-api-key
$ litellm --model claude-instant-1
```

Palm

```shell
$ export PALM_API_KEY=my-palm-key
$ litellm --model palm/chat-bison
```
Then point the OpenAI client at the proxy and run lm-evaluation-harness against it:

```python
openai.api_base = "http://0.0.0.0:8000"
```

```shell
python3 -m lm_eval \
    --model openai-completions \
    --model_args engine=davinci \
    --task crows_pairs_english_age
```
Another area where this could be useful is helping us define cleaning rules for datasets.
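For example (a rough sketch, not a settled approach): ask an LLM through an OpenAI-compatible endpoint, such as the LiteLLM proxy above, whether a sentence pair is a valid translation. The endpoint URL and model name here are placeholders.

```python
# Hypothetical sketch: judge a parallel sentence pair with an LLM via an
# OpenAI-compatible endpoint (e.g. the LiteLLM proxy above).
# The api_base and model name are placeholders, not settled choices.
import openai

openai.api_base = "http://0.0.0.0:8000"
openai.api_key = "not-needed"  # the proxy holds the real provider keys

def keep_pair(src: str, trg: str) -> bool:
    resp = openai.ChatCompletion.create(
        model="claude-instant-1",  # any model the proxy routes to
        messages=[{
            "role": "user",
            "content": "Is the second sentence a correct translation of the "
                       f"first? Answer YES or NO.\n1: {src}\n2: {trg}",
        }],
        temperature=0,
    )
    answer = resp["choices"][0]["message"]["content"]
    return answer.strip().upper().startswith("YES")
```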
See also https://arxiv.org/pdf/2302.14520.pdf.
This one is on my list :)
"A PARADIGM SHIFT IN MACHINE TRANSLATION: BOOSTING TRANSLATION PERFORMANCE OF LARGE LANGUAGE MODELS" is another interesting one
I analyzed the results of https://arxiv.org/pdf/2302.09210.pdf and https://arxiv.org/pdf/2309.11674.pdf and also benchmarked ALMA-13B-LoRA myself. The quality is pretty good and looks on par with the Google API for xx-en and slightly worse for en-xx:
The main problem from a practical perspective is inference cost (for the API) and speed (for the fine-tuned Llama 2 13B). Most of our pipelines (cleaning, decoding etc.) work on hundreds of millions of parallel sentences. The inference speed on 8 x q6000 (24GB) was 13 sec per batch of 10 sentences. I played a bit with the batch size but couldn't get significantly better results.
I guess it's possible to optimize further, but it seems practical to use LLMs only for tasks that are limited to small samples, for example analysis of evaluation results, tuning thresholds for cleaning and similar.
@eu9ene Can you link to the spreadsheet? Seems like something we could share publicly.
Doing some quick estimates based on your numbers.
| Sentence Count | Hours to Translate | Cost @ $1/hr GPU | Cost @ $5/hr GPU |
|---|---|---|---|
| 1,000,000 | 21.37 | $21.37 | $106.84 |
| 10,000,000 | 213.68 | $213.68 | $1,068.38 |
| 100,000,000 | 2,136.75 | $2,136.75 | $10,683.76 |
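These figures back out to a throughput of roughly 13 lines/sec (1,000,000 lines / 21.37 h / 3600 s). A tiny sketch that regenerates the table from a throughput assumption, in case we want to redo it with measured numbers:

```python
# Sketch: rebuild the estimate table from a throughput assumption.
# 13 lines/sec is back-solved from the table above, not a measurement.
LINES_PER_SEC = 13.0

for n in (1_000_000, 10_000_000, 100_000_000):
    hours = n / LINES_PER_SEC / 3600
    print(f"{n:>11,} | {hours:9.2f} h | ${hours * 1:10,.2f} | ${hours * 5:10,.2f}")
```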
Sure, here it is, but that's basically it. I also wanted to benchmark on WMT23 but didn't have time for it.
For the back-of-the-envelope calculations: for this mono task we translated 13,143,105 lines in 197,969 seconds = 66 lines per second. It was performed on 4 x V100, which is roughly equivalent to 8 x q6000 in practice. So at ~1 line/sec, this 13B LLM is almost two orders of magnitude slower, and the gap will only grow if we're able to further optimize inference of the Marian models with CTranslate2.
UPD: I see that I did try to translate WMT23, but it seems the model failed on some examples, so it's not possible to calculate the metrics correctly.
The empty cells for some languages are where the model failed to follow the prompt and translate all the examples, so, as with all other LLM tasks, it's not 100% reliable.
Here's the code I used, if you're interested. It utilizes all 8 GPUs (though not at 100%) and allocates 20GB of memory on each GPU. There are ways to run faster, for example with vLLM, but it won't change things dramatically since the model has 67 times more parameters than our teacher.
```python
import toolz
import torch
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

# Shard the 13B model across all available GPUs in fp16
model = AutoModelForCausalLM.from_pretrained(
    "haoranxu/ALMA-13B-R", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("haoranxu/ALMA-13B-R", padding_side="left")


def translate_batch(texts, from_lang, to_lang):
    # Prompt format expected by ALMA:
    # "Translate this from <src> to <trg>:\n<src>:\n <text>\n<trg>:\n"
    prompts = [
        f"Translate this from {from_lang} to {to_lang}:\n{from_lang}:\n {text}\n{to_lang}:\n"
        for text in texts
    ]
    input_ids = tokenizer(
        prompts, return_tensors="pt", padding=True, max_length=300, truncation=True
    ).input_ids.cuda()
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=input_ids,
            num_beams=5,
            max_new_tokens=300,
            do_sample=True,
            temperature=0.6,
            top_p=0.9,
        )
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    results = []
    for output in outputs:
        # The decoded text repeats the prompt; the translation is the last line
        parts = output.split("\n")
        assert len(parts) == 5, f"Unexpected output format: {output!r}"
        results.append(parts[-1])
    return results


langs = [
    ("ru", "en", "Russian", "English"),
    ("en", "ru", "English", "Russian"),
    ("en", "de", "English", "German"),
    ("de", "en", "German", "English"),
]
# datasets = ['wmt22', 'wmt23']
datasets = ["wmt23"]
model_name = "almar-r"
BATCH_SIZE = 10

for dataset in datasets:
    for from_code, to_code, from_lang, to_lang in langs:
        output_path = f"{from_code}-{to_code}/{dataset}.{from_code}-{to_code}.translations.{model_name}.{to_code}"
        print(f"translating {from_lang} to {to_lang} for {dataset}")
        with open(f"{from_code}-{to_code}/{dataset}.{from_code}-{to_code}.{from_code}") as f:
            lines = [line.strip() for line in f]
        try:
            translations = []
            for batch in tqdm(list(toolz.partition_all(BATCH_SIZE, lines))):
                translations.extend(translate_batch(batch, from_lang, to_lang))
            with open(output_path, "w") as f:
                f.write("\n".join(translations))
        except Exception as ex:
            print(f"Error while translating: {ex}")
```
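For reference, since vLLM came up above as a faster option, the same loop might look roughly like this with it. This is an untested sketch: `tensor_parallel_size=8` assumes the same 8-GPU machine, and it uses plain sampling rather than the beam search above.

```python
# Hypothetical vLLM variant of translate_batch above (untested sketch).
# tensor_parallel_size=8 assumes the same 8-GPU machine; sampling values
# mirror the transformers version, but beam search is omitted.
from vllm import LLM, SamplingParams

llm = LLM(model="haoranxu/ALMA-13B-R", tensor_parallel_size=8, dtype="float16")
params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=300)

def translate_batch_vllm(texts, from_lang, to_lang):
    prompts = [
        f"Translate this from {from_lang} to {to_lang}:\n{from_lang}:\n {text}\n{to_lang}:\n"
        for text in texts
    ]
    outputs = llm.generate(prompts, params)
    # vLLM returns only the completion, so no prompt stripping is needed
    return [o.outputs[0].text.strip() for o in outputs]
```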
If practical, the LLMs might be useful for a variety of tasks:

- As a first step, we can translate test datasets for some languages we support and calculate metrics to understand where the models stand (see the metrics sketch below).
- We should also measure the speed of translation to understand whether it will be practical at all to use LLMs (translating millions of sentences for augmentation might be too slow, for example).

We can look at:

We can also take a look at specialized multilingual models with permissive licences.

There is also this paper with some benchmarks https://arxiv.org/pdf/2302.09210.pdf but it's old and checks only GPT-3.5.
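For the "calculate metrics" step, something as small as this would do (a sketch using sacrebleu; the file names are placeholders, and COMET could be added the same way):

```python
# Sketch: score LLM translations against references with sacrebleu.
# The file paths are placeholders for an actual test set and its output.
import sacrebleu

with open("wmt23.en-de.translations.llm.de") as f:
    hypotheses = [line.strip() for line in f]
with open("wmt23.en-de.de") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```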