logikon-ai / cot-eval

A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.
https://huggingface.co/spaces/logikon/open_cot_leaderboard
MIT License

Together.ai support #2

Open yakazimir opened 6 months ago

yakazimir commented 6 months ago

@ggbetz Adding this so we don't forget.

This includes

ggbetz commented 6 months ago

Thx @yakazimir!

I've started some tests with lm-eval harness v0.4.1 and together ai.

| # | Task | Model | Interface | Passed |
|---|------|-------|-----------|--------|
| 1 | gsm8k | google/gemma-2b | openai-chat-completions | ✅ |
| 2 | logiqa2_base | google/gemma-2b | openai-chat-completions | ❌ (expected) |
| 3 | gsm8k | google/gemma-2b | openai-completions | ❌ |

1

```shell
lm_eval --model openai-chat-completions \
  --tasks gsm8k \
  --model_args model=google/gemma-2b,base_url=https://api.together.xyz \
  --limit 10
```
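(Setup note for all three runs: I'm assuming the harness's `openai-*` backends use the OpenAI Python client, which reads the API key from the `OPENAI_API_KEY` environment variable; with `base_url` pointed at Together, the Together key goes in that same variable. The key below is a placeholder.)

```shell
# Assumption: the openai-* model types read the key from OPENAI_API_KEY,
# so the Together key must be exported under that name. Placeholder value:
export OPENAI_API_KEY="together-api-key-placeholder"
```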

2

```shell
lm_eval --model openai-chat-completions \
  --tasks logiqa2_base \
  --model_args model=google/gemma-2b,base_url=https://api.together.xyz \
  --limit 10 \
  --include_path ./tasks
```

fails -- as expected -- with

```
...
2024-02-29:08:57:04,412 INFO     [task.py:363] Building contexts for task on rank 0...
2024-02-29:08:57:04,420 INFO     [evaluator.py:324] Running loglikelihood requests
Traceback (most recent call last):
  ...
  File "/content/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 499, in loglikelihood
    raise NotImplementedError("No support for logits.")
NotImplementedError: No support for logits.
```

3

```shell
lm_eval --model openai-completions \
  --tasks gsm8k \
  --model_args model=google/gemma-2b,tokenizer_backend=huggingface,tokenizer=google/gemma-2b,base_url=https://api.together.xyz/v1 \
  --limit 10
```

fails with:

```
2024-02-29:09:15:43,811 INFO     [_base_client.py:952] Retrying request to /completions in 0.788895 seconds
2024-02-29:09:15:44,778 INFO     [_client.py:1026] HTTP Request: POST https://api.together.xyz/v1/completions "HTTP/1.1 500 Internal Server Error"
2024-02-29:09:15:44,778 INFO     [_base_client.py:952] Retrying request to /completions in 1.621023 seconds
2024-02-29:09:15:46,576 INFO     [_client.py:1026] HTTP Request: POST https://api.together.xyz/v1/completions "HTTP/1.1 500 Internal Server Error"
Traceback (most recent call last):
  File "/content/lm-evaluation-harness/lm_eval/utils.py", line 762, in wrapper
    return func(*args, **kwargs)
  File "/content/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 70, in completion
    return client.completions.create(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/openai/_utils/_utils.py", line 303, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/openai/resources/completions.py", line 559, in create
    return self._post(
  File "/usr/local/lib/python3.10/dist-packages/openai/_base_client.py", line 1088, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
  File "/usr/local/lib/python3.10/dist-packages/openai/_base_client.py", line 853, in request
    return self._request(
  File "/usr/local/lib/python3.10/dist-packages/openai/_base_client.py", line 916, in _request
    return self._retry_request(
  File "/usr/local/lib/python3.10/dist-packages/openai/_base_client.py", line 958, in _retry_request
    return self._request(
  File "/usr/local/lib/python3.10/dist-packages/openai/_base_client.py", line 916, in _request
    return self._retry_request(
  File "/usr/local/lib/python3.10/dist-packages/openai/_base_client.py", line 958, in _retry_request
    return self._request(
  File "/usr/local/lib/python3.10/dist-packages/openai/_base_client.py", line 930, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.InternalServerError: Error code: 500 - {'error': {'message': 'Request failed with status code 422', 'type': 'server_error', 'param': None, 'code': None}}
```
ggbetz commented 6 months ago

OK, so together.ai state here explicitly that they support chat models and, by implication, do not support the OpenAI library's completions endpoint. But with chat models we don't get logits, so the harness's multiple-choice evaluation doesn't work.
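To make the blocker concrete: for multiple-choice tasks like logiqa2 the harness compares the summed log-probabilities of each candidate answer given the context, which requires per-token logprobs from a completions-style endpoint; a chat endpoint only returns sampled text. A minimal, self-contained sketch of that scoring step (the per-token logprobs below are fabricated for illustration, not API output):

```python
def score_choices(choice_token_logprobs):
    """Pick the answer whose continuation tokens have the highest summed
    log-probability. This is the loglikelihood comparison the harness
    performs -- and exactly the quantity a chat completions API withholds."""
    totals = {c: sum(lps) for c, lps in choice_token_logprobs.items()}
    return max(totals, key=totals.get), totals

# Fabricated per-token logprobs for two answer options.
fake = {
    "A": [-0.2, -1.1],   # sum ~ -1.3 (more likely continuation)
    "B": [-0.9, -2.5],   # sum ~ -3.4
}
best, totals = score_choices(fake)
print(best)  # -> A
```

Generative tasks like gsm8k don't need this, which is why run 1 works over the chat interface while run 2 cannot.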

Solutions: