ggbetz closed this issue 4 days ago
I got the following error:
```
2024-05-07T20:37:06.053752596Z
Processed prompts: 0%| | 0/2 [00:00<?, ?it/s]
Processed prompts: 50%|█████ | 1/2 [00:01<00:01, 1.07s/it]
Processed prompts: 100%|██████████| 2/2 [00:01<00:00, 1.83it/s]
2024-05-07T20:37:06.057030982Z 2024-05-07 20:37:06,056 - root - INFO - Tested COT chain: ['\nThe passage states that Peter fell from a tree. However, it does not provide any information on whether Peter was injured or not during the fall. Without further information, we cannot determine if Peter is injured.\n', '\nThe passage states that "Peter likes math." However, it does not provide any information about whether Peter likes Punk or not. Therefore, based on the given information, we cannot determine if Peter likes Punk.\n']
2024-05-07T20:37:06.057064485Z 2024-05-07 20:37:06,056 - root - INFO - Running COT chain HandsOn on logiqa
2024-05-07T20:37:07.616756906Z
Map: 0%| | 0/626 [00:00<?, ? examples/s]
2024-05-07T20:37:10.233665539Z
Processed prompts: 0%| | 0/626 [00:00<?, ?it/s][A
Map: 0%| | 0/626 [00:04<?, ? examples/s]
Map: 0%| | 0/626 [00:04<?, ? examples/s]
2024-05-07T20:37:10.236994772Z Traceback (most recent call last):
2024-05-07T20:37:10.237021344Z File "/usr/local/bin/cot-eval", line 8, in <module>
2024-05-07T20:37:10.237028973Z sys.exit(main())
2024-05-07T20:37:10.237034886Z File "/workspace/cot-eval/src/cot_eval/__main__.py", line 171, in main
2024-05-07T20:37:10.237040543Z cot_data[task] = run_chain_on_task(task_data[task], chain)
2024-05-07T20:37:10.237065952Z File "/workspace/cot-eval/src/cot_eval/__main__.py", line 105, in run_chain_on_task
2024-05-07T20:37:10.237077100Z task_ds = task_ds.map(add_reasoning, batched=True, batch_size=2048, load_from_cache_file=False)
2024-05-07T20:37:10.237082837Z File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 593, in wrapper
2024-05-07T20:37:10.237113611Z out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
2024-05-07T20:37:10.237142733Z File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 558, in wrapper
2024-05-07T20:37:10.237182783Z out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
2024-05-07T20:37:10.237192437Z File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3105, in map
2024-05-07T20:37:10.237520385Z for rank, done, content in Dataset._map_single(**dataset_kwargs):
2024-05-07T20:37:10.237529919Z File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3482, in _map_single
2024-05-07T20:37:10.237868272Z batch = apply_function_on_filtered_inputs(
2024-05-07T20:37:10.237876900Z File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3361, in apply_function_on_filtered_inputs
2024-05-07T20:37:10.238209430Z processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
2024-05-07T20:37:10.238222230Z File "/workspace/cot-eval/src/cot_eval/__main__.py", line 102, in add_reasoning
2024-05-07T20:37:10.238229304Z reasoning_traces = chain.batch(input_batch)
2024-05-07T20:37:10.238234475Z File "/usr/local/lib/python3.10/dist-packages/langchain_core/runnables/base.py", line 2643, in batch
2024-05-07T20:37:10.238489333Z inputs = step.batch(
2024-05-07T20:37:10.238497886Z File "/usr/local/lib/python3.10/dist-packages/langchain_core/runnables/base.py", line 4550, in batch
2024-05-07T20:37:10.238905846Z return self.bound.batch(
2024-05-07T20:37:10.238914972Z File "/usr/local/lib/python3.10/dist-packages/langchain_core/language_models/llms.py", line 340, in batch
2024-05-07T20:37:10.238946108Z raise e
2024-05-07T20:37:10.238954331Z File "/usr/local/lib/python3.10/dist-packages/langchain_core/language_models/llms.py", line 327, in batch
2024-05-07T20:37:10.239000953Z llm_result = self.generate_prompt(
2024-05-07T20:37:10.239011126Z File "/usr/local/lib/python3.10/dist-packages/langchain_core/language_models/llms.py", line 633, in generate_prompt
2024-05-07T20:37:10.239057588Z return self.generate(prompt_strings, stop=stop, callbacks=callbacks, **kwargs)
2024-05-07T20:37:10.239066194Z File "/usr/local/lib/python3.10/dist-packages/langchain_core/language_models/llms.py", line 803, in generate
2024-05-07T20:37:10.239139666Z output = self._generate_helper(
2024-05-07T20:37:10.239148888Z File "/usr/local/lib/python3.10/dist-packages/langchain_core/language_models/llms.py", line 670, in _generate_helper
2024-05-07T20:37:10.239205341Z raise e
2024-05-07T20:37:10.239221190Z File "/usr/local/lib/python3.10/dist-packages/langchain_core/language_models/llms.py", line 657, in _generate_helper
2024-05-07T20:37:10.239277352Z self._generate(
2024-05-07T20:37:10.239286795Z File "/usr/local/lib/python3.10/dist-packages/langchain_community/llms/vllm.py", line 132, in _generate
2024-05-07T20:37:10.239306610Z outputs = self.client.generate(prompts, sampling_params)
2024-05-07T20:37:10.239315918Z File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 190, in generate
2024-05-07T20:37:10.239343003Z return self._run_engine(use_tqdm)
2024-05-07T20:37:10.239353506Z File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 218, in _run_engine
2024-05-07T20:37:10.239396518Z step_outputs = self.llm_engine.step()
2024-05-07T20:37:10.239421778Z File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 683, in step
2024-05-07T20:37:10.239461386Z return self._process_model_outputs(output, scheduler_outputs)
2024-05-07T20:37:10.239470419Z File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 601, in _process_model_outputs
2024-05-07T20:37:10.239530535Z self._process_sequence_group_outputs(seq_group, outputs)
2024-05-07T20:37:10.239539869Z File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 468, in _process_sequence_group_outputs
2024-05-07T20:37:10.239582441Z self.detokenizer.decode_sequence_inplace(seq,
2024-05-07T20:37:10.239591320Z File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/detokenizer.py", line 113, in decode_sequence_inplace
2024-05-07T20:37:10.239613415Z read_offset) = detokenize_incrementally(
2024-05-07T20:37:10.239622530Z File "/usr/local/lib/python3.10/dist-packages/vllm/transformers_utils/detokenizer.py", line 281, in detokenize_incrementally
2024-05-07T20:37:10.239662314Z new_text = tokenizer.convert_tokens_to_string(
2024-05-07T20:37:10.239671226Z File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast.py", line 612, in convert_tokens_to_string
2024-05-07T20:37:10.239802448Z return self.backend_tokenizer.decoder.decode(tokens)
2024-05-07T20:37:10.239811143Z TypeError: argument 'tokens': 'NoneType' object cannot be converted to 'PyString'
```
OK, this looks like a known model/tokenizer config bug that has been seen with vLLM before: https://github.com/vllm-project/vllm/issues/516. The detokenizer receives a `None` token and the fast tokenizer's Rust decoder rejects it. Let's shelve this model, then.
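For reference, here is a minimal sketch of that failure mode, using a toy vocabulary (all names here are hypothetical, not the real tokenizer API): when the model's declared `vocab_size` exceeds the tokenizer's actual vocabulary, a generated token id can map to `None`, and the strict string decoder then fails exactly as in the log above.

```python
# Toy vocabulary standing in for the tokenizer's id->token table (assumption).
VOCAB = {0: "<s>", 1: "Hello", 2: "world"}

def convert_ids_to_tokens(ids):
    # Mirrors the lenient lookup step: an id outside the vocabulary
    # maps to None rather than raising immediately.
    return [VOCAB.get(i) for i in ids]

def convert_tokens_to_string(tokens):
    # Mirrors the strict decode step: every token must be a string,
    # so a None slipped in upstream surfaces as a TypeError here.
    for t in tokens:
        if not isinstance(t, str):
            raise TypeError(
                f"argument 'tokens': '{type(t).__name__}' object "
                "cannot be converted to 'PyString'"
            )
    return " ".join(tokens)

# id 7 is outside the toy vocab, so decoding fails like in the traceback:
tokens = convert_ids_to_tokens([1, 2, 7])
print(tokens)  # [ 'Hello', 'world', None ]
try:
    convert_tokens_to_string(tokens)
except TypeError as e:
    print("decode failed:", e)
```

The point is that the `None` is introduced silently at id-to-token lookup and only blows up later inside the decoder, which is why the traceback points at `convert_tokens_to_string` rather than at the model or its config.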