logikon-ai / cot-eval

A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.
https://huggingface.co/spaces/logikon/open_cot_leaderboard
MIT License

Evaluate: core42/jais-XX #43

Open ggbetz opened 3 months ago

ggbetz commented 3 months ago

For XX in [13b, 13b-chat, 30b-v3, 30b-chat-v3]:

Check upon issue creation:

Parameters:

NEXT_MODEL_PATH=core42/jais-{XX}
NEXT_MODEL_REVISION=main
NEXT_MODEL_PRECISION=bfloat16
MAX_LENGTH=2048
GPU_MEMORY_UTILIZATION=0.7
VLLM_SWAP_SPACE=8

ToDos:

yakazimir commented 2 months ago

This looks like a tricky one. I will look into where the error is coming from:

2024-05-10T22:36:35.181385695Z INFO 05-10 22:36:35 selector.py:16] Using FlashAttention backend.
2024-05-10T22:36:36.885485950Z (RayWorkerVllm pid=7595) INFO 05-10 22:36:36 selector.py:16] Using FlashAttention backend.
2024-05-10T22:36:36.885536750Z (RayWorkerVllm pid=7595) INFO 05-10 22:36:36 pynccl_utils.py:45] vLLM is using nccl==2.18.1
2024-05-10T22:36:36.885543520Z INFO 05-10 22:36:36 pynccl_utils.py:45] vLLM is using nccl==2.18.1
2024-05-10T22:36:41.312005618Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44] Error executing method load_model. This might cause deadlock in distributed execution.
2024-05-10T22:36:41.312037008Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44] Traceback (most recent call last):
2024-05-10T22:36:41.312043168Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/ray_utils.py", line 37, in execute_method
2024-05-10T22:36:41.312049198Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     return executor(*args, **kwargs)
2024-05-10T22:36:41.312054648Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 107, in load_model
2024-05-10T22:36:41.312060448Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.model_runner.load_model()
2024-05-10T22:36:41.312065698Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 95, in load_model
2024-05-10T22:36:41.312071528Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.model = get_model(
2024-05-10T22:36:41.312098688Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader.py", line 91, in get_model
2024-05-10T22:36:41.312104668Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     model = model_class(model_config.hf_config, linear_method)
2024-05-10T22:36:41.312110048Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/jais.py", line 270, in __init__
2024-05-10T22:36:41.312115687Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.transformer = JAISModel(config, linear_method)
2024-05-10T22:36:41.312120977Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/jais.py", line 230, in __init__
2024-05-10T22:36:41.312126507Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.h = nn.ModuleList([
2024-05-10T22:36:41.312132687Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/jais.py", line 231, in <listcomp>
2024-05-10T22:36:41.312138737Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     JAISBlock(config, linear_method)
2024-05-10T22:36:41.312144097Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/jais.py", line 183, in __init__
2024-05-10T22:36:41.312149707Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.mlp = JAISMLP(inner_dim, config, linear_method)
2024-05-10T22:36:41.312155027Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/jais.py", line 137, in __init__
2024-05-10T22:36:41.312160747Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.c_fc = ColumnParallelLinear(
2024-05-10T22:36:41.312165967Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 173, in __init__
2024-05-10T22:36:41.312171587Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     self.output_size_per_partition = divide(output_size, tp_size)
2024-05-10T22:36:41.312176897Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/parallel_utils/utils.py", line 19, in divide
2024-05-10T22:36:41.312182467Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     ensure_divisibility(numerator, denominator)
2024-05-10T22:36:41.312187737Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/parallel_utils/utils.py", line 12, in ensure_divisibility
2024-05-10T22:36:41.312199477Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44]     assert numerator % denominator == 0, "{} is not divisible by {}".format(
2024-05-10T22:36:41.312205177Z (RayWorkerVllm pid=7595) ERROR 05-10 22:36:41 ray_utils.py:44] AssertionError: 13653 is not divisible by 4
2024-05-10T22:36:41.312211187Z (RayWorkerVllm pid=7380) INFO 05-10 22:36:36 selector.py:16] Using FlashAttention backend. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
2024-05-10T22:36:41.313330131Z Traceback (most recent call last):
2024-05-10T22:36:41.313366891Z   File "/usr/local/bin/cot-eval", line 8, in <module>
2024-05-10T22:36:41.313526370Z     sys.exit(main())
2024-05-10T22:36:41.313550730Z   File "/workspace/cot-eval/src/cot_eval/__main__.py", line 149, in main
2024-05-10T22:36:41.313593179Z     llm = VLLM(
2024-05-10T22:36:41.313605389Z   File "/usr/local/lib/python3.10/dist-packages/langchain_core/load/serializable.py", line 120, in __init__
2024-05-10T22:36:41.313659589Z     super().__init__(**kwargs)
2024-05-10T22:36:41.313672039Z   File "/usr/local/lib/python3.10/dist-packages/pydantic/v1/main.py", line 341, in __init__
2024-05-10T22:36:41.313752498Z     raise validation_error
2024-05-10T22:36:41.313823148Z pydantic.v1.error_wrappers.ValidationError: 1 validation error for VLLM
2024-05-10T22:36:41.313833768Z __root__
2024-05-10T22:36:41.313839598Z   13653 is not divisible by 4 (type=assertion_error)
2024-05-10T22:36:43.599831555Z (RayWorkerVllm pid=7380) INFO 05-10 22:36:36 pynccl_utils.py:45] vLLM is using nccl==2.18.1 [repeated 2x across cluster]
2024-05-10T22:36:43.599894725Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44] Error executing method load_model. This might cause deadlock in distributed execution. [repeated 2x across cluster]
2024-05-10T22:36:43.599908335Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44] Traceback (most recent call last): [repeated 2x across cluster]
2024-05-10T22:36:43.599914535Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/ray_utils.py", line 37, in execute_method [repeated 2x across cluster]
2024-05-10T22:36:43.599923215Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     return executor(*args, **kwargs) [repeated 2x across cluster]
2024-05-10T22:36:43.599931985Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 95, in load_model [repeated 4x across cluster]
2024-05-10T22:36:43.599940855Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.model_runner.load_model() [repeated 2x across cluster]
2024-05-10T22:36:43.599977715Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.model = get_model( [repeated 2x across cluster]
2024-05-10T22:36:43.599983425Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader.py", line 91, in get_model [repeated 2x across cluster]
2024-05-10T22:36:43.599992975Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     model = model_class(model_config.hf_config, linear_method) [repeated 2x across cluster]
2024-05-10T22:36:43.599998554Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 173, in __init__ [repeated 10x across cluster]
2024-05-10T22:36:43.600010064Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.transformer = JAISModel(config, linear_method) [repeated 2x across cluster]
2024-05-10T22:36:43.600016424Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.h = nn.ModuleList([ [repeated 2x across cluster]
2024-05-10T22:36:43.600022014Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/jais.py", line 231, in <listcomp> [repeated 2x across cluster]
2024-05-10T22:36:43.600032144Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     JAISBlock(config, linear_method) [repeated 2x across cluster]
2024-05-10T22:36:43.600040194Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.mlp = JAISMLP(inner_dim, config, linear_method) [repeated 2x across cluster]
2024-05-10T22:36:43.600045794Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.c_fc = ColumnParallelLinear( [repeated 2x across cluster]
2024-05-10T22:36:43.600054054Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     self.output_size_per_partition = divide(output_size, tp_size) [repeated 2x across cluster]
2024-05-10T22:36:43.600059674Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/parallel_utils/utils.py", line 19, in divide [repeated 2x across cluster]
2024-05-10T22:36:43.600068564Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     ensure_divisibility(numerator, denominator) [repeated 2x across cluster]
2024-05-10T22:36:43.600076354Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/parallel_utils/utils.py", line 12, in ensure_divisibility [repeated 2x across cluster]
2024-05-10T22:36:43.600085134Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44]     assert numerator % denominator == 0, "{} is not divisible by {}".format( [repeated 2x across cluster]
2024-05-10T22:36:43.600100124Z (RayWorkerVllm pid=7380) ERROR 05-10 22:36:41 ray_utils.py:44] AssertionError: 13653 is not divisible by 4 [repeated 2x across cluster]

ggbetz commented 2 months ago

Yes, it might be tricky: I just loaded core42/jais-13b-chat on a single NVIDIA A100-SXM4-40GB and ran inference with vLLM, which worked fine.
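
For reference, the assertion in the log can be reproduced without loading the model. The following is a minimal sketch, assuming the failing run used a tensor-parallel size of 4 (inferred from the "not divisible by 4" message). `ensure_divisibility` mirrors the check shown in the traceback, while `compatible_tp_sizes` is a hypothetical helper that is not part of vLLM or cot-eval:

```python
def ensure_divisibility(numerator: int, denominator: int) -> None:
    """Same check as the one that fails in the traceback above."""
    assert numerator % denominator == 0, "{} is not divisible by {}".format(
        numerator, denominator
    )


def compatible_tp_sizes(output_size: int, max_tp_size: int) -> list[int]:
    """Hypothetical helper: tensor-parallel sizes that split output_size evenly."""
    return [tp for tp in range(1, max_tp_size + 1) if output_size % tp == 0]


# Value taken from the AssertionError in the log (output size of the MLP c_fc layer).
OUTPUT_SIZE = 13653

# Assumption: the failing run used tp_size=4, as the error message suggests.
try:
    ensure_divisibility(OUTPUT_SIZE, 4)
except AssertionError as err:
    print(err)  # 13653 is not divisible by 4

print(compatible_tp_sizes(OUTPUT_SIZE, 8))  # [1, 3]
```

Since 13653 is odd, this layer cannot be split evenly across 2, 4, or 8 GPUs, which would be consistent with the model loading without problems on a single GPU. Other layer dimensions may impose their own constraints, so this only checks the one size reported in the log.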