hsiehjackson / RULER

This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?
Apache License 2.0

How to test models with context lengths larger than 128K? #14

Open yaswanth-iitkgp opened 1 month ago

yaswanth-iitkgp commented 1 month ago

Hi @hsiehjackson,

I tried using your repo to test HF models like gradientai/Llama-3-8B-Instruct-Gradient-1048k, but I couldn't load the entire model on a single A100 GPU. I would like to use the accelerate library (or anything else) to load the model for experiments beyond 32K (currently I can only test up to 32K on my GPU). I would love to hear how we can achieve this.
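
For reference, what I had in mind for loading the model is something like this (just a sketch with transformers + accelerate; device_map="auto" and the flash-attention option are assumptions on my side, not necessarily what RULER's hf mode does):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gradientai/Llama-3-8B-Instruct-Gradient-1048k"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" lets accelerate place (and offload) layers across the
# available GPU/CPU memory instead of putting everything on one device.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # assumption: flash-attn is installed
)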

hsiehjackson commented 1 month ago

Have you tried running with vLLM?
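
For example, something along these lines (a minimal sketch using vLLM's offline Python API rather than RULER's serve_vllm.py wrapper; the 64K cap and memory fraction are assumptions you can tune). Capping max_model_len at the length you actually test keeps vLLM from reserving KV-cache and profiling memory for the model's full declared window:

from vllm import LLM, SamplingParams

# Cap the declared context at the length being evaluated so the KV cache
# and the memory-profiling pass are sized for 64K, not the full 1M window.
llm = LLM(
    model="gradientai/Llama-3-8B-Instruct-Gradient-1048k",
    dtype="bfloat16",
    trust_remote_code=True,
    max_model_len=65536,          # assumption: testing up to 64K
    gpu_memory_utilization=0.95,  # assumption: leave a small margin on the 80 GB A100
)

outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)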

yaswanth-iitkgp commented 1 month ago

Yes, but I was unable to use vLLM with that particular HF model. Any suggestions on how we can load it with vLLM, at least?

hsiehjackson commented 1 month ago

You mean this one: https://huggingface.co/gradientai/Llama-3-8B-Instruct-Gradient-1048k? I think if you can load Llama-3-8B-Instruct, then it should be possible to load this 1M model. Do you see any errors when loading it with vLLM?

yaswanth-iitkgp commented 1 month ago

Yes, I meant that one. The problem is that when I use hf mode to load any version of Llama-3-8B, I can only run tests up to 32K context; anything beyond that is not possible on a single A100 GPU (80 GB). But when I use vllm mode, I cannot use it with Llama-3 at even 1K context length. I am attaching the logs in output_llama3_vllm.log. I can use vllm mode for other models like https://huggingface.co/THUDM/chatglm3-6b-128K.

hsiehjackson commented 1 month ago

From your log, it looks like you plan to evaluate Llama-3-8B with a 16K sequence length, and you can get the final prediction files. What errors do you see when you test beyond 1K context length?

yaswanth-iitkgp commented 1 month ago

I really appreciate your efforts to help solve this problem, thanks a lot.

Yes, but the scores are 0 and the nulls are 496/500. The error I was facing in vllm mode is in the initial part of the log file I attached earlier. To make things clear, I checked again and realized that vLLM was giving this error initially even for smaller contexts, and I used to stop the run partway through. The error in that log file also happens randomly: I was able to run experiments at 16K context length, but I cannot do it for 8K (I tried multiple times), so I really want to find the cause of that error too. The error persists for contexts beyond 32K; I am attaching the log file for 64K context with gradientai's Llama-3-8B 1M-context model here: output_llama3_1M_vllm_64k.log. I was assuming this error might be due to the 80 GB GPU limitation; please let me know if that is not the case.

Is there any way to successfully test contexts beyond that barrier on this GPU?

hsiehjackson commented 1 month ago

Sorry, I missed your reply. When you see the errors in your log, have you checked the logs on the server side (vLLM)? You can check whether you hit an OOM error or something similar.

yaswanth-iitkgp commented 1 month ago

Yes, I see an OOM error on the vLLM side. Does this mean we cannot run experiments (on gradientai/Llama-3-8B-Instruct-Gradient-1048k) with a 64K context length on a single A100 server?

/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
INFO 05-24 22:18:09 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='gradientai/Llama-3-8B-Instruct-Gradient-1048k', tokenizer='gradientai/Llama-3-8B-Instruct-Gradient-1048k', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1048576, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 05-24 22:18:20 selector.py:16] Using FlashAttention backend.
INFO 05-24 22:18:21 weight_utils.py:177] Using model weights format ['*.safetensors']
INFO 05-24 22:18:27 model_runner.py:104] Loading model weights took 15.2075 GB
Traceback (most recent call last):
  File "/workspace/scripts/pred/serve_vllm.py", line 116, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 348, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 311, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 422, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 110, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 40, in __init__
    self._init_cache()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 80, in _init_cache
    self.driver_worker.profile_num_available_blocks(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 131, in profile_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 742, in profile_run
    self.execute_model(seqs, kv_caches)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 663, in execute_model
    hidden_states = model_executable(**execute_model_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 345, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 271, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 223, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 75, in forward
    gate_up, _ = self.gate_up_proj(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 215, in forward
    output_parallel = self.linear_method.apply_weights(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 79, in apply_weights
    return F.linear(x, weight, bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 GiB. GPU 0 has a total capacty of 79.15 GiB of which 47.23 GiB is free. Process 532467 has 31.90 GiB memory in use. Of the allocated memory 31.24 GiB is allocated by PyTorch, and 21.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
hsiehjackson commented 3 weeks ago

Yep, it looks like that. Have you tried quantization?
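
For example, vLLM can load AWQ-quantized weights, which shrinks the weight memory and leaves more room for the KV cache. A rough sketch (the checkpoint name below is only a placeholder, I haven't checked whether a quantized 1M-context Llama-3 exists; also, your log shows the engine initialized with max_seq_len=1048576, so capping max_model_len should help regardless):

from vllm import LLM

# Placeholder checkpoint name -- substitute a real AWQ-quantized model.
llm = LLM(
    model="your-org/Llama-3-8B-Instruct-Gradient-1048k-AWQ",
    quantization="awq",
    max_model_len=65536,  # evaluate at 64K instead of the full 1M window
    trust_remote_code=True,
)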

yaswanth-iitkgp commented 3 weeks ago

No, I haven't tried a quantized version yet, but I tried a few other GGUF models and that wasn't successful either.