hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

XVERSE-13B-256K Model doesn't work properly in web_demo.py #2363

Closed seanxuu closed 8 months ago

seanxuu commented 8 months ago

Reminder

Reproduction

python src/web_demo.py \
    --model_name_or_path models/XVERSE-13B-256K \
    --template xverse
Exception in thread Thread-7 (generate):
Traceback (most recent call last):
  File " Miniconda/envs/llama_factory/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File " Miniconda/envs/llama_factory/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File " Miniconda/envs/llama_factory/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File " Miniconda/envs/llama_factory/lib/python3.10/site-packages/transformers/generation/utils.py", line 1764, in generate
    return self.sample(
  File " Miniconda/envs/llama_factory/lib/python3.10/site-packages/transformers/generation/utils.py", line 2861, in sample
    outputs = self(
  File " Miniconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File " Miniconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File " .cache/huggingface/modules/transformers_modules/XVERSE-13B-256K/modeling_xverse.py", line 715, in forward
    outputs = self.model(
  File " Miniconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File " Miniconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File " .cache/huggingface/modules/transformers_modules/XVERSE-13B-256K/modeling_xverse.py", line 603, in forward
    layer_outputs = decoder_layer(
  File " Miniconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File " Miniconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File " .cache/huggingface/modules/transformers_modules/XVERSE-13B-256K/modeling_xverse.py", line 311, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File " Miniconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File " Miniconda/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File " .cache/huggingface/modules/transformers_modules/XVERSE-13B-256K/modeling_xverse.py", line 249, in forward
    assert not use_cache, "use_cache is not supported"
AssertionError: use_cache is not supported

Expected behavior

Bug report

System Info


seanxuu commented 8 months ago

[INFO|configuration_utils.py:802] 2024-01-29 15:32:49,297 >> Model config XverseConfig {
  "_name_or_path": "models/XVERSE-13B-256K",
  "architectures": ["XverseForCausalLM"],
  "auto_map": {
    "AutoConfig": "configuration_xverse.XverseConfig",
    "AutoModelForCausalLM": "modeling_xverse.XverseForCausalLM"
  },
  "bos_token_id": 2,
  "eos_token_id": 3,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 13824,
  "max_position_embeddings": 32768,
  "max_tokenizer_truncation": 262144,
  "model_type": "xverse",
  "num_attention_heads": 40,
  "num_hidden_layers": 40,
  "pad_token_id": 1,
  "rms_norm_eps": 1e-06,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.36.2",
  "use_cache": false,
  "vocab_size": 100534
}

[INFO|modeling_utils.py:3341] 2024-01-29 15:32:49,425 >> loading weights file models/XVERSE-13B-256K/pytorch_model.bin.index.json
[INFO|modeling_utils.py:1341] 2024-01-29 15:32:49,426 >> Instantiating XverseForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:826] 2024-01-29 15:32:49,427 >> Generate config GenerationConfig {
  "bos_token_id": 2,
  "eos_token_id": 3,
  "pad_token_id": 1,
  "use_cache": false
}

seanxuu commented 8 months ago

I found a way to solve it: https://github.com/xverse-ai/XVERSE-13B/issues/27#issuecomment-1907907