Years-Enron commented 1 year ago

Image

pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel

Command executed

git clone https://github.com/modelscope/swift.git
cd swift/examples/pytorch/llm
bash scripts/qwen_7b/qlora/sft.sh

Result

Loading checkpoint shards:   0%|                                                                                                                                                                              | 0/8 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/root/swift/examples/pytorch/llm/src/llm_sft.py", line 291, in <module>
    llm_sft(args)
  File "/root/swift/examples/pytorch/llm/src/llm_sft.py", line 167, in llm_sft
    model, tokenizer = get_model_tokenizer(
  File "/root/swift/examples/pytorch/llm/src/utils/models.py", line 259, in get_model_tokenizer
    model, tokenizer = get_function(model_dir, torch_dtype, load_model,
  File "/root/swift/examples/pytorch/llm/src/utils/models.py", line 151, in get_model_tokenizer_qwen
    return get_model_tokenizer_from_repo(model_dir, torch_dtype, load_model,
  File "/root/swift/examples/pytorch/llm/src/utils/models.py", line 44, in get_model_tokenizer_from_repo
    model = AutoModelForCausalLM.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/modelscope/utils/hf_util.py", line 98, in from_pretrained
    model = module_class.from_pretrained(model_dir, *model_args,
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 488, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/modelscope/utils/hf_util.py", line 64, in from_pretrained
    return ori_from_pretrained(cls, model_dir, *model_args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2903, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3260, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 725, in _load_state_dict_into_meta_model
    set_module_quantized_tensor_to_device(
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/bitsandbytes.py", line 109, in set_module_quantized_tensor_to_device
    new_value = value.to(device)
RuntimeError: Device index must not be negative

root@dlcl4o079d5hls8w-master-0:~/swift/examples/pytorch/llm# python 
Python 3.10.11 (main, Apr 20 2023, 19:02:41) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> 
>>> print(torch.__version__)
2.0.1
>>> 
>>> 
>>> import torch
>>> flag = torch.cuda.is_available()
>>> if flag:
...     print("CUDA is available")
... else:
...     print("CUDA is not available")
... 
CUDA is available
>>> 
>>> ngpu= 1
>>> # Decide which device we want to run on
>>> device = torch.device("cuda:0" if (torch.cuda.is_available() and ngpu > 0) else "cpu")
>>> 
>>> device
device(type='cuda', index=0)
>>> print("Device:", device)
Device: cuda:0
>>> print("GPU model:", torch.cuda.get_device_name(0))
GPU model: NVIDIA A10
>>>

Years-Enron commented 1 year ago

Setting the environment variable RANK=-1 resolves the error above.
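
For anyone launching the script from Python instead of exporting the variable in the shell, here is a minimal sketch of the same workaround. The helper names `env_with_rank` and `run_sft` are hypothetical (not part of swift), and the sketch assumes only that the training script reads RANK from its environment.

```python
import os
import subprocess


def env_with_rank(rank=-1, base_env=None):
    """Return a copy of the environment with RANK overridden (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    env["RANK"] = str(rank)  # environment values must be strings
    return env


def run_sft(script="scripts/qwen_7b/qlora/sft.sh"):
    """Run the script from the report with RANK=-1.

    Assumes the current directory is swift/examples/pytorch/llm.
    """
    return subprocess.run(["bash", script], env=env_with_rank(-1), check=True)
```

Calling `run_sft()` from swift/examples/pytorch/llm should behave like prefixing the original command with RANK=-1; the parent process environment is left unchanged.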

Jintao-Huang commented 1 year ago

OK, thanks for the issue. We'll look into it right away and get back to you.

Jintao-Huang commented 1 year ago

I couldn't reproduce your problem. My guess is that the RANK environment variable is set to 0 while LOCAL_RANK is set to -1. That combination is fairly uncommon.

Years-Enron commented 1 year ago

> I couldn't reproduce your problem. My guess is that the RANK environment variable is set to 0 while LOCAL_RANK is set to -1. That combination is fairly uncommon.

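I didn't set any environment variables myself. It looks like the Docker image's default environment variables match what you describe; the environment was created with Alibaba Cloud DLC.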
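
To check whether an environment has the combination guessed at in this thread, a small diagnostic like the following may help. It is only a sketch: `check_rank_env` is a hypothetical helper, it assumes both variables hold integers when set, and it flags only the specific case discussed here (RANK non-negative, LOCAL_RANK negative).

```python
import os


def check_rank_env(env=None):
    """Read RANK and LOCAL_RANK and flag the mismatch discussed in this thread."""
    env = os.environ if env is None else env
    values = {}
    for name in ("RANK", "LOCAL_RANK"):
        raw = env.get(name)
        # Unset variables are reported as None; set ones are assumed to be integers.
        values[name] = int(raw) if raw is not None else None
    rank, local_rank = values["RANK"], values["LOCAL_RANK"]
    mismatch = (
        rank is not None and rank >= 0
        and local_rank is not None and local_rank < 0
    )
    return values, mismatch


if __name__ == "__main__":
    values, mismatch = check_rank_env()
    print(values)
    if mismatch:
        print("RANK is non-negative but LOCAL_RANK is negative")
```

Running this inside the container before starting training shows the values the image sets by default, which is the situation described above.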

modelscope / ms-swift

RuntimeError: Device index must not be negative #15
