taishan1994 / Llama3.1-Finetuning

Full-parameter fine-tuning, LoRA fine-tuning, and QLoRA fine-tuning of Llama 3.
Apache License 2.0

Does Llama 3 need a model conversion step? #7

Closed Lavenderlyu closed 2 months ago

Lavenderlyu commented 4 months ago

I ran it without doing any model conversion and got an error.

(Two screenshots of the error were attached: 2024-06-21 15:42:19 and 2024-06-21 15:42:09.)
taishan1994 commented 4 months ago

No conversion is needed. Make sure that:

  1. The package versions match the ones the repo expects.
  2. You run it from the command line with the required arguments.
Lavenderlyu commented 4 months ago

Thank you! Which Python version are you using?

taishan1994 commented 4 months ago

3.8

Lavenderlyu commented 4 months ago

Thank you very much! My environment cannot install bitsandbytes above 0.42.0; all the other package versions are identical.

(llama3_ft) [xylv@server01 script]$ pip install bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl
WARNING: Requirement 'bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl' looks like a filename, but the file does not exist
ERROR: bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl is not a supported wheel on this platform.
(llama3_ft) [xylv@server01 script]$ python -V
Python 3.8.19

The problems I am currently running into:

  1. The CPU optimizer does not support gcc versions below 5, and the gcc version on the lab server cannot be changed, so I disabled the CPU optimizer and then hit the following problem:

  2. After changing "zero_optimization" in ds_config_zero3 to use "offload_optimizer": {"device": "none"}, I get the error below:

/home/xylv/anaconda3/envs/llama3_ft/lib/python3.8/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
[2024-06-21 22:43:27,797] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-06-21 22:43:27,797] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-06-21 22:43:27,797] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
quantization_config: None
Loading checkpoint shards: 100%|███████████████████████████| 4/4 [00:14<00:00, 3.69s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
['o_proj', 'k_proj', 'q_proj', 'v_proj'] None
trainable params: 54,525,952 || all params: 8,084,787,200 || trainable%: 0.6744265575722265
Loading data...
Formatting inputs...Skip in lazy mode
/home/xylv/anaconda3/envs/llama3_ft/lib/python3.8/site-packages/transformers/optimization.py:429: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
  warnings.warn(
/home/xylv/anaconda3/envs/llama3_ft/lib/python3.8/site-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an accelerate.DataLoaderConfiguration instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
  warnings.warn(
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Traceback (most recent call last):
  File "../finetune_llama3.py", line 434, in <module>
    train()
  File "../finetune_llama3.py", line 421, in train
    trainer = Trainer(
  File "/home/xylv/anaconda3/envs/llama3_ft/lib/python3.8/site-packages/transformers/trainer.py", line 527, in __init__
    raise RuntimeError(
RuntimeError: Passing optimizers is not allowed if Deepspeed or PyTorch FSDP is enabled. You should subclass Trainer and override the create_optimizer_and_scheduler method.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 56578) of binary: /home/xylv/anaconda3/envs/llama3_ft/bin/python
Traceback (most recent call last):
  File "/home/xylv/anaconda3/envs/llama3_ft/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/xylv/anaconda3/envs/llama3_ft/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/xylv/anaconda3/envs/llama3_ft/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/xylv/anaconda3/envs/llama3_ft/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/xylv/anaconda3/envs/llama3_ft/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/xylv/anaconda3/envs/llama3_ft/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Is there an alternative solution? Thank you very much. My email is lyuxiangyue@qq.com; if you have another workaround and are willing to share it, could you email me? Many thanks!
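The RuntimeError above is raised because the training script hands its own optimizer/scheduler pair to Trainer while a DeepSpeed config is active, which transformers does not allow. Below is a minimal sketch of one way around it, assuming the script builds a custom optimizer somewhere; the helper and argument names are placeholders, not the repo's actual code:

```python
from transformers import Trainer, TrainingArguments


def build_trainer(model, training_args: TrainingArguments, train_dataset,
                  optimizer=None, scheduler=None) -> Trainer:
    """Pass a hand-built optimizer/scheduler only when DeepSpeed is off.

    When training_args.deepspeed is set, DeepSpeed creates the optimizer from
    its own config; passing `optimizers` in that case triggers the
    RuntimeError shown in the traceback above.
    """
    kwargs = dict(model=model, args=training_args, train_dataset=train_dataset)
    if training_args.deepspeed is None and optimizer is not None:
        kwargs["optimizers"] = (optimizer, scheduler)
    return Trainer(**kwargs)
```

The alternative the error message itself suggests is to subclass Trainer and override create_optimizer_and_scheduler instead of passing optimizers in.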

taishan1994 commented 4 months ago

Use the ZeRO-2 config.
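For anyone blocked by the gcc requirement, here is a rough ZeRO-2 sketch with optimizer offload disabled, so the DeepSpeed CPU Adam extension never needs to be compiled. The field names follow the public DeepSpeed config schema; the repo's actual ds_config_zero2.json may look different:

```python
# Hypothetical minimal ZeRO-2 config: keeping the optimizer on GPU avoids
# building DeepSpeedCPUAdam, so the gcc >= 5 requirement never applies.
ds_config_zero2 = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "none"},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}
```

If the script reads TrainingArguments via HfArgumentParser, this can be saved as JSON and passed with the existing --deepspeed command-line argument, or handed directly to TrainingArguments(deepspeed=ds_config_zero2).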

Lavenderlyu commented 4 months ago

Thanks a lot! It worked after upgrading gcc!