hiyouga / LLaMA-Factory

Unify Efficient Fine-Tuning of 100+ LLMs
Apache License 2.0

Summary of and questions about training models on NPU #4388

Open sweetning0809 opened 2 weeks ago

sweetning0809 commented 2 weeks ago

System Info

QWEN2-7B(MoE)

Requires bf16 (see #4278). Works.

glm4

Comment out the torch.jit lines and use bf16. See #4339 and #3788.

chatglm3

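Same approach as above, but after merging the model you need to copy over every file from the original model folder except *bin and pytorch_model.bin.index.json. See #1307.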
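The copy step described for chatglm3 can be sketched in Python as follows. This is only an illustration: the helper name `copy_aux_files` and both directory arguments are hypothetical, and the exclusion patterns are taken literally from the comment above (`*bin` and `pytorch_model.bin.index.json`).

```python
import fnmatch
import shutil
from pathlib import Path

# Patterns for files that must NOT be copied, taken from the comment above.
EXCLUDE_PATTERNS = ("*bin", "pytorch_model.bin.index.json")


def copy_aux_files(original_dir, merged_dir):
    """Copy every regular file except weight files from original_dir to merged_dir.

    Hypothetical helper: both paths are placeholders for the original
    checkpoint folder and the folder that holds the merged model.
    """
    src, dst = Path(original_dir), Path(merged_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for item in sorted(src.iterdir()):
        if not item.is_file():
            continue  # subdirectories are skipped in this sketch
        if any(fnmatch.fnmatch(item.name, pat) for pat in EXCLUDE_PATTERNS):
            continue  # old weight shard or weight index
        shutil.copy2(item, dst / item.name)
        copied.append(item.name)
    return copied
```

Note that files with the same name in the merged folder are overwritten, so it may be worth checking the returned list before relying on the result.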

DeepSeek (MoE)

Fails. The model's operators need to be converted. See: https://www.hiascend.com/document/detail/zh/Pytorch/60RC1/ptmoddevg/trainingmigrguide/performance_tuning_0027.html#ZH-CN_TOPIC_0000001889766765__section132951137183219

gemma

Works.

LLaMA-3

Works.

Baichuan-2

Works.

PHI3

Error:
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connection.py", line 615, in connect contents = read_file_cached(tiktoken_bpe_file, expected_hash)
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 64, in read_file_cached contents = read_file(blobpath)
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 25, in read_file resp = requests.get(blobpath)
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/api.py", line 73, in get self.sock = sock = self._new_conn()
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connection.py", line 203, in _new_conn return request("get", url, params=params, kwargs)
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/api.py", line 59, in request conn.connect()
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connection.py", line 615, in connect self._validate_conn(conn)
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1095, in _validate_conn return session.request(method=method, url=url, kwargs)
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/sessions.py", line 589, in request return tokenizer_class.from_pretrained(
File "/home/hadoop-friday-llm/.cache/huggingface/modules/transformers_modules/Phi-3-small-8k-instruct/tokenization_phi3_small.py", line 190, in from_pretrained raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7f4053c11070>: Failed to resolve 'openaipublic.blob.core.windows.net' ([Errno -2] Name or service not known)

Mistral-7B-v0.1

Works.

Mixtral-8x7B-v0.1

8 cards with 64 GB memory: stage 3 is required.
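A rough back-of-the-envelope check suggests why sharding is needed here. The sketch below rests on several assumptions that are not stated in the issue: that "stage3" refers to DeepSpeed ZeRO stage 3 (which partitions parameters across devices), that Mixtral-8x7B has about 46.7 billion parameters in total, that weights are held in bf16 at 2 bytes per parameter, and that "64G" means 64 GiB per card. It counts weights only and ignores gradients, optimizer states and activations, so real usage is higher. All names are illustrative.

```python
GIB = 2 ** 30


def weight_gib(num_params, bytes_per_param=2, num_shards=1):
    """Memory in GiB taken by the weights alone on each card."""
    return num_params * bytes_per_param / num_shards / GIB


MIXTRAL_PARAMS = 46.7e9  # assumption: approximate total parameter count
CARD_GIB = 64            # assumption: "64G" is the memory of one card
NUM_CARDS = 8

# Full copy of the weights on every card (no parameter partitioning).
replicated = weight_gib(MIXTRAL_PARAMS)
# Weights partitioned evenly across all cards (stage 3 style sharding).
sharded = weight_gib(MIXTRAL_PARAMS, num_shards=NUM_CARDS)

print(f"replicated: {replicated:.1f} GiB per card, fits: {replicated < CARD_GIB}")
print(f"sharded:    {sharded:.1f} GiB per card, fits: {sharded < CARD_GIB}")
```

Under these assumptions the replicated weights alone exceed one card's memory, while the sharded weights leave headroom for the rest of the training state.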

CodeLlama-7b-hf(13B)

Works.

Yi1.5

Works.

Reproduction

llamafactory

Expected behavior

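I picked a set of representative models and re-ran the experiments on NPU, hoping that all of them would succeed. I would appreciate help with the Phi-3 failure: I have confirmed that the model is stored locally and is loaded via an absolute path.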

Others

No response

hiyouga commented 2 weeks ago

cc @statelesshz

sweetning0809 commented 2 weeks ago

Additional error output:
Traceback (most recent call last):
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/adapters.py", line 667, in send return cls(cls_kwargs)
File "/home/hadoop-friday-llm/.cache/huggingface/modules/transformers_modules/Phi-3-small-8k-instruct/tokenization_phi3_small.py", line 105, in init base = tiktoken.get_encoding("cl100k_base")
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/registry.py", line 73, in get_encoding resp = self.send(prep, send_kwargs)
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/sessions.py", line 703, in send enc = Encoding(constructor())
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken_ext/openai_public.py", line 72, in cl100k_base self.sock = sock = self._new_conn() resp = conn.urlopen(
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connection.py", line 203, in _new_conn
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 843, in urlopen mergeable_ranks = load_tiktoken_bpe(
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 147, in load_tiktoken_bpe contents = read_file_cached(tiktoken_bpe_file, expected_hash)
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 64, in read_file_cached r = adapter.send(request, kwargs)
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/adapters.py", line 700, in send raise NameResolutionError(self.host, self, e) from e contents = read_file(blobpath)
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 25, in read_file
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7fa0f0927340>: Failed to resolve 'openaipublic.blob.core.windows.net' ([Errno -2] Name or service not known)

I wonder whether base = tiktoken.get_encoding("cl100k_base") has to access the network.

sweetning0809 commented 2 weeks ago

I suspect that base = tiktoken.get_encoding("cl100k_base") has to access the network. That would be consistent with the attempt to reach openaipublic.blob.core.windows.net.

sweetning0809 commented 2 weeks ago

I checked the model files: cl100k_base.tiktoken sits in the same directory as the weights, so perhaps it is not being used?

sweetning0809 commented 2 weeks ago

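This problem is solved: tiktoken.get_encoding("cl100k_base") does have to access the external network. Following the tiktoken get_encoding source shows that it first looks in a local cache for a file named by a hash and only then goes to the network. That name is a SHA-1 hash computed from the file's path (blobpath). So the workaround is: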

  1. export TIKTOKEN_CACHE_DIR=
  2. Then place cl100k_base.tiktoken under TIKTOKEN_CACHE_DIR and rename it to the hash value: 9b5ad71b2ce5302211f9c61530b329a4922fc6a4
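The two steps above can be sketched in Python. This is a minimal illustration, not tiktoken's own code: the helper names `cache_key` and `seed_tiktoken_cache` are hypothetical, and `CL100K_URL` is an assumption about the path that tiktoken hashes for cl100k_base. If that assumption holds, `cache_key(CL100K_URL)` should reproduce the hash quoted in step 2; if it does not, the URL is wrong and the hash from step 2 should be used directly as the file name.

```python
import hashlib
import shutil
from pathlib import Path

# Assumption: the path (URL) that tiktoken hashes when caching cl100k_base.
CL100K_URL = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"


def cache_key(blob_url):
    """SHA-1 hex digest of the path string, used as the cache file name."""
    return hashlib.sha1(blob_url.encode()).hexdigest()


def seed_tiktoken_cache(local_file, cache_dir, blob_url=CL100K_URL):
    """Copy local_file into cache_dir under the hash-derived name."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / cache_key(blob_url)
    shutil.copyfile(local_file, target)
    return target
```

To use it, pass the cl100k_base.tiktoken file that ships next to the model weights as `local_file`, and pass the same directory that is exported as TIKTOKEN_CACHE_DIR as `cache_dir`.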

However, I then ran into a new problem: assert is_flash_attention_available, "Flash Attention is not available, but is needed for dense attention". flash_attention cannot be used on NPU. As with DeepSeek, this may require operator conversion.

exceedzhang commented 1 week ago

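@sweetning0809 Which Python version are you using? 3.10? With 3.10 I hit the problem shown below. Training Qwen2 and LLaMA3 works, but the system reports an error, which I suspect could affect model performance. [image]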

sweetning0809 commented 1 week ago

I looked back through my logs and did not see anything like this. I am on Python 3.9. It looks similar to the error in https://github.com/Ascend/DeepSpeed/commit/c134c39d720a78ad3e285b5a5959d0320fd0964a

sweetning0809 commented 1 week ago

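A check is skipped during gradient conversion, but I do not think it will affect the results. I am not entirely sure, so you could finish training and evaluate the model first.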

exceedzhang commented 1 week ago

Thanks! I looked into this error, and it should only occur with Python 3.10. Python 3.9 should not have this problem.