Open sweetning0809 opened 2 weeks ago
cc @statelesshz
Additional error output: Traceback (most recent call last):
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/adapters.py", line 667, in send
return cls(cls_kwargs)
File "/home/hadoop-friday-llm/.cache/huggingface/modules/transformers_modules/Phi-3-small-8k-instruct/tokenization_phi3_small.py", line 105, in init
base = tiktoken.get_encoding("cl100k_base")
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/registry.py", line 73, in get_encoding
resp = self.send(prep, send_kwargs)
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/sessions.py", line 703, in send
enc = Encoding(constructor())
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken_ext/openai_public.py", line 72, in cl100k_base
self.sock = sock = self._new_conn()
resp = conn.urlopen(
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connection.py", line 203, in _new_conn
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 843, in urlopen
mergeable_ranks = load_tiktoken_bpe(
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 147, in load_tiktoken_bpe
contents = read_file_cached(tiktoken_bpe_file, expected_hash)
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 64, in read_file_cached
r = adapter.send(request, kwargs)
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/adapters.py", line 700, in send
raise NameResolutionError(self.host, self, e) from e
contents = read_file(blobpath)
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 25, in read_file
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7fa0f0927340>: Failed to resolve 'openaipublic.blob.core.windows.net' ([Errno -2] Name or service not known)
I suspect that base = tiktoken.get_encoding("cl100k_base") requires network access.
That would also match the failed attempt to reach openaipublic.blob.core.windows.net.
I checked the model files: cl100k_base.tiktoken exists at the same level as the weights folder, so perhaps it is not being used?
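This problem is solved: tiktoken.get_encoding("cl100k_base") does require external network access. Reading the source of tiktoken's get_encoding shows that it first looks the file up by hash and only then goes online to fetch it. The file name is also hashed (hash1), so a workaround is possible.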
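One way to act on this, as a minimal sketch rather than the exact fix used here: place the local cl100k_base.tiktoken into a cache directory under the name tiktoken is assumed to look for, so that no download is attempted. The sketch assumes that the installed tiktoken version keys its cache on the SHA-1 of the download URL, that it honors the `TIKTOKEN_CACHE_DIR` environment variable, and that the URL below is the one it requests. The helper name `seed_tiktoken_cache` is hypothetical.

```python
import hashlib
import os
import shutil

# Assumption: the URL that tiktoken requests for the cl100k_base vocabulary.
CL100K_URL = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"


def seed_tiktoken_cache(local_bpe_file, cache_dir, blob_url=CL100K_URL):
    """Copy a local .tiktoken file into cache_dir, named by the SHA-1 of its URL."""
    # Assumption: tiktoken uses the SHA-1 hex digest of the URL as the cache file name.
    cache_key = hashlib.sha1(blob_url.encode()).hexdigest()
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, cache_key)
    shutil.copyfile(local_bpe_file, target)
    # Assumption: tiktoken checks this directory before going online.
    os.environ["TIKTOKEN_CACHE_DIR"] = cache_dir
    return target
```

The helper would be called before the tokenizer is loaded, with the path to the cl100k_base.tiktoken file that sits next to the weights and any writable cache directory. If tiktoken also verifies a content hash, the local file has to be the unmodified original.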
But I then ran into a new problem: assert is_flash_attention_available, "Flash Attention is not available, but is needed for dense attention". The NPU cannot use flash_attention, so this model may need operator conversion, as DeepSeek does.
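@sweetning0809 Which Python version are you using? Is it 3.10? With 3.10 I ran into the following problem: training Qwen2 and LLaMA3 works, but the system reports an error, which I suspect will affect model performance.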
I went back through my logs and did not see anything like this. I am on Python 3.9. It looks similar to the error in https://github.com/Ascend/DeepSpeed/commit/c134c39d720a78ad3e285b5a5959d0320fd0964a
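The check is missing during gradient conversion, but I do not think it affects the results. I am not certain, so you could finish training and run an evaluation first.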
Thanks! I looked into this error: it should only occur on Python 3.10, and Python 3.9 should not have this problem!
Reminder
System Info
QWEN2-7B(MoE)
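Requires bf16 (#4278). Works.
glm4
Comment out the torch.jit lines and use bf16. See #4339 and #3788.
chatglm3
Same approach as above, but after merging the model, copy all files from the original folder except *bin and pytorch_model.bin.index.json into the merged folder. See #1307.
DeepSeek (MoE)
Fails. The model needs operator conversion. See: https://www.hiascend.com/document/detail/zh/Pytorch/60RC1/ptmoddevg/trainingmigrguide/performance_tuning_0027.html#ZH-CN_TOPIC_0000001889766765__section132951137183219
gemma
Works.
LLaMA-3
Works.
Baichuan-2
Works.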
PHI3
Error:
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connection.py", line 615, in connect
contents = read_file_cached(tiktoken_bpe_file, expected_hash)
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 64, in read_file_cached
contents = read_file(blobpath)
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 25, in read_file
resp = requests.get(blobpath)
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/api.py", line 73, in get
self.sock = sock = self._new_conn()
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connection.py", line 203, in _new_conn
return request("get", url, params=params, kwargs)
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/api.py", line 59, in request
conn.connect()
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connection.py", line 615, in connect
self._validate_conn(conn)
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1095, in _validate_conn
return session.request(method=method, url=url, kwargs)
File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/sessions.py", line 589, in request
return tokenizer_class.from_pretrained(
File "/home/hadoop-friday-llm/.cache/huggingface/modules/transformers_modules/Phi-3-small-8k-instruct/tokenization_phi3_small.py", line 190, in from_pretrained
raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7f4053c11070>: Failed to resolve 'openaipublic.blob.core.windows.net' ([Errno -2] Name or service not known)
Mistral-7B-v0.1
Works.
Mixtral-8x7B-v0.1
8 cards with 64 GB each; requires stage3.
CodeLlama-7b-hf(13B)
Works.
Yi1.5
Works.
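The chatglm3 step in the list above (copying everything except *bin and pytorch_model.bin.index.json from the original folder after merging) could be sketched roughly as follows. The function name `copy_non_weight_files`, its parameters, and the choice to skip files that already exist in the merged folder are assumptions made for illustration, not something stated in this issue.

```python
import fnmatch
import os
import shutil

# File name patterns to leave behind, taken from the chatglm3 note above.
SKIP_PATTERNS = ("*bin", "pytorch_model.bin.index.json")


def copy_non_weight_files(original_dir, merged_dir, overwrite=False):
    """Copy top-level files from original_dir to merged_dir, skipping weight files."""
    copied = []
    os.makedirs(merged_dir, exist_ok=True)
    for name in sorted(os.listdir(original_dir)):
        src = os.path.join(original_dir, name)
        if not os.path.isfile(src):
            continue  # only top-level regular files are considered
        if any(fnmatch.fnmatch(name, pattern) for pattern in SKIP_PATTERNS):
            continue  # weight shards and the weight index stay behind
        dst = os.path.join(merged_dir, name)
        if os.path.exists(dst) and not overwrite:
            continue  # assumption: keep files that the merge step already wrote
        shutil.copy2(src, dst)
        copied.append(name)
    return copied
```

Whether files already written by the merge step should be overwritten is not specified here, so the sketch keeps them unless `overwrite=True` is passed.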
Reproduction
llamafactory
Expected behavior
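I picked a set of representative models and re-ran the experiments on NPU, hoping all of them would succeed. I would appreciate help with the phi3 failure: the model is confirmed to be stored locally and is loaded via an absolute path.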
Others
No response