Unable to load the tokenizer

wpNZC commented 3 months ago

I downloaded the official pretrained model and tried to run inference with it myself, but I get an error saying the tokenizer cannot be loaded. I have tried adding a mirror site, changing local_files_only, downloading the model locally and uploading it to the server, and various other loading methods, such as the relative path shown in the official documentation example and absolute paths. None of them solved the problem.

The code I modified is self.tokenizer = AutoTokenizer.from_pretrained at line 40 of ovdino/detrex/modeling/language_backbone/bert.py, which is also where the error is raised.

After the various modifications, I have seen these errors:
1) huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/root/data1/vision/OV-DINO/OV-DINO/ovdino/detrex/modeling/language_backbone/bert-base-uncased/'. Use repo_type argument if needed.
2) OSError: We couldn't connect to 'https://hf-mirror.com' to load this file, couldn't find it in the cached files and it looks like bert-base-uncased is not the path to a directory containing a file named config.json. Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
3) huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: './bert-base-uncased'.

I have seen some projects achieve local loading by modifying code elsewhere, so I may be misunderstanding how to resolve this error, which is why I am asking for help.

The complete input and output of one of these runs:

(py310) root@ubuntu-virtual-machine:/data1/vision/OV-DINO/OV-DINO/ovdino# sh scripts/demo.sh projects/ovdino/configs/ovdino_swin_tiny224_bert_base_infer_demo.py ../inits/ovdino/ovdino_swint_ogc-coco50.2_lvismv40.1_lvis32.9.pth "person box carbon" test/0bd8249d156e430c86ee75082c72428f.jpg result.jpg
SAM2 is not installed.
[08/14 08:42:14 detectron2]: Arguments: Namespace(config_file='projects/ovdino/configs/ovdino_swin_tiny224_bert_base_infer_demo.py', sam_config_file='sam2_hiera_l.yaml', sam_init_checkpoint='/data1/vision/OV-DINO/OV-DINO/inits/sam2/sam2_hiera_large.pt', webcam=False, video_input=None, input=['test/0bd8249d156e430c86ee75082c72428f.jpg'], category_names=['person', 'box', 'carbon'], output='result.jpg', min_size_test=800, max_size_test=1333, img_format='RGB', metadata_dataset='coco_2017_val', confidence_threshold=0.5, opts=['train.init_checkpoint=/data1/vision/OV-DINO/OV-DINO/inits/ovdino/ovdino_swint_ogc-coco50.2_lvismv40.1_lvis32.9.pth', 'model.num_classes=3'])
/root/anaconda3/envs/py310/lib/python3.10/site-packages/torch/functional.py:512: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3587.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "/data1/vision/OV-DINO/OV-DINO/ovdino/./demo/demo.py", line 147, in <module>
    model = instantiate(cfg.model)
  File "/data1/vision/OV-DINO/OV-DINO/ovdino/detectron2-717ab9/detectron2/config/instantiate.py", line 67, in instantiate
    cfg = {k: instantiate(v) for k, v in cfg.items()}
  File "/data1/vision/OV-DINO/OV-DINO/ovdino/detectron2-717ab9/detectron2/config/instantiate.py", line 67, in <dictcomp>
    cfg = {k: instantiate(v) for k, v in cfg.items()}
  File "/data1/vision/OV-DINO/OV-DINO/ovdino/detectron2-717ab9/detectron2/config/instantiate.py", line 83, in instantiate
    return cls(**cfg)
  File "/data1/vision/OV-DINO/OV-DINO/ovdino/./detrex/modeling/language_backbone/bert.py", line 99, in __init__
    self.tokenizer = BERTTokenizer(tokenizer_cfg)
  File "/data1/vision/OV-DINO/OV-DINO/ovdino/./detrex/modeling/language_backbone/bert.py", line 40, in __init__
    self.tokenizer = AutoTokenizer.from_pretrained("/root/data1/vision/OV-DINO/OV-DINO/ovdino/detrex/modeling/language_backbone/bert-base-uncased/")
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 643, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 487, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/transformers/utils/hub.py", line 417, in cached_file
    resolved_file = hf_hub_download(
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/root/anaconda3/envs/py310/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/root/data1/vision/OV-DINO/OV-DINO/ovdino/detrex/modeling/language_backbone/bert-base-uncased/'. Use repo_type argument if needed.

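The traceback shows the string falling through to hf_hub_download, which is what happens when from_pretrained is given a string that is not an existing directory: it is then treated as a Hub repo id, so a wrong path surfaces as HFValidationError or a connection error rather than "No such file or directory". A minimal sketch of a pre-check that fails early with a clearer message follows. The helper name require_local_model_dir is hypothetical and not part of OV-DINO or transformers.

```python
import os


def require_local_model_dir(path, marker="config.json"):
    """Return the absolute path if it is a directory containing `marker`.

    Hypothetical helper: raises FileNotFoundError early instead of letting
    the loader fall back to interpreting the string as a Hub repo id.
    """
    abs_path = os.path.abspath(os.path.expanduser(path))
    if not os.path.isdir(abs_path):
        raise FileNotFoundError(f"Not a directory: {abs_path}")
    if not os.path.isfile(os.path.join(abs_path, marker)):
        raise FileNotFoundError(f"{marker} not found in {abs_path}")
    return abs_path


# Possible usage (path is a placeholder, adjust to your own layout):
# model_dir = require_local_model_dir("/path/to/bert-base-uncased")
# tokenizer = AutoTokenizer.from_pretrained(model_dir)
```

Printing the value returned by the helper right before the from_pretrained call also shows whether the string that reaches the loader is the one you typed.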
wanghao9610 commented 3 months ago

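@wpNZC Hi, this is most likely a model download problem. In ovdino/scripts/train.sh I set the HF_HOME environment variable, which changes where Hugging Face models are downloaded and stored. If you followed the usage guide, check whether the model files exist in that folder (the default Hugging Face path is ~/.cache/huggingface/hub). Network access from within China can be unreliable for downloads, so consider adding the following environment variable: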

export HF_ENDPOINT="https://hf-mirror.com"
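
If the script is launched from Python rather than a shell, the same setting can be applied in code. This is a sketch under the assumption that it runs before transformers or huggingface_hub is imported, since the endpoint is read when those libraries load. The function name configure_hf_mirror is hypothetical.

```python
import os


def configure_hf_mirror(endpoint="https://hf-mirror.com"):
    """Set HF_ENDPOINT unless one is already exported; return the effective value."""
    os.environ.setdefault("HF_ENDPOINT", endpoint)
    return os.environ["HF_ENDPOINT"]


# Call this first, and import transformers only afterwards.
print("HF_ENDPOINT =", configure_hf_mirror())
```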

wanghao9610 commented 3 months ago

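If you need to load locally, you have to manually download all the files to local disk; they can be downloaded from here. You cannot directly use the path that the Hugging Face code downloads to! Then change the code in the relevant places to the absolute path where you saved the files. There are two places: 1) change the tokenizer's tokenizer_name to the absolute path; 2) change the model's model_name to the absolute path.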

wpNZC commented 3 months ago

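@wanghao9610 Thank you for taking the time to reply. I have tried everything you mentioned and it still does not work. I also tried a separate new file to test whether from_pretrained runs normally, and it worked without problems, which is why I am asking for help. ~/.cache/huggingface/hub does not contain the required bert-base-uncased either, and neither does huggingface/hub under inits.

Error: OSError: We couldn't connect to 'https://hf-mirror.com' to load this file, couldn't find it in the cached files and it looks like bert-base-uncased is not the path to a directory containing a file named config.json. Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

However, the mirror site can be pinged. Even though I added local_files_only=True to from_pretrained, it still tries to connect to the repo or the mirror site. Regarding the usage guide: 1) Since I want to run inference on custom data and fine-tune, I skipped downloading and evaluating the related datasets. Could that matter? Do those steps include some initialization, as in YOLOv5? 2) Because the server's CUDA version differs from the one in the usage guide's install commands, I installed the latest PyTorch 2.4.0 with CUDA 12.1. Could the PyTorch and CUDA versions have an effect?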

wanghao9610 commented 3 months ago

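The main problem is still your network connection. Here is a workable solution: on your own PC, download all the files from https://hf-mirror.com/google-bert/bert-base-uncased/tree/main, then upload them to the server you are using. Change the code to use the absolute path, as described in my answer above. The TRANSFORMERS_OFFLINE environment variable in train.sh makes it load local files; perhaps try commenting out that line first? 1) Not downloading the datasets has no effect; 2) I recommend using the recommended environment configuration.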
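
Because several environment variables are in play here (HF_HOME and TRANSFORMERS_OFFLINE from train.sh, HF_ENDPOINT from the mirror suggestion), it can help to print what the Python process actually sees before loading anything. The sketch below only reads the environment. HF_HUB_OFFLINE is not mentioned in this thread; it is included on the assumption that it is worth checking too, and the function name hf_env_snapshot is hypothetical.

```python
import os

# Variables discussed in this thread, plus HF_HUB_OFFLINE (an addition, see lead-in).
HF_ENV_VARS = ("HF_HOME", "HF_ENDPOINT", "HF_HUB_OFFLINE", "TRANSFORMERS_OFFLINE")


def hf_env_snapshot(environ=None):
    """Return a dict of the Hugging Face related variables, '<unset>' if missing."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, "<unset>") for name in HF_ENV_VARS}


for name, value in hf_env_snapshot().items():
    print(f"{name}={value}")
```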

wpNZC commented 3 months ago

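In my earlier attempts I had already uploaded the bert-base-uncased files to the server and changed the code to point to the absolute path, but it still did not work. Even when I deliberately mistyped the path, I did not get [Errno 2] No such file or directory. I also just tried commenting out the TRANSFORMERS_OFFLINE environment variable, and that did not help either. Perhaps I can only work on the network side. Thank you for your replies.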

wanghao9610 commented 3 months ago

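I tried loading from an absolute path and it works without problems. First, confirm that the files you downloaded are complete. You can refer to this hfd.sh script, which downloads quickly from within China. There is no need to set the TRANSFORMERS_OFFLINE environment variable. The test code is: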

from transformers import AutoTokenizer, BertConfig, BertModel

model_dir="./bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model_cfg = BertConfig.from_pretrained(model_dir)
model = BertModel.from_pretrained(model_dir)
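
To check that the downloaded files are complete before running the test above, something like the following sketch could be used. The file list is an assumption about what a BERT tokenizer and model directory usually contains, and the 1 KiB threshold is only a rough heuristic for spotting Git LFS pointer files left by a clone without LFS. The function name check_model_dir is hypothetical.

```python
import os

# Assumed file list; adjust it to the files actually published for the model.
REQUIRED_FILES = ("config.json", "tokenizer_config.json", "vocab.txt")
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")  # any one is enough


def check_model_dir(model_dir):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for name in REQUIRED_FILES:
        if not os.path.isfile(os.path.join(model_dir, name)):
            problems.append(f"missing {name}")
    weights = [n for n in WEIGHT_FILES if os.path.isfile(os.path.join(model_dir, n))]
    if not weights:
        problems.append("missing weights (" + " or ".join(WEIGHT_FILES) + ")")
    for name in weights:
        # Real weight files are large; a tiny one may be a Git LFS pointer.
        if os.path.getsize(os.path.join(model_dir, name)) < 1024:
            problems.append(f"{name} is tiny; possibly a Git LFS pointer")
    return problems


# Possible usage:
# print(check_model_dir("./bert-base-uncased") or "looks complete")
```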

wpNZC commented 3 months ago

Hi, I cloned it with git from the mirror site to my local PC and then uploaded it, so it should be fine.

However, running your test code produced a warning: Some weights of the model checkpoint at ./bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight']

This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

I wanted to try the hfd.sh you mentioned, but it did not succeed:
(base) root@ubuntu-virtual-machine:/data1/vision/OV-DINO/OV-DINO/ovdino# hfd google-bert/bert-base-uncased
Downloading to bert-base-uncased
Testing GIT_REFS_URL: https://huggingface.co/google-bert/bert-base-uncased/info/refs?service=git-upload-pack
Unexpected HTTP Status Code: 000
Executing debug command: curl -v https://huggingface.co/google-bert/bert-base-uncased/info/refs?service=git-upload-pack
Output:

Trying 199.59.148.147:443...
Trying [2a03:2880:f111:83:face:b00c:0:25de]:443...
Immediate connect fail for 2a03:2880:f111:83:face:b00c:0:25de: Network is unreachable
connect to 199.59.148.147 port 443 failed: Connection timed out
Failed to connect to huggingface.co port 443 after 131051 ms: Couldn't connect to server
Closing connection curl: (28) Failed to connect to huggingface.co port 443 after 131051 ms: Couldn't connect to server

Git clone failed.

wanghao9610 commented 3 months ago

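This warning is normal. Since you can load and run the test code, your model weights were downloaded correctly. The fact that hfd cannot download shows your network really is not great (the biggest obstacle to research in China is the network 🤣). Could you check whether https://huggingface.co/docs/transformers/installation#offline-mode solves your problem? You are welcome to post an update with the eventual solution.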

wpNZC commented 3 months ago

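@wanghao9610 It runs now, and I found the problem. By tracing how the path value changes, I found that when path strings are passed directly as tokenizer_cfg and model_name at BERTEncoder initialization in bert.py, only the last folder of the path is kept, i.e. something equivalent to str.rsplit("/")[-1] is applied, so the path always becomes bert-base-uncased. Since the repo, Hugging Face, and so on take priority over the local path, the errors about connecting to the repo always appear. I therefore tried reassigning the values later in the code.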
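
The effect described here can be illustrated in isolation. The sketch below does not reproduce the repository's code; it only shows what a str.rsplit("/")[-1]-style reduction does to the two paths used in this thread, and why the result is then no longer recognized as a local directory unless a folder of that name happens to exist in the current working directory. The function names are hypothetical.

```python
import os


def last_component(path):
    """Mimic the reduction described above: keep only the text after the last '/'."""
    return path.rsplit("/")[-1]


def is_local_dir(value):
    """True only if `value` resolves to an existing directory from the current cwd."""
    return os.path.isdir(value)


absolute = "/data1/vision/OV-DINO/OV-DINO/ovdino/detrex/modeling/language_backbone/bert-base-uncased"
relative = "./bert-base-uncased"

for original in (absolute, relative):
    reduced = last_component(original)
    print(f"{original!r} -> {reduced!r}, local dir: {is_local_dir(reduced)}")
```

Both inputs reduce to the bare name bert-base-uncased, which matches the name that appears in the OSError message earlier in the thread.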

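The change is in BERTEncoder at line 80 of bert.py:

class BERTEncoder(nn.Module):
    def __init__(
        self,
        tokenizer_cfg,
        model_name,
        output_dim=256,
        padding_mode="longest",
        context_length=48,
        pooling_mode="max",
        post_tokenize=False,
        is_normalize=False,
        is_proj=False,
        is_freeze=False,
        return_dict=False,
    ) -> None:
        super().__init__()
        assert pooling_mode in ["max", "mean", None]
        tokenizer_cfg = dict(tokenizer_name='/data1/vision/OV-DINO/OV-DINO/ovdino/detrex/modeling/language_backbone/bert-base-uncased')  # Your path
        model_name = "/data1/vision/OV-DINO/OV-DINO/ovdino/detrex/modeling/language_backbone/bert-base-uncased"  # Your path

BERTTokenizer works without modification, but consider modifying it too if you run into problems.

Also, can inits/huggingface/hub only be used to store models downloaded online? I previously tried using it as cache_dir without success, but while tracing values I found that cache_dir is inits/huggingface/hub. Perhaps loading via cache_dir could be achieved by combining it with os.path.join?

In addition, "./bert-base-uncased" still does not work. According to the traced values, it also turns into "bert-base-uncased" along the way and fails (still cannot connect to the mirror site =.=). Since the absolute-path approach already works, I did not dig further into these two directions. I hope this offers some direction for improving the project or for others who hit problems running it.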
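
On the cache_dir question: the Hugging Face cache does not store a model as a plain folder named after it. Downloads land under models--<namespace>--<name>/snapshots/<revision>/, which may be why pointing at the cache root, or copying manually downloaded files into it, does not behave like a local model directory. A sketch of locating a cached snapshot with os.path.join follows. The function name find_cached_snapshot is hypothetical, and picking the lexicographically last revision is an arbitrary simplification.

```python
import os


def find_cached_snapshot(cache_dir, repo_id):
    """Return a snapshot directory for `repo_id` inside a Hugging Face hub cache.

    Returns None if the model has not been downloaded into `cache_dir`.
    """
    repo_folder = "models--" + repo_id.replace("/", "--")
    snapshots = os.path.join(cache_dir, repo_folder, "snapshots")
    if not os.path.isdir(snapshots):
        return None
    revisions = sorted(os.listdir(snapshots))
    # Arbitrary choice: the lexicographically last revision folder.
    return os.path.join(snapshots, revisions[-1]) if revisions else None


# Possible usage (cache path taken from this thread, relative to the project root):
# print(find_cached_snapshot("inits/huggingface/hub", "google-bert/bert-base-uncased"))
```

If this returns a directory, that directory can be passed to from_pretrained like any other local path.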

wanghao9610 / OV-DINO

Unable to load the tokenizer #18