Can I fine-tune on my own dataset starting from your trained Macbert4CSC model?

GDUTHeZexi commented 2 years ago

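I don't have enough compute to train on my own dataset together with the SIGHAN+Wang271K data. Is there a way to fine-tune directly from the pretrained model you released?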

shibing624 commented 2 years ago

Yes. Just set BERT_CKPT to shibing624/macbert4csc-base-chinese.

banbsyip commented 2 years ago

In train_macbert4csc.yml I changed BERT_CKPT: "hfl/chinese-macbert-base" to shibing624/macbert4csc-base-chinese, but train.py fails with:

load model, model arch: macbert4csc
Traceback (most recent call last):
  File "/datasdc_3421/asr/ubuntu20.04/espnet/tools/anaconda/envs/yi/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1752, in from_pretrained
    user_agent=user_agent,
  File "/datasdc_3421/asr/ubuntu20.04/espnet/tools/anaconda/envs/yi/lib/python3.7/site-packages/transformers/utils/hub.py", line 292, in cached_path
    local_files_only=local_files_only,
  File "/datasdc_3421/asr/ubuntu20.04/espnet/tools/anaconda/envs/yi/lib/python3.7/site-packages/transformers/utils/hub.py", line 502, in get_from_cache
    _raise_for_status(r)
  File "/datasdc_3421/asr/ubuntu20.04/espnet/tools/anaconda/envs/yi/lib/python3.7/site-packages/transformers/utils/hub.py", line 418, in _raise_for_status
    f"401 Client Error: Repository not found for url: {response.url}. "
transformers.utils.hub.RepositoryNotFoundError: 401 Client Error: Repository not found for url: https://huggingface.co/shibing624/chinese-macbert-base/resolve/main/vocab.txt. If the repo is private, make sure you are authenticated.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 145, in <module>
    main()
  File "train.py", line 76, in main
    tokenizer = BertTokenizer.from_pretrained(cfg.MODEL.BERT_CKPT)
  File "/datasdc_3421/asr/ubuntu20.04/espnet/tools/anaconda/envs/yi/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1763, in from_pretrained
    f"{pretrained_model_name_or_path} is not a local folder and is not a valid model identifier "
OSError: shibing624/chinese-macbert-base is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with use_auth_token or log in with huggingface-cli login and pass use_auth_token=True.

shibing624 commented 2 years ago

Download the model to a local directory and set BERT_CKPT to its absolute path.
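A minimal sketch of how one might check what a BERT_CKPT value will be treated as before starting training. It assumes, as is the case in transformers, that from_pretrained treats an existing directory as a local checkpoint and anything else as a Hub model id. The helper name classify_checkpoint is hypothetical. Note also that the traceback above requests shibing624/chinese-macbert-base, which is not the id recommended earlier in the thread, so the value in the YAML file is worth re-checking as well.

```python
import os


def classify_checkpoint(value: str) -> str:
    """Report how a BERT_CKPT value would most likely be interpreted.

    Hypothetical helper: an existing directory is read from disk,
    anything else is looked up as a model id on the Hugging Face Hub.
    """
    expanded = os.path.expanduser(value)
    if os.path.isdir(expanded):
        return "local directory: " + os.path.abspath(expanded)
    return "hub id: " + value


# Unless a directory with this relative name exists, this is a Hub lookup.
print(classify_checkpoint("shibing624/macbert4csc-base-chinese"))
```

If the output says "hub id" for a path that was meant to be local, the directory does not exist at that location from the point of view of the training process.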

winniechiou commented 2 years ago

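Hello, I would also like to fine-tune and want to confirm the following points. Thank you.
1. What is the difference between macbert4csc and softmaskedbert4csc?
2. Is my understanding of the fine-tuning steps correct? Also, for step 2-c, how do I download the model locally? I can't find a download link.
2-a. Organize my own data into JSON format.
2-b. In train_softmaskedbert4csc.yml, set BERT_CKPT to shibing624/macbert4csc-base-chinese.
2-c. Download the model locally, then run python train.py --config_file train_softmaskedbert4csc.yml
2-d. In the transformers calling code, change the tokenizer and model paths to the local model path: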

import operator
import torch
from transformers import BertTokenizer, BertForMaskedLM
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = BertTokenizer.from_pretrained("shibing624/macbert4csc-base-chinese")
model = BertForMaskedLM.from_pretrained("shibing624/macbert4csc-base-chinese")
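For step 2-a above, here is a rough sketch of turning sentence pairs into JSON records. The field names (id, original_text, wrong_ids, correct_text) are an assumption about the training data layout and are not stated in this thread, so they should be checked against the sample data shipped with the repository. The helper make_record is hypothetical.

```python
import json


def make_record(record_id: str, original_text: str, correct_text: str) -> dict:
    """Build one training record from an (erroneous, corrected) sentence pair.

    Assumed schema, not confirmed by the thread: wrong_ids lists the
    character positions where the two equal-length sentences differ.
    """
    if len(original_text) != len(correct_text):
        raise ValueError("sentence pair must have the same length")
    wrong_ids = [
        i for i, (a, b) in enumerate(zip(original_text, correct_text)) if a != b
    ]
    return {
        "id": record_id,
        "original_text": original_text,
        "wrong_ids": wrong_ids,
        "correct_text": correct_text,
    }


records = [make_record("demo-1", "少先队员因该为老人让坐", "少先队员应该为老人让座")]
# ensure_ascii=False keeps the Chinese characters readable in the output file.
print(json.dumps(records, ensure_ascii=False, indent=2))
```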

shibing624 commented 2 years ago

1. The model architectures differ: macbert4csc uses a Linear layer to simplify the detection part. 2. Download link: https://huggingface.co/shibing624/macbert4csc-base-chinese

selena531 commented 7 months ago

OSError: Can't load tokenizer for 'shibing624/macbert4csc-base-chinese'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'shibing624/macbert4csc-base-chinese' is the correct path to a directory containing all relevant files for a BertTokenizerFast tokenizer. Help, how do I fix this?

shibing624 commented 7 months ago

https://hf-mirror.com/shibing624/macbert4csc-base-chinese/tree/main Use this link to download the model to a local directory.
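As an alternative to downloading each file by hand from the mirror page, the following is a sketch of a scripted download. It assumes the huggingface_hub package is installed and that its HF_ENDPOINT environment variable is set before the package is imported. The function name download_model and the target directory are illustrative choices, not part of pycorrector.

```python
import os

# huggingface_hub reads HF_ENDPOINT when it is imported, so set it first.
# setdefault leaves an endpoint that is already configured untouched.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")


def download_model(
    repo_id: str = "shibing624/macbert4csc-base-chinese",
    target_dir: str = "./macbert4csc-base-chinese",
) -> str:
    """Download all files of repo_id into target_dir; return its absolute path."""
    # Imported lazily so the endpoint above is already in place.
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id=repo_id, local_dir=target_dir)
    return os.path.abspath(target_dir)


# Example call (needs network access):
# print(download_model())
```

The returned absolute path is the value to use for BERT_CKPT or model_name_or_path.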

bucaiLi commented 7 months ago

Did you ever solve the "Can't load tokenizer for 'shibing624/macbert4csc-base-chinese'" OSError quoted above? I have already downloaded the model locally and changed model_name_or_path to the local absolute path, but I still get this error. It seems the model is not being loaded from the local directory at all and is still being fetched from Hugging Face. Does that mean other parameters need to be changed as well?

bucaiLi commented 7 months ago

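Hello. I downloaded everything from the hf-mirror link to my machine, and in the MacBertCorrector class I also set self.tokenizer = BertTokenizerFast.from_pretrained(model_name_or_path, local_files_only=True), where model_name_or_path is the path of the downloaded model, "/data2/LJ/pycorrector/examples/macbert/macbert4csc-base-chinese/". Why do I still get the error Can't load tokenizer for 'shibing624/macbert4csc-base-chinese'? It looks as if it is still loading from Hugging Face rather than from the local directory. Is there some other path I need to change?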

shibing624 commented 7 months ago

After downloading vocab.txt and all the other files, specify the absolute path of the directory that contains them.
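One way to confirm that a downloaded directory is complete is sketched below. The list of expected files is an assumption based on what a BERT tokenizer and model usually need (config.json, vocab.txt, and one weights file). The helper missing_files is hypothetical and only checks file names, not file contents.

```python
import os

# Assumed minimum set of files for a BERT checkpoint directory.
REQUIRED_FILES = ("config.json", "vocab.txt")
WEIGHT_FILES = ("pytorch_model.bin", "model.safetensors")


def missing_files(model_dir: str) -> list:
    """Return the expected files that are absent from model_dir."""
    if not os.path.isdir(model_dir):
        return [model_dir + " (not a directory)"]
    present = set(os.listdir(model_dir))
    missing = [name for name in REQUIRED_FILES if name not in present]
    # Either weights format is acceptable, so only one of them is required.
    if not any(name in present for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing


# Example: an empty result means every expected file name was found.
# print(missing_files("/path/to/macbert4csc-base-chinese"))
```

A misspelled file name or a path that points one level too high would both show up in the returned list.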

bucaiLi commented 7 months ago

Yes, my path is already an absolute path. The function currently looks like this:

def __init__(self, model_name_or_path="/data2/LJ/pycorrector/examples/macbert/macbert4csc-base-chinese"):
    t1 = time.time()
    self.tokenizer = BertTokenizerFast.from_pretrained(model_name_or_path, local_files_only=True)
    self.model = BertForMaskedLM.from_pretrained(model_name_or_path, local_files_only=True)

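My files are under /data2/LJ/pycorrector/examples/macbert/, laid out as follows:

macbertcsc-base-chinese
--onnx (this folder contains all the ONNX files, with the same structure as on Hugging Face)
--pytorch_model.bin
--voacb.txt
--model.safetensors
and the other files.

Where did I go wrong? The path is already absolute. I am running demo.py in the macbert folder.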

shibing624 commented 7 months ago

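You have already downloaded the model files locally and set local_files_only=True in your code, yet the error persists. This may be because from_pretrained is not recognizing the argument as an existing local directory and is therefore treating it as a Hugging Face Hub model id.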

To make sure the model is loaded locally, you can try the following:

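Make sure the folder path and file names are correct. Your code uses the path "/data2/LJ/pycorrector/examples/macbert/macbert4csc-base-chinese/", but the folder in your listing is "/data2/LJ/pycorrector/examples/macbert/macbertcsc-base-chinese". Note that the folder name in the listing is missing the "4". Please make sure the two match.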

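Try the cache_dir parameter of from_pretrained. It lets you specify the directory in which pretrained model files are stored and looked up. You can set it to the directory that holds your model. For example: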

import time

from transformers import BertTokenizerFast, BertForMaskedLM

class MacBertCorrector:
    def __init__(self, model_name_or_path="/data2/LJ/pycorrector/examples/macbert/macbertcsc-base-chinese"):
        t1 = time.time()
        # local_files_only=True forbids any download attempt
        self.tokenizer = BertTokenizerFast.from_pretrained(model_name_or_path, local_files_only=True, cache_dir=model_name_or_path)
        self.model = BertForMaskedLM.from_pretrained(model_name_or_path, local_files_only=True, cache_dir=model_name_or_path)

With this, from_pretrained should look in the specified cache_dir instead of trying to load from the Hugging Face model hub.

bucaiLi commented 7 months ago

The folder name above was just me dropping a "4" when typing. To avoid ambiguity, I have renamed the model folder to bertmodel. The layout under /data2/LJ/pycorrector/examples/macbert/ is now:

bertmodel (folder holding all the model files: model.bin, config.json, and the onnx folder)
api_demo.py
demo.py
predict.py
and the other .py files.

I run python demo.py in this folder. demo.py instantiates the model with m = MacBertCorrector(), and I changed MacBertCorrector to:

class MacBertCorrector:
    def __init__(self, model_name_or_path="/data2/LJ/pycorrector/examples/macbert/bertmodel"):
        t1 = time.time()
        self.tokenizer = BertTokenizerFast.from_pretrained(model_name_or_path, local_files_only=True, cache_dir=model_name_or_path)
        self.model = BertForMaskedLM.from_pretrained(model_name_or_path, local_files_only=True, cache_dir=model_name_or_path)
        self.model.to(device)

But I still get this error:

Traceback (most recent call last):
  File "demo.py", line 36, in <module>
    main()
  File "demo.py", line 14, in main
    m = MacBertCorrector()
  File "/home/omnisky/anaconda3/envs/LJpycorr/lib/python3.7/site-packages/pycorrector/macbert/macbert_corrector.py", line 27, in __init__
    self.tokenizer = BertTokenizerFast.from_pretrained(model_name_or_path)
  File "/home/omnisky/anaconda3/envs/LJpycorr/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1796, in from_pretrained
    f"Can't load tokenizer for '{pretrained_model_name_or_path}'. If you were trying to load it from "
OSError: Can't load tokenizer for 'shibing624/macbert4csc-base-chinese'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'shibing624/macbert4csc-base-chinese' is the correct path to a directory containing all relevant files for a BertTokenizerFast tokenizer.

So it is still not loading from the local directory. Could it be that the model files or JSON files I downloaded through a VPN are corrupted? Or, besides the instantiation, do I need to edit a config.json file? I see that tokenizer_config.json contains "name_or_path": "shibing624/macbert4csc-base-chinese"; is that the cause? I can't tell what I'm doing wrong. Blog posts only suggest setting local_files_only=True.
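The traceback above points at a macbert_corrector.py under site-packages, and line 27 there calls from_pretrained without local_files_only, which suggests that the installed copy of pycorrector is being imported rather than an edited copy in the examples folder. A small sketch for checking which file Python will actually import follows. The helper module_origin is hypothetical and uses only the standard library.

```python
import importlib.util


def module_origin(module_name: str):
    """Return the file path Python would import for module_name, or None."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ImportError:
        # Raised when a parent package of a dotted name cannot be imported.
        return None
    return spec.origin if spec is not None else None


# Prints the path of the file that demo.py really uses, or None if the
# package is not installed in the current environment.
print(module_origin("pycorrector.macbert.macbert_corrector"))
```

If the printed path is inside site-packages, edits made to a different copy of the file have no effect. Passing the directory explicitly, as in MacBertCorrector(model_name_or_path="/data2/LJ/pycorrector/examples/macbert/bertmodel"), avoids editing either copy.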

shibing624 commented 7 months ago

There is no need to touch config.json. Upgrade transformers to the latest version.

bucaiLi commented 7 months ago

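It still doesn't work. I'm already on version 4.30.2 and get the same error. A senior labmate asked whether I need to point at the specific model file rather than the folder (for example /data2/LJ/pycorrector/examples/macbert/bertmodel/pytorch_model.bin). He said that loading first checks the folder for the model before deciding whether to download, but I have already set cache_dir=model_name_or_path and it still fails. I am running the demo on our lab's server, so I downloaded the files on my own computer and then uploaded them to the server. Could the server be the problem? I'll try it locally now. What else could be wrong?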

bucaiLi commented 7 months ago

OK, the model-download error is gone now. I changed model_name_or_path to the local path in the installed file named in the traceback:

File "/home/omnisky/anaconda3/envs/LJpycorr/lib/python3.7/site-packages/pycorrector/macbert/macbert_corrector.py", line 27, in __init__
    self.tokenizer = BertTokenizerFast.from_pretrained(model_name_or_path)

rather than in the macbert_corrector.py under the examples path mentioned above. That fixed the model-loading problem. Now I get this instead:

huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': ''/data2/LJ/pycorrector/examples/macbert/bertmodel''. Use repo_type argument if needed.

shibing624 commented 7 months ago

Ignore the warning.
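If the message does get in the way, one thing that may be worth checking: the repo id printed in the HFValidationError appears wrapped in an extra pair of quotes, which could mean the path string itself contains quote characters (it could also simply be an artifact of how the message was pasted). Under that assumption, a sketch for normalizing the path before passing it on looks like this. The helper clean_model_path is hypothetical.

```python
import os


def clean_model_path(raw: str) -> str:
    """Strip surrounding whitespace and stray quote characters from a path.

    A path such as "'/data/model'" (quotes inside the string) is not an
    existing directory, so it would be treated as a Hub repo id instead.
    """
    cleaned = raw.strip().strip("'\"")
    return os.path.abspath(os.path.expanduser(cleaned))


def is_usable_local_model(raw: str) -> bool:
    """True if the cleaned path points at an existing directory."""
    return os.path.isdir(clean_model_path(raw))


# Example with quotes accidentally embedded in the string:
print(clean_model_path("'/data2/LJ/pycorrector/examples/macbert/bertmodel'"))
```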

shibing624 / pycorrector

Can I fine-tune on my own dataset starting from your trained Macbert4CSC model? #287