hsc748NLP / SikuBERT-for-digital-humanities-and-classical-Chinese-information-processing

SikuBERT: Pre-trained Language Model of the Siku Quanshu (四库BERT)
Apache License 2.0

Cannot load SikuBERT via AutoModel #1

Closed · KoichiYasuoka closed this issue 3 years ago

KoichiYasuoka commented 3 years ago

I've just tried to load SIKU-BERT/sikubert but failed:

>>> from transformers import AutoModel
>>> model = AutoModel.from_pretrained("SIKU-BERT/sikubert")
404 Client Error: Not Found for url: https://huggingface.co/SIKU-BERT/sikubert/resolve/main/config.json
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/transformers/configuration_utils.py", line 466, in get_config_dict
    user_agent=user_agent,
  File "/usr/lib/python3.7/site-packages/transformers/file_utils.py", line 1173, in cached_path
    local_files_only=local_files_only,
  File "/usr/lib/python3.7/site-packages/transformers/file_utils.py", line 1336, in get_from_cache
    r.raise_for_status()
  File "/usr/lib/python3.7/site-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/SIKU-BERT/sikubert/resolve/main/config.json

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/site-packages/transformers/models/auto/auto_factory.py", line 355, in from_pretrained
    pretrained_model_name_or_path, return_unused_kwargs=True, **kwargs
  File "/usr/lib/python3.7/site-packages/transformers/models/auto/configuration_auto.py", line 398, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/lib/python3.7/site-packages/transformers/configuration_utils.py", line 478, in get_config_dict
    raise EnvironmentError(msg)
OSError: Can't load config for 'SIKU-BERT/sikubert'. Make sure that:

- 'SIKU-BERT/sikubert' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'SIKU-BERT/sikubert' is the correct path to a directory containing a config.json file

I've checked the HuggingFace site and found that the model repository contains bert_config.json instead of config.json. How do I load the model?
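
As a workaround, the repository can be cloned and the file renamed locally (a sketch, assuming git-lfs is installed and the hub repository is cloneable):

$ git clone https://huggingface.co/SIKU-BERT/sikubert
$ mv sikubert/bert_config.json sikubert/config.json

>>> from transformers import AutoModel
>>> model = AutoModel.from_pretrained("./sikubert")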

KoichiYasuoka commented 3 years ago

I scrutinized your paper and found that the training data of SikuBERT comprised 28803 distinct characters:

Based on deep learning techniques, the edition of the Siku Quanshu used in this paper is the Wenyuan Ge version. The training set of this experiment contains 536097588 characters in total; after removing duplicates it comprises 28803 distinct Chinese characters, all of them traditional Chinese. The dataset contains fewer characters than the full text of the Siku Quanshu because the annotations in the original were excluded and only the main text was included.

However, the vocab.txt of SikuBERT contains only 21128 tokens, fewer than 28803, and includes many tokens that are not traditional Chinese. How should the 28803 characters be handled with SikuBERT?
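
For example, the characters that fall outside vocab.txt can be listed with the tokenizer (a minimal sketch over an arbitrary sample sentence, using BertTokenizer directly since it only requires vocab.txt):

>>> from transformers import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained("SIKU-BERT/sikubert")
>>> text = "孟子見梁惠王"  # any passage from the Siku Quanshu
>>> [c for c in text if tokenizer.convert_tokens_to_ids(c) == tokenizer.unk_token_id]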

SIKU-BERT commented 3 years ago

Dear @KoichiYasuoka, thank you for your feedback. We counted the characters in the full text of the Siku Quanshu for two reasons: first, for mutual verification against other scholars' research, to ensure the accuracy of the data; second, to provide a traditional-Chinese vocabulary for our upcoming experiment of training a BERT model from scratch. The SikuBERT we released on GitHub and HuggingFace is version 1.0, obtained by further pre-training on top of two models, bert-base-chinese and chinese-roberta-wwm-ext. The vocabulary used is therefore the Chinese vocab.txt provided by Google, which still covers the commonly used traditional Chinese characters. We are currently working on further experiments, and a SikuBERT pre-training model using a traditional-Chinese vocabulary will be released in a later version.
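
This can be verified by comparing the released vocab.txt with the one shipped with bert-base-chinese (a minimal sketch, assuming both repositories are reachable on HuggingFace):

>>> from transformers import BertTokenizer
>>> siku = BertTokenizer.from_pretrained("SIKU-BERT/sikubert")
>>> base = BertTokenizer.from_pretrained("bert-base-chinese")
>>> siku.get_vocab() == base.get_vocab()  # version 1.0 reuses Google's 21128-token vocabulary
True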

SIKU-BERT commented 3 years ago

Dear @KoichiYasuoka, we had named the json file "bert_config.json", so you may have had to rename it. We have now re-uploaded it as config.json, so the model can be loaded via AutoModel.

Yours Siku

KoichiYasuoka commented 3 years ago

Thank you @SIKU-BERT for the information and the bug-fix. I've just confirmed that the model works well:

>>> import torch
>>> from transformers import AutoTokenizer,AutoModelForMaskedLM
>>> tokenizer=AutoTokenizer.from_pretrained("SIKU-BERT/sikubert")
>>> model=AutoModelForMaskedLM.from_pretrained("SIKU-BERT/sikubert")
>>> tokens=tokenizer.tokenize("孟子[MASK]梁惠王")
>>> mask=tokens.index("[MASK]")
>>> ids=torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
>>> with torch.no_grad():
...   outputs=model(ids)
...   pred=outputs[0][0,mask].topk(5)
...
>>> for i,t in enumerate(tokenizer.convert_ids_to_tokens(pred.indices)):
...   tokens[mask]=t
...   print(i+1,tokens)
...
1 ['孟', '子', '[UNK]', '梁', '惠', '王']
2 ['孟', '子', '上', '梁', '惠', '王']
3 ['孟', '子', '交', '梁', '惠', '王']
4 ['孟', '子', '書', '梁', '惠', '王']
5 ['孟', '子', '間', '梁', '惠', '王']
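
The same top-5 predictions can also be obtained with the fill-mask pipeline (an equivalent sketch):

>>> from transformers import pipeline
>>> fill = pipeline("fill-mask", model="SIKU-BERT/sikubert")
>>> [r["token_str"] for r in fill("孟子[MASK]梁惠王")]  # top_k defaults to 5
['[UNK]', '上', '交', '書', '間']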

Thank you again; I'm closing this issue now. I'm looking forward to the later versions.

SIKU-BERT commented 2 years ago

Dear @KoichiYasuoka, our model with the new vocabulary built from the original text of the Siku Quanshu has been released; compared with the old one, the new model adds about 8000 characters commonly used in ancient Chinese. The original model on HuggingFace has been replaced, and users can load it in the same way as before.
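
The enlarged vocabulary can be confirmed directly from the tokenizer (a minimal sketch):

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("SIKU-BERT/sikubert")
>>> tokenizer.vocab_size  # version 1.0 had 21128 tokens
29791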

KoichiYasuoka commented 2 years ago

Thank you @SIKU-BERT for the new release. I've just confirmed that sikubert now has 29791 tokens, but sikuroberta's vocab.txt still consists of 21128 tokens. Does sikuroberta remain the old version?

SIKU-BERT commented 2 years ago

Oh, sorry. Maybe the last upload failed; the vocab has now been re-uploaded. In fact, sikubert and sikuroberta share the same new vocabulary, so you can now reload the model with the new vocab.
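
This can be double-checked by comparing the two vocabularies (a minimal sketch):

>>> from transformers import AutoTokenizer
>>> bert = AutoTokenizer.from_pretrained("SIKU-BERT/sikubert")
>>> roberta = AutoTokenizer.from_pretrained("SIKU-BERT/sikuroberta")
>>> bert.get_vocab() == roberta.get_vocab()  # both now ship the enlarged vocabulary
True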