hsc748NLP / SikuBERT-for-digital-humanities-and-classical-Chinese-information-processing

SikuBERT (四库BERT): Pre-training Language Model of the Siku Quanshu (四库全书)
Apache License 2.0









Download link: https://pan.baidu.com/s/1--S-qyUedIvhBKwapQjPsA Extraction code: m36d



Pre-trained model for ancient Chinese text generation, SikuGPT2: https://huggingface.co/JeffreyLau/SikuGPT2

Pre-trained model for ancient Chinese poetry generation, SikuGPT2-poem: https://huggingface.co/JeffreyLau/SikuGPT2-poem



Huggingface Transformers

The from_pretrained method of Huggingface Transformers can fetch the SikuBERT and SikuRoBERTa models online directly.

from transformers import AutoTokenizer, AutoModel

# SikuBERT
tokenizer = AutoTokenizer.from_pretrained("SIKU-BERT/sikubert")
model = AutoModel.from_pretrained("SIKU-BERT/sikubert")

# SikuRoBERTa (alternative; load one model or the other)
tokenizer = AutoTokenizer.from_pretrained("SIKU-BERT/sikuroberta")
model = AutoModel.from_pretrained("SIKU-BERT/sikuroberta")





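| Model name | Netdisk link |
| --- | --- |
| sikubert | Link (extraction code: jn94) |
| sikuroberta | Link (extraction code: ihgq) |

| Model name | Netdisk link |
| --- | --- |
| sikubert_vocabtxt (recommended download) | Link (extraction code: v68d) |
| sikuroberta_vocabtxt (recommended download) | Link (extraction code: 93cr) |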


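| Task | Pre-trained model | Precision (P) | Recall (R) | F1 (harmonic mean) |
| --- | --- | --- | --- | --- |
| Word segmentation (分词) | BERT-base-chinese | 86.99% | 88.15% | 87.56% |
| | RoBERTa | 80.90% | 84.77% | 82.79% |
| | SikuBERT | 88.62% | 89.08% | 88.84% |
| | SikuRoBERTa | 88.48% | 89.03% | 88.88% |
| POS tagging (词性标注) | BERT-base-chinese | 89.51% | 90.10% | 89.73% |
| | RoBERTa | 86.70% | 88.45% | 87.50% |
| | SikuBERT | 89.89% | 90.41% | 90.10% |
| | SikuRoBERTa | 89.74% | 90.49% | 90.06% |
| Sentence segmentation (断句) | BERT-base-chinese | 78.77% | 78.63% | 78.70% |
| | RoBERTa | 66.71% | 66.38% | 66.54% |
| | SikuBERT | 87.38% | 87.68% | 87.53% |
| | SikuRoBERTa | 86.81% | 87.02% | 86.91% |

| Task | Pre-trained model | Entity type | Precision (P) | Recall (R) | F1 (harmonic mean) |
| --- | --- | --- | --- | --- | --- |
| NER (实体识别) | BERT-base-chinese | nr (person name) | 86.66% | 87.35% | 87.00% |
| | | ns (place name) | 83.99% | 87.00% | 85.47% |
| | | t (time) | 96.96% | 95.15% | 96.05% |
| | | avg/prf | 86.99% | 88.15% | 87.56% |
| | RoBERTa | nr (person name) | 79.88% | 83.69% | 81.74% |
| | | ns (place name) | 78.86% | 84.08% | 81.39% |
| | | t (time) | 91.45% | 91.79% | 91.62% |
| | | avg/prf | 80.90% | 84.77% | 82.79% |
| | SikuBERT | nr (person name) | 88.65% | 88.23% | 88.44% |
| | | ns (place name) | 85.48% | 88.20% | 86.81% |
| | | t (time) | 97.34% | 95.52% | 96.42% |
| | | avg/prf | 88.62% | 89.08% | 88.84% |
| | SikuRoBERTa | nr (person name) | 87.74% | 88.23% | 87.98% |
| | | ns (place name) | 86.55% | 88.73% | 87.62% |
| | | t (time) | 97.35% | 95.90% | 96.62% |
| | | avg/prf | 88.48% | 89.30% | 88.88% |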





English version


Digital humanities research depends on large-scale corpora and high-performance natural language processing tools for ancient Chinese. Pre-trained language models have greatly improved the accuracy of text mining on English and modern Chinese texts, but a pre-trained model built specifically for the automatic processing of ancient texts is still urgently needed.

We used the verified, high-quality full-text corpus of the "Siku Quanshu" as the training set and, based on the BERT deep language model framework, built the SikuBERT and SikuRoBERTa pre-trained language models for intelligent processing of ancient Chinese.

To verify the models' performance, we designed four downstream tasks on an ancient Chinese corpus of the "Zuo Zhuan": automatic word segmentation, sentence segmentation (punctuation), part-of-speech tagging, and named entity recognition.
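
Tasks such as word segmentation are commonly framed as per-character sequence labeling when fine-tuning BERT-style models. The README does not state which tagging scheme was used, so the sketch below assumes the widely used BMES scheme (Begin, Middle, End, Single). The function names and the example segmentation are illustrative only.

```python
from typing import List


def words_to_bmes(words: List[str]) -> List[str]:
    """Turn a list of segmented words into one BMES tag per character."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first char = B, last char = E, everything between = M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def bmes_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their predicted BMES tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends here
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


if __name__ == "__main__":
    # Illustrative segmentation of a line from the Zuo Zhuan (an assumption,
    # not taken from the project's annotated corpus).
    words = ["鄭伯", "克", "段", "于", "鄢"]
    tags = words_to_bmes(words)
    print(tags)
    print(bmes_to_words("".join(words), tags))
```

A fine-tuned token-classification model would predict the tag sequence, and `bmes_to_words` would turn those predictions back into a segmentation that can be scored with precision, recall, and F1.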


Version 1.0 of sikuaip, an intelligent ancient-book processing platform for digital humanities, has been officially released.

Download link: https://pan.baidu.com/s/1--S-qyUedIvhBKwapQjPsA Extraction code: m36d

See the "How to use" folder for instructions on using the platform. The current version supports six functions: word segmentation, sentence segmentation, entity recognition, text classification, part-of-speech tagging, and automatic punctuation. It provides two text processing modes: single-text processing and corpus processing. You are welcome to download and use it.

Pre-trained models for ancient Chinese text generation and ancient Chinese poetry generation

SikuGPT2: https://huggingface.co/JeffreyLau/SikuGPT2

SikuGPT2-poem: https://huggingface.co/JeffreyLau/SikuGPT2-poem

A Practical Exploration of AIGC-Powered Digital Humanities Research: A SikuGPT Driven Research of Ancient Poetry Generation: https://kns.cnki.net/kcms/detail/11.1762.G3.20230426.1046.002.html

How to use

Huggingface Transformers

The from_pretrained method of Huggingface Transformers can fetch the SikuBERT and SikuRoBERTa models online directly.

from transformers import AutoTokenizer, AutoModel

# SikuBERT
tokenizer = AutoTokenizer.from_pretrained("SIKU-BERT/sikubert")
model = AutoModel.from_pretrained("SIKU-BERT/sikubert")

# SikuRoBERTa (alternative; load one model or the other)
tokenizer = AutoTokenizer.from_pretrained("SIKU-BERT/sikuroberta")
model = AutoModel.from_pretrained("SIKU-BERT/sikuroberta")
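
Once a model is loaded, one common use is to turn a passage into a fixed-size vector. The sketch below is one possible way to do that, not the project's own pipeline: the `embed` and `mean_pool` helpers are hypothetical names, mean pooling over the last hidden layer is an assumed design choice, and PyTorch is assumed to be installed alongside Transformers. The model ids come from the snippets above.

```python
from typing import List

# Hub ids as given in the README.
MODEL_IDS = {
    "sikubert": "SIKU-BERT/sikubert",
    "sikuroberta": "SIKU-BERT/sikuroberta",
}


def mean_pool(token_vectors: List[List[float]], mask: List[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is non-zero."""
    kept = [vec for vec, m in zip(token_vectors, mask) if m]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed(text: str, model_name: str = "sikubert") -> List[float]:
    """Return a mean-pooled sentence vector (downloads the model on first use)."""
    # Imported lazily so the pure-Python helper above works without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    model_id = MODEL_IDS[model_name]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.last_hidden_state[0].tolist()  # one vector per token
    mask = inputs["attention_mask"][0].tolist()
    return mean_pool(hidden, mask)


if __name__ == "__main__":
    vector = embed("鄭伯克段于鄢")
    print(len(vector))
```

For tagging tasks such as word segmentation or NER, a token-classification head would normally be fine-tuned on labeled data instead of pooling the hidden states.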

Download PTM

From Huggingface

From Google Drive

If you are outside China, the models are also available on Google Drive.

Download links for the old version:

| Model | Link |
| --- | --- |
| sikubert | https://drive.google.com/drive/folders/1blElNRhouuaU-ZGA99ahud1QL7Y-7PEZ?usp=sharing |
| sikuroberta | https://drive.google.com/drive/folders/13ToN58XfsfHIIj7pjLWNgLvWAqHsb0_a?usp=sharing |

Updated download links for sikubert and sikuroberta with the new vocabulary:

| Model | Link |
| --- | --- |
| sikubert_vocabtxt (recommended download) | https://drive.google.com/drive/folders/1uA7m54Cz7ZhNGxFM_DsQTpElb9Ns77R5?usp=sharing |
| sikuroberta_vocabtxt (recommended download) | https://drive.google.com/drive/folders/1i0ldNODE1NC25Wzv0r7v1Thda8NscK3e?usp=sharing |
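
After downloading a model folder manually, `from_pretrained` can be pointed at the local directory instead of a hub id. The sketch below is a hedged example: the expected file names are an assumption based on typical BERT-style checkpoints and may not match the contents of the shared folders, and `missing_files` and `load_local` are hypothetical helper names.

```python
from pathlib import Path
from typing import Iterable, List

# Assumed file names, typical of a BERT-style checkpoint; the actual
# contents of the downloaded folder may differ.
EXPECTED_FILES = ("config.json", "vocab.txt")


def missing_files(model_dir: str, expected: Iterable[str] = EXPECTED_FILES) -> List[str]:
    """List the expected files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


def load_local(model_dir: str):
    """Load tokenizer and model from a local folder after a quick sanity check."""
    missing = missing_files(model_dir)
    if missing:
        raise FileNotFoundError(f"{model_dir} is missing: {', '.join(missing)}")
    # Imported lazily so the check above works without Transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModel.from_pretrained(model_dir)
    return tokenizer, model


if __name__ == "__main__":
    # "./sikubert_vocabtxt" is a placeholder path for the unpacked download.
    print(missing_files("./sikubert_vocabtxt"))
```

Checking for the files first gives a clearer error than a failed load when a download is incomplete.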

Evaluation & Results

| Task | Pre-trained model | Precision (P) | Recall (R) | F1 |
| --- | --- | --- | --- | --- |
| Word segmentation | BERT-base-chinese | 86.99% | 88.15% | 87.56% |
| | RoBERTa | 80.90% | 84.77% | 82.79% |
| | SikuBERT | 88.62% | 89.08% | 88.84% |
| | SikuRoBERTa | 88.48% | 89.03% | 88.88% |
| POS tagging | BERT-base-chinese | 89.51% | 90.10% | 89.73% |
| | RoBERTa | 86.70% | 88.45% | 87.50% |
| | SikuBERT | 89.89% | 90.41% | 90.10% |
| | SikuRoBERTa | 89.74% | 90.49% | 90.06% |
| Sentence segmentation | BERT-base-chinese | 78.77% | 78.63% | 78.70% |
| | RoBERTa | 66.71% | 66.38% | 66.54% |
| | SikuBERT | 87.38% | 87.68% | 87.53% |
| | SikuRoBERTa | 86.81% | 87.02% | 86.91% |

| Task | Pre-trained model | Entity type | Precision (P) | Recall (R) | F1 |
| --- | --- | --- | --- | --- | --- |
| NER | BERT-base-chinese | nr (person name) | 86.66% | 87.35% | 87.00% |
| | | ns (place name) | 83.99% | 87.00% | 85.47% |
| | | t (time) | 96.96% | 95.15% | 96.05% |
| | | avg/prf | 86.99% | 88.15% | 87.56% |
| | RoBERTa | nr (person name) | 79.88% | 83.69% | 81.74% |
| | | ns (place name) | 78.86% | 84.08% | 81.39% |
| | | t (time) | 91.45% | 91.79% | 91.62% |
| | | avg/prf | 80.90% | 84.77% | 82.79% |
| | SikuBERT | nr (person name) | 88.65% | 88.23% | 88.44% |
| | | ns (place name) | 85.48% | 88.20% | 86.81% |
| | | t (time) | 97.34% | 95.52% | 96.42% |
| | | avg/prf | 88.62% | 89.08% | 88.84% |
| | SikuRoBERTa | nr (person name) | 87.74% | 88.23% | 87.98% |
| | | ns (place name) | 86.55% | 88.73% | 87.62% |
| | | t (time) | 97.35% | 95.90% | 96.62% |
| | | avg/prf | 88.48% | 89.30% | 88.88% |
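
The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick sanity check, the sketch below recomputes F1 for the four sentence-segmentation rows from the table. The `f1_score` helper is an illustrative name, and because the published P and R are already rounded, recomputed values can differ from the table in the last digit.

```python
from typing import List, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (model, P, R, reported F1), copied from the sentence-segmentation rows.
ROWS: List[Tuple[str, float, float, float]] = [
    ("BERT-base-chinese", 78.77, 78.63, 78.70),
    ("RoBERTa", 66.71, 66.38, 66.54),
    ("SikuBERT", 87.38, 87.68, 87.53),
    ("SikuRoBERTa", 86.81, 87.02, 86.91),
]

if __name__ == "__main__":
    for model, p, r, reported in ROWS:
        recomputed = f1_score(p, r)
        print(f"{model}: recomputed {recomputed:.2f}, reported {reported:.2f}")
```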




Contact us