[TOC]
Digital humanities research requires the support of large-scale corpora and high-performance natural language processing tools for classical Chinese. Pretrained language models have greatly improved the accuracy of text mining on English and modern Chinese texts, and a pretrained model dedicated to the automatic processing of classical Chinese is urgently needed.

Using the proofread, high-quality full-text corpus of the Siku Quanshu as the training set, we built SikuBERT and SikuRoBERTa, two pretrained language models for the intelligent processing of classical Chinese, on top of the BERT deep language model framework.

To verify model performance, we designed four downstream tasks on the Zuozhuan corpus: automatic word segmentation, sentence segmentation and punctuation, part-of-speech tagging, and named entity recognition.
SikuBERT and SikuRoBERTa are trained on the corpus of the Siku Quanshu. The Siku Quanshu, also known as the Qinding (imperially commissioned) Siku Quanshu, is a large collection of books compiled during the Qianlong reign of the Qing dynasty. The annotations in the original text were removed and only the main text was included; the resulting training set contains 536,097,588 characters, all in traditional Chinese.

Following the idea of domain-adaptive pretraining, SikuBERT and SikuRoBERTa combine the BERT architecture with this large classical Chinese corpus, continuing the training of the BERT and RoBERTa models respectively to obtain pretrained models for the automatic processing of classical Chinese.
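The continued-pretraining code is not included here, but the domain-adaptive idea described above can be sketched with the Hugging Face Trainer. The snippet below is a minimal illustration only: it assumes bert-base-chinese as the starting checkpoint, and the corpus file and hyperparameters are placeholders, not the settings actually used to train SikuBERT.

```python
# Minimal sketch of domain-adaptive (continued) masked-language-model pretraining.
# "siku_corpus.txt" and all hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# One passage of traditional-Chinese text per line.
dataset = load_dataset("text", data_files={"train": "siku_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking, as in standard BERT/RoBERTa pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="sikubert-continued",
                         per_device_train_batch_size=16,
                         num_train_epochs=1,
                         save_steps=10_000)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()
```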
2023/7
The "Guji" series of models was officially released at the 2023 China Information Science Annual Conference (中国情报学年会) and on the "比特人文" WeChat official account. The new "Guji" series comprises 9 pretrained models in 3 categories, accommodating the different processing preferences of researchers working with ancient texts.
For download links and a detailed introduction to the "Guji" series, see: https://github.com/hsc748NLP/GujiBERT-and-GujiGPT, https://huggingface.co/hsc748NLP
Download link: https://pan.baidu.com/s/1--S-qyUedIvhBKwapQjPsA (extraction code: m36d)
See the "使用方法" ("How to use") folder for instructions on using the platform. The current version supports six functions: word segmentation, sentence segmentation, named entity recognition, text classification, part-of-speech tagging, and automatic punctuation. It provides two text processing modes, single-text processing and corpus processing. You are welcome to download and use it!
Pretrained model for classical Chinese text generation, SikuGPT2: https://huggingface.co/JeffreyLau/SikuGPT2
Pretrained model for classical Chinese poetry generation, SikuGPT2-poem: https://huggingface.co/JeffreyLau/SikuGPT2-poem
Paper: A Practical Exploration of AIGC-Powered Digital Humanities Research: A SikuGPT Driven Research of Ancient Poetry Generation: https://kns.cnki.net/kcms/detail/11.1762.G3.20230426.1046.002.html
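Since the SikuGPT2 checkpoints are GPT-2 models, they should work with the standard text-generation pipeline of Transformers. The snippet below is an assumed usage sketch, not official example code; the prompt and sampling settings are arbitrary.

```python
# Hypothetical usage sketch for SikuGPT2 with the text-generation pipeline.
from transformers import pipeline

generator = pipeline("text-generation", model="JeffreyLau/SikuGPT2")
print(generator("孟子見梁惠王", max_new_tokens=50, do_sample=True, top_p=0.9))
```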
SikuBERT and SikuRoBERTa can be obtained directly online with the from_pretrained method of Hugging Face Transformers.
# Load SikuBERT
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("SIKU-BERT/sikubert")
model = AutoModel.from_pretrained("SIKU-BERT/sikubert")

# Load SikuRoBERTa
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("SIKU-BERT/sikuroberta")
model = AutoModel.from_pretrained("SIKU-BERT/sikuroberta")
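As a quick check that the models load correctly, the sketch below (not from the original documentation) encodes a short classical Chinese sentence with SikuBERT and prints the shape of the character-level hidden states; the example sentence is arbitrary.

```python
# Minimal sketch: obtain contextual character representations with SikuBERT.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("SIKU-BERT/sikubert")
model = AutoModel.from_pretrained("SIKU-BERT/sikubert")

inputs = tokenizer("天命之謂性,率性之謂道。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# (batch_size, sequence_length, hidden_size); hidden_size is 768 for a BERT-base model.
print(outputs.last_hidden_state.shape)
```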
The models are released as PyTorch checkpoints. They can be downloaded directly from the Hugging Face Hub, where they have been updated to the latest version:
SikuBERT: https://huggingface.co/SIKU-BERT/sikubert
SikuRoBERTa: https://huggingface.co/SIKU-BERT/sikuroberta
Old version download links:

| Model | Baidu Netdisk link |
|---|---|
| sikubert | Link (extraction code: jn94) |
| sikuroberta | Link (extraction code: ihgq) |
Download links for sikubert and sikuroberta with the new vocabulary (vocab.txt) have been updated:

| Model | Baidu Netdisk link |
|---|---|
| sikubert_vocabtxt (recommended) | Link (extraction code: v68d) |
| sikuroberta_vocabtxt (recommended) | Link (extraction code: 93cr) |

For users outside China, the same models are also available on Google Drive:

| Model | Google Drive link |
|---|---|
| sikubert_vocabtxt (recommended) | https://drive.google.com/drive/folders/1uA7m54Cz7ZhNGxFM_DsQTpElb9Ns77R5?usp=sharing |
| sikuroberta_vocabtxt (recommended) | https://drive.google.com/drive/folders/1i0ldNODE1NC25Wzv0r7v1Thda8NscK3e?usp=sharing |
Results on the Zuozhuan corpus for word segmentation, part-of-speech tagging, and sentence segmentation:

| Task | Pretrained model | Precision (P) | Recall (R) | F1 |
|---|---|---|---|---|
| Word segmentation | BERT-base-chinese | 86.99% | 88.15% | 87.56% |
| | RoBERTa | 80.90% | 84.77% | 82.79% |
| | SikuBERT | 88.62% | 89.08% | 88.84% |
| | SikuRoBERTa | 88.48% | 89.03% | 88.88% |
| POS tagging | BERT-base-chinese | 89.51% | 90.10% | 89.73% |
| | RoBERTa | 86.70% | 88.45% | 87.50% |
| | SikuBERT | 89.89% | 90.41% | 90.10% |
| | SikuRoBERTa | 89.74% | 90.49% | 90.06% |
| Sentence segmentation | BERT-base-chinese | 78.77% | 78.63% | 78.70% |
| | RoBERTa | 66.71% | 66.38% | 66.54% |
| | SikuBERT | 87.38% | 87.68% | 87.53% |
| | SikuRoBERTa | 86.81% | 87.02% | 86.91% |
Results for named entity recognition:

| Task | Pretrained model | Entity type | Precision (P) | Recall (R) | F1 |
|---|---|---|---|---|---|
| NER | BERT-base-chinese | nr (person name) | 86.66% | 87.35% | 87.00% |
| | | ns (place name) | 83.99% | 87.00% | 85.47% |
| | | t (time) | 96.96% | 95.15% | 96.05% |
| | | avg/prf | 86.99% | 88.15% | 87.56% |
| | RoBERTa | nr (person name) | 79.88% | 83.69% | 81.74% |
| | | ns (place name) | 78.86% | 84.08% | 81.39% |
| | | t (time) | 91.45% | 91.79% | 91.62% |
| | | avg/prf | 80.90% | 84.77% | 82.79% |
| | SikuBERT | nr (person name) | 88.65% | 88.23% | 88.44% |
| | | ns (place name) | 85.48% | 88.20% | 86.81% |
| | | t (time) | 97.34% | 95.52% | 96.42% |
| | | avg/prf | 88.62% | 89.08% | 88.84% |
| | SikuRoBERTa | nr (person name) | 87.74% | 88.23% | 87.98% |
| | | ns (place name) | 86.55% | 88.73% | 87.62% |
| | | t (time) | 97.35% | 95.90% | 96.62% |
| | | avg/prf | 88.48% | 89.30% | 88.88% |
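All four evaluation tasks are character-level sequence labeling tasks. The sketch below shows one plausible way to fine-tune SikuBERT for such a task with AutoModelForTokenClassification; it is not the authors' released training code, and the tag set, training data, and hyperparameters are placeholders.

```python
# Illustrative fine-tuning sketch for character-level sequence labeling
# (e.g. word segmentation or sentence segmentation); everything marked as a
# placeholder is assumed, not taken from the original experiments.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          Trainer, TrainingArguments)

labels = ["B", "I", "E", "S"]  # placeholder 4-tag segmentation scheme
tokenizer = AutoTokenizer.from_pretrained("SIKU-BERT/sikubert")
model = AutoModelForTokenClassification.from_pretrained(
    "SIKU-BERT/sikubert", num_labels=len(labels))

def encode(chars, tags):
    """Tokenize one labelled sentence (parallel lists of characters and tags)."""
    enc = tokenizer(chars, is_split_into_words=True, truncation=True)
    enc["labels"] = [-100 if i is None else labels.index(tags[i])
                     for i in enc.word_ids()]
    return enc

# Placeholder training data; real data would come from the labelled Zuozhuan corpus.
train_data = [(["左", "傳"], ["B", "E"])]
train_dataset = [encode(chars, tags) for chars, tags in train_data]

args = TrainingArguments(output_dir="sikubert-seg", num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=train_dataset,
        data_collator=DataCollatorForTokenClassification(tokenizer)).train()
```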
GB/T 7714-2015 format: [1] 王东波, 刘畅, 朱子赫, 等. SikuBERT与SikuRoBERTa:面向数字人文的《四库全书》预训练模型构建及应用研究[J]. 图书馆论坛, 2022, 42(06): 31-43.
[2]刘江峰, 刘雏菲, 齐月, 等. AIGC助力数字人文研究的实践探索:SikuGPT驱动的古诗词生成研究[J/OL]. 情报理论与实践: 1-12[2023-04-27]. http://kns.cnki.net/kcms/detail/11.1762.G3.20230426.1046.002.html.
[3] Wang D, Liu C, Zhao Z, et al. GujiBERT and GujiGPT: Construction of Intelligent Information Processing Foundation Language Models for Ancient Texts[J]. arXiv preprint arXiv:2307.05354, 2023.
Note: SikuRoBERTa was obtained by continuing to train the RoBERTa-wwm-ext model from the Chinese-BERT-wwm (中文BERT-wwm) project.