This repo contains a PyTorch implementation of a pretrained BERT model for Chinese text classification.
At the root of the project, you will see:
├── pybert
| └── callback
| | └── lrscheduler.py
| | └── trainingmonitor.py
| | └── ...
| └── config
| | └── base.py #a configuration file for storing model parameters
| └── dataset
| └── io
| | └── bert_processor.py
| └── model
| | └── nn
| | └── pretrain
| └── output #save the output of the model
| └── preprocessing #text preprocessing
| └── train #used for training a model
| | └── trainer.py
| | └── ...
| └── utils # a set of utility functions
├── run_bert.py
You need to download the pretrained Chinese BERT model, then:

1. Rename the downloaded files:
    - `bert-base-chinese-pytorch_model.bin` to `pytorch_model.bin`
    - `bert-base-chinese-config.json` to `config.json`
    - `bert-base-chinese-vocab.txt` to `vocab.txt`
2. Place the model, config and vocab files into the `/pybert/pretrain/bert/base-uncased` directory.
3. `pip install pytorch-transformers` from GitHub.
4. Modify `io.bert_processor.py` to adapt it to your data.
5. Modify the configuration in `pybert/config/base.py` (the path of data, ...).
6. Run `python run_bert.py --do_data` to preprocess the data.
7. Run `python run_bert.py --do_train --save_best` to fine-tune the BERT model.
8. Run `python run_bert.py --do_test --do_lower_case` to predict new data.

Epoch: 3 - loss: 0.0222 acc: 0.9939 - f1: 0.9911 val_loss: 0.0785 - val_acc: 0.9799 - val_f1: 0.9800
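The file preparation steps from the setup list (renaming the three downloaded files and placing them in the pretrain directory) can be scripted. The following is only a minimal sketch, not part of this repo: the function name `place_pretrained_files` and the `downloads` folder are hypothetical, and it assumes the three files have already been downloaded under the names listed in the setup steps.

```python
import shutil
from pathlib import Path

# Downloaded file name -> name expected after renaming (taken from the setup steps).
RENAMES = {
    "bert-base-chinese-pytorch_model.bin": "pytorch_model.bin",
    "bert-base-chinese-config.json": "config.json",
    "bert-base-chinese-vocab.txt": "vocab.txt",
}


def place_pretrained_files(download_dir, target_dir):
    """Copy each downloaded file into target_dir under its new name.

    Hypothetical helper: returns the list of destination paths and raises
    FileNotFoundError if one of the downloads is missing.
    """
    download_dir, target_dir = Path(download_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    placed = []
    for src_name, dst_name in RENAMES.items():
        src = download_dir / src_name
        if not src.is_file():
            raise FileNotFoundError(f"missing download: {src}")
        dst = target_dir / dst_name
        shutil.copy2(src, dst)  # copy rather than move, so the downloads are kept
        placed.append(dst)
    return placed
```

For example, `place_pretrained_files("downloads", "pybert/pretrain/bert/base-uncased")` would fill the directory named in the setup steps, assuming the script is run from the project root and `downloads` is wherever the files were saved.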
label | precision | recall | f1-score | support |
---|---|---|---|---|
财经 | 0.97 | 0.96 | 0.96 | 1500 |
体育 | 1.00 | 1.00 | 1.00 | 1500 |
娱乐 | 0.99 | 0.99 | 0.99 | 1500 |
家居 | 0.99 | 0.99 | 0.99 | 1500 |
房产 | 0.96 | 0.97 | 0.96 | 1500 |
教育 | 0.98 | 0.97 | 0.97 | 1500 |
时尚 | 0.99 | 0.98 | 0.99 | 1500 |
时政 | 0.97 | 0.98 | 0.98 | 1500 |
游戏 | 1.00 | 0.99 | 0.99 | 1500 |
科技 | 0.96 | 0.97 | 0.97 | 1500 |
avg / total | 0.98 | 0.98 | 0.98 | 15000 |
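
To make the columns of the report above concrete, here is a minimal sketch of how per-class precision, recall, F1 and support are usually computed from true and predicted labels. It assumes the standard definitions (precision = TP / predicted count, recall = TP / support, F1 = harmonic mean of the two); the repo's own reporting code is not shown here, so `per_class_report` is a hypothetical name and the toy labels are invented for illustration.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each true label."""
    tp, pred_count, support = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        support[t] += 1       # how many samples truly belong to this label
        pred_count[p] += 1    # how many samples were predicted as this label
        if t == p:
            tp[t] += 1        # correct predictions for this label
    report = {}
    for label in sorted(support):
        precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
        recall = tp[label] / support[label]
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1, support[label])
    return report


# Toy labels, invented for illustration only (not the project's data).
y_true = ["体育", "体育", "财经", "财经"]
y_pred = ["体育", "财经", "财经", "财经"]
for label, (p, r, f1, n) in per_class_report(y_true, y_pred).items():
    print(f"{label} | {p:.2f} | {r:.2f} | {f1:.2f} | {n}")
```

In the toy data, one 体育 sample is misclassified as 财经, which lowers the recall of 体育 and the precision of 财经 while leaving the other two values at 1.00.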