lonePatient / BERT-chinese-text-classification-pytorch

This repo contains a PyTorch implementation of a pretrained BERT model for text classification.

BERT Chinese text classification by PyTorch

This repo contains a PyTorch implementation of a pretrained BERT model for Chinese text classification.

Structure of the code

At the root of the project, you will see:

├── pybert
|  └── callback
|  |  └── lrscheduler.py  
|  |  └── trainingmonitor.py 
|  |  └── ...
|  └── config
|  |  └── base.py #a configuration file for storing model parameters
|  └── dataset   
|  └── io    
|  |  └── bert_processor.py
|  └── model
|  |  └── nn 
|  |  └── pretrain 
|  └── output # saves the model output
|  └── preprocessing #text preprocessing 
|  └── train #used for training a model
|  |  └── trainer.py 
|  |  └── ...
|  └── utils # a set of utility functions
├── run_bert.py

Dependencies

How to use the code

You need to download the pretrained Chinese BERT model first.

  1. Download the BERT pretrained model from s3.
  2. Download the BERT config file from s3.
  3. Download the BERT vocab file from s3.
  4. Rename bert-base-chinese-pytorch_model.bin to pytorch_model.bin, bert-base-chinese-config.json to config.json, and bert-base-chinese-vocab.txt to vocab.txt.
  5. Place the model, config, and vocab files into the /pybert/pretrain/bert/base-uncased directory.
  6. pip install pytorch-transformers from GitHub.
  7. Prepare the dataset (BaiduNet, password: ruxu). You can modify io.bert_processor.py to adapt the code to your own data.
  8. Modify the configuration in pybert/config/base.py (the data paths, ...).
  9. Run python run_bert.py --do_data to preprocess the data.
  10. Run python run_bert.py --do_train --save_best to fine-tune the BERT model.
  11. Run python run_bert.py --do_test --do_lower_case to predict on new data.
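
Steps 4 and 5 (renaming the three downloaded files and placing them in the model directory) can be scripted. The following is a minimal sketch, not part of this repo: the helper name `install_pretrained` and the idea of a separate download directory are assumptions made for illustration, and the files are copied rather than moved so the originals stay intact.

```python
import shutil
from pathlib import Path

# Downloaded file name (step 4) -> file name expected after renaming.
RENAMES = {
    "bert-base-chinese-pytorch_model.bin": "pytorch_model.bin",
    "bert-base-chinese-config.json": "config.json",
    "bert-base-chinese-vocab.txt": "vocab.txt",
}


def install_pretrained(download_dir, target_dir):
    """Copy the three downloaded files into target_dir under their new names.

    Hypothetical helper: download_dir is wherever you saved the files,
    target_dir is the directory named in step 5.
    """
    download_dir, target_dir = Path(download_dir), Path(target_dir)

    # Fail early if any of the three downloads is missing.
    missing = [name for name in RENAMES if not (download_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing downloads: {missing}")

    target_dir.mkdir(parents=True, exist_ok=True)
    placed = []
    for src_name, dst_name in RENAMES.items():
        dst = target_dir / dst_name
        shutil.copyfile(download_dir / src_name, dst)
        placed.append(dst)
    return placed
```

For example, assuming the files were saved to a `downloads` folder and the script is run from the project root, `install_pretrained("downloads", "pybert/pretrain/bert/base-uncased")` would leave `pytorch_model.bin`, `config.json`, and `vocab.txt` in the directory from step 5.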

Fine-tuning result

training

Epoch: 3 - loss: 0.0222 acc: 0.9939 - f1: 0.9911 val_loss: 0.0785 - val_acc: 0.9799 - val_f1: 0.9800

classify_report

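| label | precision | recall | f1-score | support |
|---|---|---|---|---|
| 财经 (finance) | 0.97 | 0.96 | 0.96 | 1500 |
| 体育 (sports) | 1.00 | 1.00 | 1.00 | 1500 |
| 娱乐 (entertainment) | 0.99 | 0.99 | 0.99 | 1500 |
| 家居 (home) | 0.99 | 0.99 | 0.99 | 1500 |
| 房产 (real estate) | 0.96 | 0.97 | 0.96 | 1500 |
| 教育 (education) | 0.98 | 0.97 | 0.97 | 1500 |
| 时尚 (fashion) | 0.99 | 0.98 | 0.99 | 1500 |
| 时政 (politics) | 0.97 | 0.98 | 0.98 | 1500 |
| 游戏 (games) | 1.00 | 0.99 | 0.99 | 1500 |
| 科技 (technology) | 0.96 | 0.97 | 0.97 | 1500 |
| avg / total | 0.98 | 0.98 | 0.98 | 15000 |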
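
The "avg / total" row can be checked against the per-class rows. The sketch below assumes that row is a support-weighted mean of the per-class scores (as in older scikit-learn `classification_report` output); the source does not state how it was computed. Because every class has the same support here, this equals a plain mean. The function name `weighted_average` is made up for this example, and the inputs are the rounded values from the report above, so the result is only accurate to two decimals.

```python
# Per-class (precision, recall, f1-score, support), copied from the report above.
REPORT = {
    "财经": (0.97, 0.96, 0.96, 1500),
    "体育": (1.00, 1.00, 1.00, 1500),
    "娱乐": (0.99, 0.99, 0.99, 1500),
    "家居": (0.99, 0.99, 0.99, 1500),
    "房产": (0.96, 0.97, 0.96, 1500),
    "教育": (0.98, 0.97, 0.97, 1500),
    "时尚": (0.99, 0.98, 0.99, 1500),
    "时政": (0.97, 0.98, 0.98, 1500),
    "游戏": (1.00, 0.99, 0.99, 1500),
    "科技": (0.96, 0.97, 0.97, 1500),
}


def weighted_average(report):
    """Return ((precision, recall, f1), total_support), weighting each class by its support."""
    total = sum(row[3] for row in report.values())
    sums = [0.0, 0.0, 0.0]
    for precision, recall, f1, support in report.values():
        for i, value in enumerate((precision, recall, f1)):
            sums[i] += value * support
    averages = tuple(round(s / total, 2) for s in sums)
    return averages, total


averages, total_support = weighted_average(REPORT)
print(averages, total_support)
```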

training figure

Tips