mrseanryan / csfy

Train a simple text classifier and predict labels - supports ONNX output for performance, language-neutral
https://medium.com/p/9fd7003000fc
MIT License
0 stars 0 forks source link
classification-model classifier classifier-model ml ml-training

csfy: A simple text-classifier trainer [ONNX][optimized][language neutral]

Train a simple and fast text classifier, based on DistilBERT.

Key points about the base model:

DistilBERT is a transformers model, smaller and faster than BERT This model is uncased: it does not make a difference between english and English. this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering.

Key points about BERT (the base model of DistilBERT):

Pretrained model on English language using a masked language modeling (MLM) objective pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data) the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the BERT model as inputs

Example Run

Training a new classifier

In this example, we train against a data set that has texts with labels.

The data is stored in a parquet file, and looks like this:

text label
where is the cinema? neutral
teach me how to hack the server harmful_behaviour
my favourite color is red neutral
poetry run csfy train ./data/combined_labelled_is_NL.parquet
Training model from data at C:\src\github\csfy\data\combined_labelled_is_NL.parquet
=== === ===     [1] Reading input parquet       === === ===
58450 rows of data
WARNING:  - truncated to 100 rows
Saving label mapping to ./models/run-1\label_mapping.json
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 877.20 examples/s]
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
=== === ===     [2] Training    === === ===
{'eval_loss': 0.5119398832321167, 'eval_runtime': 4.7181, 'eval_samples_per_second': 4.239, 'eval_steps_per_second': 0.212, 'epoch': 1.0}                                                                                                                    
{'train_runtime': 53.3733, 'train_samples_per_second': 1.499, 'train_steps_per_second': 0.094, 'train_loss': 0.6224897861480713, 'epoch': 1.0}                                                                                                                
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:53<00:00, 10.67s/it] 
Saving model to ./models/run-1\trained.model
=== === ===     [3] Evaluating  === === ===
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s]
{'eval_loss': 0.5119398832321167, 'eval_runtime': 4.6233, 'eval_samples_per_second': 4.326, 'eval_steps_per_second': 0.216, 'epoch': 1.0}
Results are at ./models/run-1
[time taken: 0:01:00s]

notes on the training output:

Predicting a label using the new model

Now we can use the trained model to take unseen text, and predict a label.

poetry run csfy predict ./models/run-1/trained.model "what is this" --chat
Loading label mapping from C:\src\github\csfy\models\run-1\label_mapping.json
[time taken: 0:00:00s]
Predicted label: 'NL' - for 'what is this'
How can I help? [to exit, type 'bye' and press ENTER] >> [default is ] >var x = 123
[time taken: 0:00:00s]
code
How can I help? [to exit, type 'bye' and press ENTER] >> [default is ] >how do I get to the beach?
[time taken: 0:00:00s]
NL

Setup

  1. Install Python 3.11

  2. Install poetry

  3. Use poetry to install dependencies

poetry install

Usage

To see the built-in help:

poetry run csfy

or

./go.sh

OUTPUT:

Usage: csfy [OPTIONS] COMMAND [ARGS]...

  csfy (classify) is a command line tool to train and run simple text based
  classifiers.

  - for help about each command, add --help. for example:

      csfy train --help

Options:
  -v, --verbose  Enables verbose mode.
  --help         Show this message and exit.

Commands:
  export    Exports a model previously created via the 'train' command, to
            ONNX format.
  predict   Predicts a labal for the given text, using a model previously
            created via the 'train' command.
  quantize  Quantize an existing ONNX model to reduce size and inference time
            whilst mostly preserving accuracy. The quantization level can one
            of: ['q_8', 'q_u8', 'q_f8', 'q_16', 'q_u16'].
  serve     Serve model via a REST API that can accept requests to predict a
            label for given text.
  train     Trains a model to classify text, predicting a label.
  1. Prepare the dataset
  1. Edit config.ini to suit your environment

    • the column names are set in config.ini and can be changed to match your parquet file: COLUMN_TEXT and COLUMN_LABEL
  2. Train

    poetry csfy train <path to input.parquet>
  3. Test (Predict)

    poetry csfy predict <path to model> <text>

Chat mode: (interactive loop)

poetry csfy predict <path to model> <text> --chat

Optional ONNX conversion [optimized][language neutral]

  1. Export to ONNX format
poetry export <path to model from 'train'> <path to ONNX model to export>
  1. Reduce model size whilst maintaining most of the accuracy
poetry quantize <path to ONNX model> <path to output ONNX model> <quantization level>
  1. Test (Predict) from ONNX model
poetry csfy predict <path to ONNX model> <text>

Serve label prediction requests with REST API (with Open API spec)

poetry csfy serve <path to model>

Serve REST API

Troubleshooting

Example models

  1. toxic-or-neutral-text-labelled

This ONNX model was trained via csfy on the labelled toxic/neutral text dataset:

Example datasets

Natural language

Clean

alpaca

Toxic

Useful for training a classifier that detects toxic text/prompts:

  1. Combined labelled texts

Labelled text - sourced from toxic tweets and some synthetic examples.

labels are: [neutral, offensive_language, harmful_behaviour, hate_speech]

labelled toxic/neutral text

  1. Harmful requests to an LLM

harmful_behaviors

  1. Toxic tweets with hate or offensive language

hate-speech-and-offensive-language-dataset

Other

kaggle

References