mim-solutions / bert_for_longer_texts

BERT classification model for processing texts longer than 512 tokens. Text is first divided into smaller chunks and after feeding them to BERT, intermediate results are pooled. The implementation allows fine-tuning.

BELT (BERT For Longer Texts)

🚀 New in version 1.1.0: support for multilabel classification and regression. See the examples. 🚀

Project description and motivation

The BELT approach

The BERT model can process texts with a maximal length of 512 tokens (roughly speaking, tokens are equivalent to words). This is a consequence of the model architecture and cannot be adjusted directly. A discussion of this issue can be found here. A method to overcome it was proposed by Devlin (one of the authors of BERT) in that discussion: comment. The main goal of our project is to implement this method and allow the BERT model to process longer texts during prediction and fine-tuning. We dub this approach BELT (BERT For Longer Texts).

More technical details are described in the documentation. We have also prepared a comprehensive blog post in two parts: part 1, part 2.
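The core idea can be sketched in a few lines of plain Python: split the token sequence into overlapping chunks that fit the 512-token limit, classify each chunk, then pool the per-chunk scores into one result. The chunk size, stride, and mean pooling below are illustrative choices for the sketch, not necessarily the library's defaults:

```python
def chunk_tokens(tokens, chunk_size=510, stride=255):
    """Split a token list into overlapping chunks that fit BERT's limit.

    chunk_size leaves room for the [CLS] and [SEP] special tokens;
    stride controls the overlap between consecutive chunks.
    """
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks


def pool_scores(chunk_scores):
    """Mean-pool the per-chunk classification scores into a single score."""
    return sum(chunk_scores) / len(chunk_scores)


tokens = list(range(1200))        # stand-in for 1200 token ids of a long text
chunks = chunk_tokens(tokens)
print(len(chunks))                # 4 chunks for a 1200-token text
print(pool_scores([0.25, 0.75]))  # 0.5
```

In the real model each chunk is fed to BERT independently and the intermediate results are pooled, so a text of any length can be handled at the cost of one forward pass per chunk.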

Attention is all you need, but 512 words is all you have

The 512-token limit of the BERT model dates back to the very beginning of transformer models. The attention mechanism, introduced in the groundbreaking 2017 paper Attention is all you need, scales quadratically with the sequence length. Unlike RNN or CNN models, which can process sequences of arbitrary length, transformers with full attention (like BERT) are infeasible (or at least very expensive) to run on long sequences. To overcome this, alternative models with sparse attention mechanisms were proposed in 2020: BigBird and Longformer.
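To see concretely why full attention is expensive, note that the attention-score matrix has one entry per pair of tokens, so its size grows with the square of the sequence length. A quick back-of-the-envelope check:

```python
def attention_matrix_entries(seq_len, num_heads=1):
    """Number of entries in the full attention-score matrix
    (per head, per layer): one score for every pair of tokens."""
    return num_heads * seq_len * seq_len


print(attention_matrix_entries(512))   # 262144
print(attention_matrix_entries(1024))  # 1048576
# Doubling the sequence length quadruples the attention cost.
print(attention_matrix_entries(1024) / attention_matrix_entries(512))  # 4.0
```

This quadratic growth is what sparse-attention models avoid, and what BELT sidesteps entirely by keeping each BERT forward pass within the 512-token window.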

BELT vs. BigBird vs. LongFormer

Let us now clarify the key differences between the BELT approach to fine-tuning and the sparse attention models BigBird and Longformer:

Installation and dependencies

The project requires Python 3.9+ to run. We recommend training the models on a GPU, so it is necessary to install a torch build compatible with your machine. The right build depends on the GPU drivers: first, check the driver version with the command nvidia-smi, then choose the newest CUDA version compatible with those drivers according to this table (e.g. 11.1). Finally, install the torch build matching that CUDA version; here you can find which torch version is compatible with the CUDA version on your machine.

Another option is to use the CPU-only version of torch:

```bash
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
```

Next, install the package itself. We recommend installing via pip:

```bash
pip3 install belt-nlp
```

If you want to clone the repo in order to run tests or notebooks, you can use the requirements.txt file.

Model classes

Two main classes are implemented:

Interface

The main methods are:
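The model classes follow a scikit-learn-style train/predict workflow. The toy stand-in below sketches that call pattern only; the class name and the exact method names (`fit`, `predict_classes`) are illustrative assumptions here, not the library's API, so check the documentation for the real signatures:

```python
# Toy stand-in for the interface shape: fine-tune on labelled texts,
# then predict one class per input text. The "training" here just
# memorises the majority label, so the example stays self-contained.
class ToyLongTextClassifier:
    def fit(self, x_train, y_train, epochs=3):
        """Pretend to fine-tune: remember the most frequent label."""
        self.majority_ = max(set(y_train), key=y_train.count)
        return self

    def predict_classes(self, x_test):
        """Return one predicted label per input text."""
        return [self.majority_ for _ in x_test]


model = ToyLongTextClassifier()
model.fit(["a long review ...", "another long review ..."], [1, 1])
print(model.predict_classes(["some unseen text"]))  # [1]
```

The real classes additionally take care of tokenization, chunking, and pooling internally, so arbitrarily long raw texts can be passed directly to the training and prediction methods.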

Loading the pre-trained model

By default, the standard English bert-base-uncased model is used as the pre-trained model. However, it is possible to use any BERT or RoBERTa model. To do this, set the parameter pretrained_model_name_or_path. It can be either:

Tests

To make sure everything works properly, run the command pytest tests -rA. By default, the models are trained on small samples on the CPU during tests.

Examples

All examples use public datasets from the Hugging Face Hub.

Binary classification - prediction of sentiment of IMDB reviews

Multilabel classification - recognizing authors of Guardian articles

Regression - prediction of a 1 to 5 rating based on reviews from the Polish e-commerce platform Allegro

Contributors

The project was created at MIM AI by:

If you want to contribute to the library, see the contributing info.

Version history

See CHANGELOG.md.

License

See the LICENSE file for license rights and limitations (MIT).

For Maintainers

The requirements.txt file can be updated using the command:

```bash
bash pip-freeze-without-torch.sh > requirements.txt
```

This script saves all dependencies of the currently active environment except torch.

In order to publish the next version of the package to PyPI, follow these steps: