sentence-transformers/README.md at master · liuyukid/sentence-transformers
Description
Sentence Transformers: Multilingual Sentence Embeddings using BERT / RoBERTa / XLM-RoBERTa & Co. with PyTorch. BERT / RoBERTa / XLM-RoBERTa produce rather poor sentence embeddings out of the box. This repository fine-tunes BERT / RoBERTa / DistilBERT / ALBERT / XLNet with a siamese or triplet network structure to produce semantically meaningful sentence embeddings that can be used in unsupervised scenarios: semantic textual similarity via cosine similarity, clustering, and semantic search.
We provide an increasing number of state-of-the-art pretrained models that can be used to derive sentence embeddings. See Pretrained Models. Details of the implemented approaches can be found in our publication: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (EMNLP 2019).
You can use this code to easily train your own sentence embeddings, tuned for your specific task. We provide various dataset readers, and you can tune sentence embeddings with different loss functions, depending on the structure of your dataset. For further details, see Train your own Sentence Embeddings.
Setup
We recommend Python 3.6 or higher. The model is implemented with PyTorch (at least 1.2.0) using transformers v3.0.2. The code does not work with Python 2.7.
With pip
Install the model with pip:
pip install -U sentence-transformers
From source
Clone this repository and install it with pip:
pip install -e .
Getting Started
Sentence Embeddings with a Pretrained Model
This example shows you how to use an already trained Sentence Transformer model to embed sentences for another task.
First download a pretrained model.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')
Then provide some sentences to the model.
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of strings.',
             'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)
And that's it already. We now have a list of numpy arrays with the embeddings.
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
Training
This framework allows you to fine-tune your own sentence embedding methods, so that you get task-specific sentence embeddings. You have various options to choose from in order to get perfect sentence embeddings for your specific task.
Dataset Download
First, you should download some datasets. For this, run the script examples/datasets/get_data.py. It will download some datasets and store them on your disk.
Model Training from Scratch
training_nli.py fine-tunes BERT (and other transformer models) from the pre-trained checkpoints as provided by Google & Co. It tunes the model on Natural Language Inference (NLI) data: given two sentences, the model should classify whether these two sentences entail, contradict, or are neutral to each other. For this, the two sentences are passed to a transformer model to generate fixed-sized sentence embeddings. These sentence embeddings are then passed to a softmax classifier to derive the final label (entail, contradict, neutral). This generates sentence embeddings that are also useful for other tasks like clustering or semantic textual similarity.
First, we define a sequential model of how a sentence is mapped to a fixed size sentence embedding:
from sentence_transformers import SentenceTransformer, models

# Use BERT for mapping tokens to embeddings
word_embedding_model = models.Transformer('bert-base-uncased')

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
First, we use the BERT model (instantiated from bert-base-uncased) to map the tokens in a sentence to the output embeddings from BERT. The next layer in our model is a Pooling model: in this case, we perform mean pooling. You can also perform max pooling or use the embedding of the CLS token, and you can combine multiple pooling strategies.
These two modules (word_embedding_model and pooling_model) form our SentenceTransformer. Each sentence is now passed first through the word_embedding_model and then through the pooling_model to give fixed sized sentence vectors.
Next, we specify a train dataloader: the NLIDataReader reads the AllNLI dataset, and we generate a dataloader that is suitable for training the SentenceTransformer model. As training loss, we use a softmax classifier (SoftmaxLoss).
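A minimal sketch of this step, assuming the reader-based API of this library version; the dataset path datasets/AllNLI, the file name train.gz, and the batch size are assumptions based on the bundled example scripts, and model is the SentenceTransformer defined above:
from torch.utils.data import DataLoader
from sentence_transformers import SentencesDataset, losses
from sentence_transformers.readers import NLIDataReader

# Read the AllNLI training data (path and file name assumed from examples/datasets/get_data.py)
nli_reader = NLIDataReader('datasets/AllNLI')
train_data = SentencesDataset(nli_reader.get_examples('train.gz'), model=model)
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=16)

# Softmax classifier over the three NLI labels (entail, contradict, neutral)
train_loss = losses.SoftmaxLoss(model=model,
                                sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
                                num_labels=nli_reader.get_num_labels())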
Next, we also specify a dev set. The dev set is used to evaluate the sentence embedding model on unseen data. It can be any data; in this case, we evaluate on the dev set of the STS benchmark dataset. The evaluator computes the performance metric: the cosine similarity between sentence embeddings is computed and the Spearman correlation to the gold scores is reported.
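The training call then looks roughly like the following sketch. The evaluator construction (from_input_examples), the STSBenchmarkDataReader name, and values such as warmup_steps and output_path are assumptions based on the bundled training_nli.py script; older versions construct the evaluator from a DataLoader instead:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import STSBenchmarkDataReader

# Dev data from the STS benchmark (path and file name are assumptions)
sts_reader = STSBenchmarkDataReader('datasets/stsbenchmark')
dev_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(sts_reader.get_examples('sts-dev.csv'))

# Tune the model: one training objective (dataloader + loss) and periodic evaluation on the dev set
model.fit(train_objectives=[(train_dataloader, train_loss)],
          evaluator=dev_evaluator,
          epochs=1,
          evaluation_steps=1000,
          warmup_steps=100,
          output_path='output/training_nli_bert-base-uncased')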
Continue Training on Other Data
training_stsbenchmark_continue_training.py shows an example where training of an already fine-tuned model is continued. In that example, we use a SentenceTransformer model that was first fine-tuned on the NLI data and continue training it on the training data of the STS benchmark.
First, we load a pre-trained model from the server:
model = SentenceTransformer('bert-base-nli-mean-tokens')
The next steps are as before. We specify training and dev data:
In that example, we use CosineSimilarityLoss, which computes the cosine similarity between two sentences and compares this score with a provided gold similarity score.
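A sketch of how the training and dev data can be specified for this example; the STSBenchmarkDataReader name, the file names, and the batch size are assumptions based on the bundled example scripts:
from torch.utils.data import DataLoader
from sentence_transformers import SentencesDataset, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import STSBenchmarkDataReader

sts_reader = STSBenchmarkDataReader('datasets/stsbenchmark')

# Training data: sentence pairs with gold similarity scores
train_data = SentencesDataset(sts_reader.get_examples('sts-train.csv'), model)
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model=model)

# Dev data for evaluation during training
dev_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(sts_reader.get_examples('sts-dev.csv'))
Then we can train as before by calling model.fit with (train_dataloader, train_loss) as the training objective and the evaluator on the dev data.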
Loading SentenceTransformer Models
Loading trained models is easy. You can specify a path:
model = SentenceTransformer('./my/path/to/model/')
Note: It is important that the path contains a / or \, otherwise it is not recognized as a path.
You can also host the training output on a server and download it:
model = SentenceTransformer('http://www.server.com/path/to/model/my_model.zip')
With the first call, the model is downloaded and stored in the local torch cache folder (~/.cache/torch/sentence_transformers). For this to work, you must zip all files and subfolders of your model.
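As an illustrative sketch (not from the original instructions), a saved model folder can be zipped with Python's standard library; the folder path below reuses the hypothetical path from the loading example above:
import shutil

# Packs all files and subfolders of the saved model into my_model.zip
shutil.make_archive('my_model', 'zip', root_dir='./my/path/to/model')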
We also provide several pre-trained models that can be loaded by just passing a name:
model = SentenceTransformer('bert-base-nli-mean-tokens')
This downloads the bert-base-nli-mean-tokens from our server and stores it locally.
Loading custom BERT models
If you have fine-tuned BERT (or a similar model) and want to use it to generate sentence embeddings, you must construct an appropriate SentenceTransformer model from it. This is possible with the following code:
from sentence_transformers import SentenceTransformer, models

# Use BERT for mapping tokens to embeddings
word_embedding_model = models.Transformer('path/to/your/BERT/model')

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
Training Multilingual Sentence Embeddings Models
We provide code and examples to easily train sentence embedding models for various languages, and also to port existing sentence embedding models to new languages. For details, see multilingual-models.md and our publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation.
Pretrained Models
We provide the following models. You can use them in the following way:
model = SentenceTransformer('name_of_model')
English Pre-Trained Models
In the following you find selected models that were trained on English data only. For the full list of available models, see SentenceTransformer Pretrained Models. See the next section for multi-lingual models.
Trained on NLI data
These models were trained on the SNLI and MultiNLI datasets to create universal sentence embeddings. For more details, see: nli-models.md.
bert-base-nli-mean-tokens: BERT-base model with mean-tokens pooling. Performance: STSbenchmark: 77.12
bert-large-nli-mean-tokens: BERT-large with mean-tokens pooling. Performance: STSbenchmark: 79.19
roberta-base-nli-mean-tokens: RoBERTa-base with mean-tokens pooling. Performance: STSbenchmark: 77.49
roberta-large-nli-mean-tokens: RoBERTa-large with mean-tokens pooling. Performance: STSbenchmark: 78.69
distilbert-base-nli-mean-tokens: DistilBERT-base with mean-tokens pooling. Performance: STSbenchmark: 76.97
Trained on STS data
These models were first fine-tuned on the AllNLI dataset, then on the training set of the STS benchmark. They are specifically well suited for semantic textual similarity. For more details, see: sts-models.md.
Multilingual Models
The following models can be used for languages other than English. The vector spaces for the included languages are aligned, i.e., a sentence and its translation are mapped to approximately the same point in vector space, independent of the language. The models can be used for cross-lingual tasks (see the short sketch after the model list below). For more details see multilingual-models.md.
xlm-r-base-en-ko-nli-ststb: Supported languages: English, Korean. Performance on Korean STSbenchmark: 81.47
xlm-r-large-en-ko-nli-ststb: Supported languages: English, Korean. Performance on Korean STSbenchmark: 84.05
xlm-r-40langs-bert-base-nli-mean-tokens: Produces similar embeddings as the bert-base-nli-mean-tokens model for 40 languages: ar, bg, ca, cs, da, de, el, en, es, fa, fi, fr, he, hi, hr, hu, id, it, ja, ko, ku, lt, lv, my, nl, pl, pt, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, vi, zh
xlm-r-40langs-bert-base-nli-stsb-mean-tokens: Produces similar embeddings as the bert-base-nli-stsb-mean-tokens model for the same 40 supported languages: ar, bg, ca, cs, da, de, el, en, es, fa, fi, fr, he, hi, hr, hu, id, it, ja, ko, ku, lt, lv, my, nl, pl, pt, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, vi, zh
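As a short illustrative sketch (not taken from the repository's examples), the cross-lingual alignment can be checked by encoding a sentence and its translation with one of the multilingual models and comparing the embeddings via cosine similarity:
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('xlm-r-40langs-bert-base-nli-stsb-mean-tokens')

# The same sentence in English and German should map to nearby points in vector space
embeddings = model.encode(['This is an example sentence.', 'Dies ist ein Beispielsatz.'])
cosine = np.dot(embeddings[0], embeddings[1]) / (np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1]))
print(cosine)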
Performance
Extensive evaluation is still ongoing, but here we provide some preliminary results.
Model                              | STS benchmark | SentEval
Avg. GloVe embeddings              | 58.02         | 81.52
BERT-as-a-service avg. embeddings  | 46.35         | 84.04
BERT-as-a-service CLS-vector       | 16.50         | 84.66
InferSent - GloVe                  | 68.03         | 85.59
Universal Sentence Encoder         | 74.92         | 85.10
Sentence Transformer Models        |               |
bert-base-nli-mean-tokens          | 77.12         | 86.37
bert-large-nli-mean-tokens         | 79.19         | 87.78
bert-base-nli-stsb-mean-tokens     | 85.14         | 86.07
bert-large-nli-stsb-mean-tokens    | 85.29         | 86.66
Loss Functions
We implemented various loss functions that allow training of sentence embeddings from various datasets. These loss functions are in the package sentence_transformers.losses; a short usage sketch follows the list below.
SoftmaxLoss: Given the sentence embeddings of two sentences, trains a softmax-classifier. Useful for training on datasets like NLI.
CosineSimilarityLoss: Given a sentence pair and a gold similarity score (either between -1 and 1 or between 0 and 1), computes the cosine similarity between the sentence embeddings and minimizes the mean squared error loss.
TripletLoss: Given a triplet (anchor, positive example, negative example), minimizes the triplet loss.
BatchHardTripletLoss: Implements the batch hard triplet loss from the paper In Defense of the Triplet Loss for Person Re-Identification. Each batch must contain multiple examples from the same class. For each anchor, the loss then uses the most distant positive example and the closest negative example.
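A minimal usage sketch, here with CosineSimilarityLoss; the example pairs and scores are made up, the InputExample import path reflects the reader module of this library version, and model is a SentenceTransformer as constructed earlier:
from torch.utils.data import DataLoader
from sentence_transformers import SentencesDataset, losses
from sentence_transformers.readers import InputExample

# Sentence pairs with gold similarity scores in [0, 1]
train_examples = [InputExample(texts=['A man is eating food.', 'A man is eating a meal.'], label=0.9),
                  InputExample(texts=['A man is eating food.', 'A plane is taking off.'], label=0.1)]

train_data = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)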
Models
This framework implements various modules that can be used sequentially to map a sentence to a sentence embedding. The different modules can be found in the package sentence_transformers.models. Each pipeline consists of the following modules:
Word Embeddings: These models map tokens to token embeddings.
Transformer: You can use any Hugging Face pretrained model, including BERT, RoBERTa, DistilBERT, ALBERT, XLNet, XLM-RoBERTa, ELECTRA, FlauBERT, CamemBERT, and more.
Embedding Transformations: These models transform token embeddings in some way.