huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

How to use BERT for finding similar sentences or similar news? #876

Closed Raghavendra15 closed 3 years ago

Raghavendra15 commented 5 years ago

I have used the BERT NextSentencePredictor to find similar sentences or similar news, but it's super slow, even on a Tesla V100, which is the fastest GPU so far. It takes around 10 seconds for a query title against around 3,000 articles. Is there a better way to use BERT for finding similar sentences or similar news, given a corpus of news articles?

nreimers commented 5 years ago

Hi, BERT out-of-the-box is not the best option for this task, as the run-time in your setup scales with the number of sentences in your corpus. I.e., if you have 10,000 sentences/articles in your corpus, you need to classify 10k pairs with BERT, which is rather slow.

A better option is to generate sentence embeddings: every sentence / article is mapped to a fixed-size vector. You need to map your 3k articles to vectors only once.

A new query is then also mapped to a vector. In this setup, you only need to run BERT for one sentence (at inference), independent of how large your corpus is.

Then, you can use cosine similarity or Manhattan / Euclidean distance to find the sentence embeddings that are closest, i.e., the most similar.

I released today a framework which uses pytorch-transformers for exactly that purpose: https://github.com/UKPLab/sentence-transformers

I also uploaded an example for semantic search, where each sentence in a corpus is mapped to a vector and then cosine similarity is used to find the most similar sentences / vectors: https://github.com/UKPLab/sentence-transformers/blob/master/examples/application_semantic_search.py
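For reference, here is a minimal sketch of that setup (assuming the sentence-transformers API and scipy; the model name and example texts are just illustrative):

from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cdist

embedder = SentenceTransformer('bert-base-nli-mean-tokens')

# Embed the corpus once; this is the slow step and can be cached
corpus = ['First news article ...', 'Second news article ...', 'Third news article ...']
corpus_embeddings = embedder.encode(corpus)

# At query time, embed only the query and compare it against all corpus vectors
query_embedding = embedder.encode(['New query title ...'])
distances = cdist(query_embedding, corpus_embeddings, 'cosine')[0]
most_similar = sorted(range(len(corpus)), key=lambda i: distances[i])[:3]  # indices of the closest articles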

Let me know if you have further questions.

stefan-it commented 5 years ago

I think you can use faiss for storing and finding similar embeddings.
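A small sketch of that idea (assuming faiss is installed and the embeddings are float32 numpy arrays; the shapes and values are placeholders):

import numpy as np
import faiss

# Placeholder embeddings; in practice these come from the sentence encoder
corpus_embeddings = np.random.rand(3000, 768).astype('float32')
query_embedding = np.random.rand(1, 768).astype('float32')

# Normalize so that inner product equals cosine similarity, then build an exact index
faiss.normalize_L2(corpus_embeddings)
faiss.normalize_L2(query_embedding)
index = faiss.IndexFlatIP(corpus_embeddings.shape[1])
index.add(corpus_embeddings)

scores, ids = index.search(query_embedding, 5)  # the 5 most similar corpus vectors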

Raghavendra15 commented 5 years ago

@nreimers Amazing!! Thank you so much. What you created is a real life-saver! Can this be used for finding similar news (given title and abstract)? I ran the code and I have the following doubts. Which model should I use: bert-large-nli-stsb-mean-tokens, bert-base-nli-mean-tokens, or bert-large-nli-mean-tokens? (What are the datasets on which these models were trained?)

Can I use faiss to compute the search/distance of the vectors instead of L2/Manhattan/Cosine distances?

Many thanks to @stefan-it for introducing me to faiss.

Raghavendra15 commented 5 years ago

@nreimers I don't think scipy.spatial.distance.cdist is good enough; it takes a lot of time to compute the results, almost 10 minutes on a corpus of 3.9k news articles. I think I should try using faiss. I don't know anything about faiss, but I will try.

nreimers commented 5 years ago

Hi @Raghavendra15, regarding the model I sadly cannot be helpful; you would need to test them. In general, sentence embedding methods (like InferSent, Universal Sentence Encoder, or my repository) work well for short text, i.e., for sentences. For longer text with multiple sentences, their performance often decreases, and average word embeddings or tf-idf is in many cases a much better choice. For longer texts, all these sentence embedding methods are not really needed.

It would be great if you have some training data. Then, it would be quite easy to fine-tune a model specifically for your task. It should achieve much better performances than the pre-trained models.

I think the issue is not scipy.spatial.distance.cdist. On a corpus with 100k embeddings and an embedding size of 1024, it requires about 0.2 seconds per query (if you can batch queries, even less time is needed).

I think the issue might be the generation of the 4k sentence embeddings. Transformer networks like BERT are extremely slow on CPUs. On a GPU, the implementation can process about 2000 sentences per second; on a CPU, only about 40 sentences.

But the corpus only needs to be processed once and can then be stored & loaded from disk. At inference, you just need to generate one embedding for the respective query.
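A small sketch of that caching step (assuming numpy for storage; the file name and texts are illustrative):

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('bert-base-nli-mean-tokens')

# One-time step: embed the whole corpus and store the vectors on disk
corpus = ['Article one ...', 'Article two ...']
np.save('corpus_embeddings.npy', np.asarray(embedder.encode(corpus)))

# At inference: load the cached vectors and embed only the query
corpus_embeddings = np.load('corpus_embeddings.npy')
query_embedding = embedder.encode(['Query title ...'])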

You can of course combine this with faiss. Faiss generates index structures that allow a quick search in vector space and is especially suitable if you have a high number (millions) of vectors. For 4k vectors, scipy takes about 0.008 seconds per query to find the most similar vectors.

So either something is really strange with scipy on your computer, or the long run-time comes from the generation of the embeddings.

Raghavendra15 commented 4 years ago

@nreimers Thank you very much for your response. You're absolutely right, most of the time is taken by generating the embeddings for the 4k sentences. I'm now torn between this model and XLNet, since XLNet has achieved state-of-the-art results.

Regarding your comments on faiss: as long as I have a smaller dataset, the results from faiss and scipy won't make any difference? However, if I had millions or billions of news articles, then using faiss makes sense, right? For smaller datasets, is there any difference in the quality of matches between faiss and scipy (are the computed distances the same)?

I have one important question. If I want to train the model as you suggested, which would yield better results, I should have a labeled dataset, right? However, for news, I only have the title and abstract of each article. Is there a way to train without labels?

nreimers commented 4 years ago

Hi, XLNet achieved state-of-the-art performance on supervised tasks like classification, but it is unclear whether it also generates good embeddings for unsupervised tasks.

In the framework you can choose XLNet, but I was only able to produce results that are slightly below those of BERT.

Others also have problems getting good performance out of XLNet for supervised tasks, as it appears to be extremely sensitive to the hyperparameters.

If you have millions of docs, faiss makes sense. With scipy, you get exact scores. With faiss, the scores are approximate, and the returned most similar vectors are not necessarily the actual most similar vectors. There can be small variations. But I think the difference will be small.

Often your data has some structure, like categories or links between news articles. This structure can be used to fine-tune a model. Let's say you have links connecting articles about similar events. Then you train the network with triplet loss, using the two linked articles and one random other article as the negative example.

This will give you a vector space where (possibly) linked articles are close.
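A rough sketch of such triplet-loss fine-tuning, assuming a recent sentence-transformers release with InputExample and losses.TripletLoss (the article texts are placeholders):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('bert-base-nli-mean-tokens')

# Each triplet: (anchor article, linked / similar article, random unrelated article)
train_examples = [
    InputExample(texts=['article about event A', 'linked article about event A', 'random other article']),
    InputExample(texts=['article about event B', 'linked article about event B', 'random other article']),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)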

Raghavendra15 commented 4 years ago

@nreimers Thank you very much for your quick response. Is the existing model "bert-large-nli-stsb-mean-tokens" better than the Google News word2vec vectors (google_news_300)? They claim: "We are publishing pre-trained vectors trained on part of Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases." Is the pretrained "bert-large-nli-stsb-mean-tokens" better than Google's pre-trained news vectors?

For training the existing model to improve results for news similarity, the problem is that I can't create a dataset to compute the triplet loss. For triplet loss to work in the case of news similarity, for a query news article ['a'], I need to find a similar news article ['b'] as the positive example and a dissimilar news article ['c'] as the negative example, i.e., <a,b> as a positive pair and <a,c> as a negative pair.

However, if I run this on news every day, new entities/topics are going to pop up every single day. Do I need to update my embeddings? I don't know how to handle this situation.

nreimers commented 4 years ago

Google News vectors are just word vectors; you still need a strategy to derive sentence embeddings from them. But as mentioned earlier, averaging word embeddings is a promising idea for your task. Note, average word embedding models will be added to the repository soon.

Constantly updating the model is not needed. The news changes, but the words used remain the same. So training once should give you a model that can be used for a long time.

Raghavendra15 commented 4 years ago

@nreimers Thank you very much! Any tentative date by when the average word embeddings will be added to the repository?

I want to know how to evaluate the results numerically, for example when I use your model to find, for a given news article, similar news in the corpus.

Is there a way to measure numerically how good the similar sentences are in the below example? I used the BLEU score, but the problem is that it's not an accurate measure of similarity. BLEU doesn't consider the context of the sentence; it just blindly counts whether a word from the query sentence is present in the similar sentence, regardless of where the word is placed.

For an item, I get related items. In the below example, the first title in relatedItems is similar; however, the second item in relatedItems, which talks about Stephen Colbert and Joe Biden, is not at all similar. Suppose I use a word2vec model for the above task and it gives me two totally different sentences as relatedItems. In that case, how can I evaluate both models and claim numerically which one is better?

Example:

{"title": "Google Is Rolling Out A New Version Of Android Auto - Here's What You Can Expect", "abstract": "The new Android Auto. Google If you use Android Auto, you're about to receive to a nice upgrade.", } "relatedItems": [{

"title": "New Android ransomware is spreading through text messages", "abstract": "There\u2019s a new type of Android ransomware making the rounds that leverages SMS to spread, according to a new report from cyberappsecurity com", }, { "title": "Stephen Colbert Brings Curtain Down On Democratic Debates With Joe Biden Tweaks", "abstract": "Stephen Colbert closed his second of two live Late Show monologues with a spree of zingers directed at Joe Biden, mixing in plenty for the o",

} ]}

nreimers commented 4 years ago

BLEU wouldn't be a good measure, because then the best similarity metric to find similar news would, of course, be BLEU itself.

What you would need is an annotated corpus. For a given article, get, for example, the 20 articles with the highest tf-idf similarity. Then annotate every pair as similar or not.

With this data you can compare different methods with NDCG to see how well they rank the 20 candidate articles.
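A small sketch of that comparison, assuming scikit-learn's ndcg_score and that the 20 tf-idf candidates of one query article have already been annotated (1 = similar, 0 = not similar; all values below are placeholders):

import numpy as np
from sklearn.metrics import ndcg_score

# Human annotations for the 20 candidates of one query article
true_relevance = np.asarray([[1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]])

# Similarity scores a method (e.g. sentence embeddings + cosine) assigns to the same 20 candidates
method_scores = np.asarray([[0.9, 0.2, 0.8, 0.3, 0.1, 0.7, 0.4, 0.2, 0.1, 0.3, 0.2, 0.6, 0.1, 0.2, 0.3, 0.1, 0.2, 0.1, 0.3, 0.2]])

print(ndcg_score(true_relevance, method_scores))  # higher NDCG = better ranking of the annotated candidates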

Avg. word embeddings should be added to the repo within the next two weeks.

Raghavendra15 commented 4 years ago

@nreimers When you say that BLEU wouldn't be a good measure because the best similarity metric to find similar news would then be BLEU itself, do you mean that when I get similar news like in the above example, BLEU is the best metric to measure how similar the two news articles are? Please correct me if I understood this wrong.

In the STS benchmark, I looked at pairs in the training dataset with gold-standard, human-evaluated scores. The following pair had a score of 5; however, when I use BLEU scores for 1-grams, they don't get a score of 1. Instead, they get the scores below. BLEU looks for the exact word to be present in the reference sentence, and that's the problem I see: there's no notion of similarity.

from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu

s = word_tokenize("The polar bear is sliding on the snow")
reference = [s]
candidate = word_tokenize("The polar bear is sliding across the snow")
print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))

Individual 1-gram: 0.875000
Individual 2-gram: 0.714286
Individual 3-gram: 0.500000
Individual 4-gram: 0.400000

The reference sentence has 8 words, of which the candidate matches exactly 7, so the 1-gram score is 7/8.

I'm not sure how the STS benchmarks are evaluated; I'm currently looking into it. If you have any leads or a document, I would be more than happy to read them.

Thank you very much for your help :)

nreimers commented 4 years ago

No, BLEU is a terrible idea for evaluation.

STS is usually evaluated using the Pearson correlation between gold and predicted labels. But Pearson correlation is also a bad idea: https://aclweb.org/anthology/C16-1009

I strongly recommend using Spearman correlation for comparison.

Raghavendra15 commented 4 years ago

@nreimers Kudos on the COLING paper! It's very well written. In the paper, you mention how Pearson correlation can be misleading or ill-suited for the semantic textual similarity task. However, you did not suggest using Spearman correlation instead of Pearson correlation there, yet you suggested that I use Spearman correlation. Why? (That's my current understanding of the paper.)

Can I use the Spearman rank correlation from scipy? Basically, I want to compare the output of your BERT model with the output of word2vec to see which one gives better results. So there is a reference sentence and I get a bunch of similar sentences, as I mentioned in the previous example [please refer to the JSON output in the previous comments].

Is the below code the right way to do the comparison? In your sentence transformer, you use the same package in the SentenceEvaluator class, but I couldn't figure out how to use that class for my comparison.

Will you please give me some idea in this regard?

Example code:

from scipy.stats import spearmanr

x = [1, 2, 3]        # I will use BERT and word2vec embeddings here
x_corr = [2, 4, 6]
corr, p_value = spearmanr(x, x_corr)
print(corr)

nreimers commented 4 years ago

Hi @Raghavendra15, the issue with Pearson correlation is that it assumes a linear correlation between the system output and the gold labels. Applying a monotone function to the system output can change the scores (make them better or worse), which does not really make sense in applications.

Assume you have a system that predicts the gold scores perfectly, except that its output is output = sqrt(gold_label).

This system would get a lower Pearson correlation. However, for every application, this system would be perfect, as it reproduces the gold labels up to a monotone transformation. With Spearman correlation, you don't have this issue: there, only the ranking of the scores matters.
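A tiny numeric illustration of this point (a sketch using scipy):

import numpy as np
from scipy.stats import pearsonr, spearmanr

gold = np.linspace(0.01, 1, 50)   # gold similarity scores
pred = np.sqrt(gold)              # system output: a monotone transformation of the gold scores

print(pearsonr(gold, pred)[0])    # below 1.0, and it drops further for stronger non-linear transforms
print(spearmanr(gold, pred)[0])   # exactly 1.0: the ranking is unchanged, so Spearman is perfect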

In general, I think the STS tasks (or the STS benchmark) are not really well suited for evaluating approaches. The STS tasks with Pearson/Spearman correlation weight every score similarly, but in applications we are often only interested in certain examples.

For example, if we search for pairs with the highest similarity, then we don't care about the scores for low-similarity pairs. A system that gives a perfect score to high-similarity pairs and a random score to low-similarity pairs would be great for this application. However, this system would get a low Pearson/Spearman correlation, as it fails to correctly order the somewhat-similar and dissimilar pairs.

If you want to estimate the similarity of two vectors, you should use cosine similarity or Manhattan/Euclidean distance.

Spearman correlation is only used for the comparison to gold scores.

Assume you have the pairs (x_1, y_1), (x_2, y_2), ..., and for every pair (x_i, y_i) you have a score s_i from 0 ... 1 indicating a gold label for their similarity.

You can check how good the embeddings are by computing the cosine similarity between the embeddings of (x_i, y_i) and then computing the Spearman correlation between these cosine similarity scores and the gold scores s_i.
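A minimal sketch of that evaluation recipe (assuming the SentenceTransformer encoder; the sentence pairs and gold scores are placeholders):

import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('bert-base-nli-mean-tokens')

sentences1 = ['A man is eating food.', 'A plane is taking off.', 'A woman is playing violin.']
sentences2 = ['A man is eating a piece of bread.', 'A dog is barking.', 'A woman is playing guitar.']
gold_scores = [0.9, 0.05, 0.6]  # human-annotated similarity s_i for each pair (x_i, y_i)

emb1 = np.asarray(embedder.encode(sentences1))
emb2 = np.asarray(embedder.encode(sentences2))

# Cosine similarity per pair, then Spearman correlation against the gold scores
cos_sim = np.sum(emb1 * emb2, axis=1) / (np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1))
corr, _ = spearmanr(cos_sim, gold_scores)
print(corr)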

Note: I am currently adding methods for computing average word embeddings and similar approaches to the repository, so such a comparison will become easier.

Raghavendra15 commented 4 years ago

@nreimers Last week you added the methods to compute average word embeddings. Should I use that method on top of the sentence embeddings I get, or will there be pre-trained average word embedding weights? In the below code I get the embeddings once I pass the input strings:

corpus = ['A man is eating a food.',
          'A man is eating a piece of bread.']
corpus_embeddings = embedder.encode(corpus)

Or will pre-trained avg word embedding weights be uploaded to the repository sometime this week?

nreimers commented 4 years ago

Hi @Raghavendra15 I just uploaded v0.2.0 to github and PyPi: https://github.com/UKPLab/sentence-transformers

You can update with pip install -U sentence-transformers

I added an example for average word embeddings (+a DAN layer that is trainable): https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_stsbenchmark_avg_word_embeddings.py

You can also use it without the DAN layer. There is also a tokenizer implemented that allows the usage of the word2vec Google News vectors. These vectors contain phrases like 'New_York'. These phrases are detected by the tokenizer and mapped to the correct embedding for New_York. But there is currently no example for this in the repo. If you need help, let me know.

To get avg. word embeddings only (without DAN), the code must look like this:

from sentence_transformers import SentenceTransformer, models

# Map tokens to traditional word embeddings like GloVe
word_embedding_model = models.WordEmbeddings.from_text_file('glove.6B.300d.txt.gz')

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# corpus: list of sentences / articles to embed
corpus_embeddings = model.encode(corpus)

The next release will include support for RoBERTa and add other sentence embedding methods (like USE, LASER), which will be trainable.

Raghavendra15 commented 4 years ago

@nreimers Thank you very much! You read my mind with RoBERTa, I was about to ask you about it. But with the avg-embedding approach, I won't be using BERT at all, right?

In addition, I won't be training the model. I don't think I fully understand this. Earlier I would pass pretrained weights into SentenceTransformer; however, now I won't pass anything related to BERT. Does that mean I won't be using BERT?

nreimers commented 4 years ago

@Raghavendra15 The framework offers you a lot of flexibility. You can choose between different embedding approaches: transformer models like BERT or XLNet, or traditional word embeddings like GloVe / word2vec.

Then, you can choose between different pooling modes: mean pooling, max pooling, or usage of the CLS token for BERT / XLNet.

Finally, if you like, you can add feed-forward networks to create a deep averaging network.

If you have training data, I can recommend this combination: BERT + mean pooling.

This gave the best performance in many cases.

If you have training data but need a low computation time and performance is not that important, choose this combination: GloVe embeddings (or something similar) + mean pooling + 1 or 2 dense layers.

If you don't have training data, choose: GloVe embeddings (or something similar) + mean pooling.

As you can see, there are various options to choose from, depending on whether you have training data and how important high speed vs. good performance is.

Once I have RoBERTa integrated, I will see how suitable it is for the generation of sentence embeddings. My experience with XLNet was that its performance is slightly below that of BERT for sentence embeddings. Maybe RoBERTa is better for sentence embeddings, maybe not.

Averaging BERT outputs without fine-tuning on data gives really poor results. However, what you can of course try is to use one of the existing pretrained BERT models like 'bert-base-nli-mean-tokens', which is BERT + mean pooling, fine-tuned on NLI data to generate meaningful sentence embeddings.
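For the last option, a short sketch of loading such a pretrained model directly (the input texts are placeholders):

from sentence_transformers import SentenceTransformer

# BERT + mean pooling, fine-tuned on NLI data
embedder = SentenceTransformer('bert-base-nli-mean-tokens')
news_embeddings = embedder.encode(['First news title and abstract ...',
                                   'Second news title and abstract ...'])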

Raghavendra15 commented 4 years ago

@nreimers Thank you very much! Why didn't you choose the (word2vec) Google News vectors? Is there any particular reason for choosing GloVe embeddings over word2vec? I'm curious to see how RoBERTa will perform! 😃

nreimers commented 4 years ago

@Raghavendra15 There are two reasons:
1) The Google News word2vec model is quite large; it requires about 12 GB of RAM to read it in. Not that ideal for an example script. GloVe embeddings are about 10 times smaller.
2) In most of my experiments, the Google News word2vec vectors did not yield good performance. GloVe embeddings were often a bit better. I especially like the embeddings by Levy et al. (trained on dependencies) and by Komninos. I also conducted a larger comparison between word embeddings (https://arxiv.org/abs/1707.06799, Table 5).

But note, using the Google news word2vec vectors is quite easy. In training_stsbenchmark_avg_word_embeddings.py replace

word_embedding_model = models.WordEmbeddings.from_text_file('glove.6B.300d.txt.gz')

with

word_embedding_model = models.WordEmbeddings.from_text_file('GoogleNews-vectors-negative300.txt.gz')

First experiments with RoBERTa are done: on STSbenchmark, it increases the Spearman correlation by about 1 - 2 percentage points. I will see how it performs on other datasets.

Best, Nils Reimers

thomwolf commented 4 years ago

This issue is very interesting, thanks for sharing your experiments and framework @nreimers!

Raghavendra15 commented 4 years ago

@nreimers I read your paper on the word embedding comparison. However, when I looked at the scores reported for the STS benchmark, GloVe scored much lower than word2vec. Isn't that contradictory to your paper? Also, in your paper the comparisons are on a certain set of tasks like entity recognition (NER), but not on semantic textual similarity. I don't know much about this, I'm trying to learn. Do my questions make sense?

Is there any significant difference between using glove.840B.300d.zip (vectors trained on 840 billion tokens from Common Crawl) vs glove.6B.300d.txt.gz (vectors trained on 6 billion tokens from Wikipedia + Gigaword)? Is it a case of "the more tokens, the better"? Also, they're trained on different datasets; will that make a huge difference when applied to news similarity?

nreimers commented 4 years ago

See the GloVe website / paper for the differences. 6B was trained on 6 billion tokens from Wikipedia (+ Gigaword), 840B was trained on 840 billion tokens from Common Crawl.

It depends on the task and data which one is more suitable. If you have a lot of rare words, and those play an important role for your task, 840B is often better. If you have clean data / only common words are important for your task, 6B often works better.

However, the differences are often only minor between the two versions.

In my paper, I only compare embeddings for supervised tasks, specifically sequence tagging.

In unsupervised tasks, you can get completely different results. Further, how word embeddings are averaged has a big impact. Some authors don't ignore stop words; instead they propose some complicated weighting scheme. If stop words are ignored, performance can be improved by up to 10 percentage points, sometimes outperforming complex weighting approaches.
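A small sketch of such stop-word-aware averaging, assuming a plain Python dict word_vectors that maps tokens to numpy vectors (e.g. read from a GloVe text file) and NLTK's English stop-word list:

import numpy as np
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def avg_embedding(sentence, word_vectors, dim=300):
    # Average the vectors of all non-stop-word tokens; unknown words are skipped
    tokens = [t for t in sentence.lower().split()
              if t not in stop_words and t in word_vectors]
    if not tokens:
        return np.zeros(dim)
    return np.mean([word_vectors[t] for t in tokens], axis=0)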

Best, Nils Reimers

ghost commented 4 years ago

Thank you for your work, Nils, it is brilliant!

I would like to design a sentence-level semantic search engine using email data (the Enron dataset).

I am still a little bit confused about how I should fine-tune models on such a dataset (maybe I am missing something obvious).

Thanks.

Gogan

nreimers commented 4 years ago

@ggndtes In general, BM25 will be really hard to beat on this type of task. See this paper, where they compare sentence embeddings with BM25 on an end-to-end retrieval task (given a question, find similar / duplicate questions in a large corpus): https://arxiv.org/pdf/1811.08008.pdf

A complex sentence embedding method only achieves a 1 - 2 percentage point improvement over BM25 (Table 2, Dual Encoder Paralex vs. Okapi BM25).

Especially if you have more than just a sentence, a carefully constructed BM25 setup, for example with Elasticsearch, is really hard to beat. If you are interested in a production system, I would highly recommend first trying Elasticsearch (or similar); beating it will be difficult.

Back to your question of how you can tune it. The big question narrows down to: what are your queries, and what are your documents? Are your documents complete emails? Or only email subjects? Or only sentences within emails?

Are your queries inputs from the user, email subjects, or complete emails?

In general, you would need to construct some sort of similarity signal. Currently I can only think of imperfect methods to create similarity labels. One option would be: triplet loss with two emails from the same inbox vs. one random other email as the negative. But I think this would create rather bad embeddings.

Currently I can't think of a good method to create similarity labels for that dataset. And as mentioned, even with perfect labels, it will be really hard to beat BM25.

Best, -Nils Reimers

Raghavendra15 commented 4 years ago

@nreimers The sentence encoder actually takes quite a lot of time to load the GloVe embeddings. Is there a way I can make it load from disk or make it faster?

nreimers commented 4 years ago

@Raghavendra15 When you run the code the first time, the embeddings are downloaded and stored in the path of the script. In follow-up executions, the embeddings file is loaded from disk.

GloVe embeddings are quite large, so loading it can take some time.

There are two ways to speed it up:
1) Limit the vocab size, i.e., don't load all of the ~400k embeddings. Pass the parameter 'max_vocab_size' to the method 'from_text_file' when it is called.
2) Save the WordEmbeddings model to disk. In follow-up executions, you can load the (binary) model directly from disk and don't have to read in and parse the text file.

Should work something like this:

from sentence_transformers.models import WordEmbeddings

word_model = WordEmbeddings.from_text_file('my-glove-file.txt')
word_model.save('my/output/folder/GloveWordModel')

# In follow-up calls, should be faster
word_model = WordEmbeddings.load('my/output/folder/GloveWordModel')

Raghavendra15 commented 4 years ago

@nreimers Wow!! It works blazingly fast! I was trying to play with the below code. Thank you very much for the help :) This is the code in the WordEmbeddings.py file:

 with gzip.open(embeddings_file_path, "rt", encoding="utf8") if embeddings_file_path.endswith('.gz') else open(embeddings_file_path, encoding="utf8") as fIn:
            iterator = tqdm(fIn, desc="Load Word Embeddings", unit="Embeddings")
            for line in iterator:

Also, can I load the model similarly for the BERT pre-trained weights, such as in the below code?

embedder = SentenceTransformer('bert-large-nli-stsb-mean-tokens')

Can I load the above pre-trained weights somehow, just like you have a load method for the GloVe weights?

Is avg embedding with GloVe better than "bert-large-nli-stsb-mean-tokens", the BERT pre-trained model you have loaded in the repository? How's RoBERTa doing? Your work is amazing! Thank you so much again!

nreimers commented 4 years ago

@Raghavendra15 Sure you can:

word_embedding_model = models.WordEmbeddings.from_text_file('glove.6B.300d.txt.gz')

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
model.save('my/output/folder/avg-glove-embeddings')

# Load the Model:
model = SentenceTransformer('my/output/folder/avg-glove-embeddings')

Which model is better depends heavily on your data and on your task. The BERT models work well if you have clean data which is not too domain-specific and rather descriptive. This is due to the nature of the data they were fine-tuned on (the NLI dataset).

Average GloVe embeddings, I think, work better if you have noisy data, really domain-specific data, very short sentences, or very large paragraphs.

Experiments with RoBERTa are finished. The paper will be uploaded to arXiv next week. In my experiments, I could not observe a major difference between BERT and RoBERTa for sentence embeddings: sometimes BERT is a little bit better, sometimes RoBERTa, but nothing significant. XLNet was so far in general worse than BERT.

Best -Nils Reimers

Raghavendra15 commented 4 years ago

@nreimers Thanks! But my question is how I can make the pretrained BERT model faster, i.e., loading the below model:

embedder = SentenceTransformer('bert-large-nli-stsb-mean-tokens')

When I run the encoder for BERT, it takes a lot of time, like 10-15 minutes for 4k sentences:

embedder.encode(corpus)  # this takes around 10 minutes for "bert-large-nli-stsb-mean-tokens"

However, the GloVe model does the job in 30 secs. Is bert-large-nli-stsb-mean-tokens similar to the GloVe pretrained word vectors? If so, is there a way to speed up the BERT sentence encoder?

nreimers commented 4 years ago

@Raghavendra15 No, the BERT model and average GloVe embeddings are completely different.

GloVe embeddings have one vector for each word in a language, for example, the word 'apple' is mapped to the vector 0.31 0.42 0.15 ....

To compute avg. GloVe embeddings, you just perform some memory lookups: every word is mapped to its vector, and then you compute the mean of these vectors.

BERT (https://arxiv.org/abs/1810.04805) is much more complex: words in a sentence are first broken down into subwords, which are then mapped to vectors (this is the fast part).

But after that, a transformer network is run over the complete sentence: BERT-base has 12 layers, BERT-large has 24 layers. This produces vectors for each word that depend on the context of the complete sentence.

Assume you have two sentences that both contain the word 'Apple', one using it as a fruit and one referring to the company.

With GloVe, 'Apple' is mapped in both cases to the same vector. With BERT, the two occurrences of 'Apple' are mapped to different embeddings: in the first sentence, it is mapped closer to words like Banana, Mango etc.; in the second sentence, it is mapped closer to words like Microsoft, Google etc.

But this comes with a cost: Transformer networks are rather slow. This is especially true if you have only a CPU or an older GPU.

On a CPU, you can process about 80 sentences / second with BERT (with GloVe, more than 5k). On an Nvidia V100 GPU, the speed is a bit better: about 2000 sentences / second (BERT-base).

The runtime of transformer networks is quadratic in the sentence length. If your sentence is twice as long, the runtime increases 4x.

So the only ways to speed up the BERT model are: use a GPU, keep the input texts short (because of the quadratic runtime), or switch to a smaller / distilled model.

I hope this is of some help to you.

Best regards -Nils Reimers

julien-c commented 4 years ago

This is an outstanding explanation Nils – you should blog or tweet, I'm sure lots of people would be interested in reading more from you!

Raghavendra15 commented 4 years ago

@nreimers Brilliant explanation! :D You're a life-saver :-) I need your help with this issue: can I use sentence-transformers for this case? https://github.com/huggingface/pytorch-transformers/issues/1170

xiao2mo commented 4 years ago

@nreimers Very patient brilliant explanation. Wish u a happy life.

pertschuk commented 4 years ago

@nreimers Let's say you have a sufficient training set for information retrieval, such as the one from fever.ai.

We used black-box Bayesian optimization to tune BM25 on Elasticsearch, producing results close to those of the SOTA evidence retrieval from the UKP-Athene team, but still a few % off SOTA, without entity extraction or any other ML preprocessing.

Shouldn't a well-trained encoder transformer with cosine loss, with specific weights for a query and for a document / sentence in the result set, be able to beat an arbitrary algorithm like BM25?

And couldn't it be deployed at scale using faiss or HNSW?

nreimers commented 4 years ago

Hi @pertschuk If the recall of BM25 is quite good, I would aim for re-ranking instead of a full semantic search.

In re-ranking, you retrieve e.g. 100 documents with your BM25 algorithm. Then, you run BERT to compare each document with your query to get one score (0...1). Next, you sort these scores.

Your original BM25 ranking is then replaced with a ranking based on these BERT scores.

Sentence embeddings often face challenges in information retrieval, as the false positive probability is higher than with BM25. I.e., if you compare two dissimilar sentences with sentence embeddings, the probability of getting a high similarity score is higher for approaches like Sentence-BERT / InferSent / USE than it is for BM25.

In information retrieval, you usually have a large set of unrelated docs, so this higher false positive rate has really bad consequences: you retrieve many unrelated documents, usually leading to performance lower than BM25.

The re-ranking approach prevents this: BM25 gives you a rather clean candidate set, and your neural re-ranking approach can then do the hard work and determine which of the n documents matches the query best.
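A rough sketch of that re-ranking step (assuming the BM25 candidates are already retrieved, e.g. from Elasticsearch, and assuming a hypothetical BERT checkpoint fine-tuned with a single regression output for query/document relevance; a recent transformers version is assumed for the tokenizer call):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical fine-tuned relevance model with num_labels=1 (regression head)
tokenizer = AutoTokenizer.from_pretrained('my-bert-reranker')
model = AutoModelForSequenceClassification.from_pretrained('my-bert-reranker')
model.eval()

query = 'who invented the telephone'
candidates = ['Alexander Graham Bell was credited with inventing the telephone ...',
              'The 1956 telephone directory lists ...']  # in practice: the top-100 BM25 documents

scores = []
with torch.no_grad():
    for doc in candidates:
        inputs = tokenizer(query, doc, return_tensors='pt', truncation=True, max_length=512)
        scores.append(model(**inputs).logits.item())  # one relevance score per (query, doc) pair

reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]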

Best regards Nils Reimers

pertschuk commented 4 years ago

Great thank you, this makes sense.

We are currently reranking the top 9 documents, but maybe we could increase this number, since our reranking recall is quite high (~.95) based on a RoBERTa regression model.

link: https://github.com/koursaros-ai/koursaros/blob/master/examples/pipelines/factchecking/services/scorer/__main__.py

I guess the challenge then becomes the scale of re-ranking, because there would be ~700 sentences to rerank with this larger set, and we can maybe run 100/s on a SOTA transformer.

I wrote a FEVER dataset loader and am currently training a sentence reranking model based on your cosine loss. I am hoping to achieve the greater performance afforded by precomputing embeddings and running KNN to rerank; I will publish results here when I have them.

nreimers commented 4 years ago

Yes, larger candidate sets can actually be quite interesting.

What you can also try is the faster, distilled BERT (DistilBERT) from Hugging Face. It achieves similar results to BERT but is faster.

Sometimes, a larger candidate set with a worse (but cheaper) model achieves better overall results than a small set with a better (but more expensive) model.

Best -Nils Reimers

duttsh commented 4 years ago


Can this use GPUs? If so, how?

nreimers commented 4 years ago

Hi @duttsh, yes, GPU is supported out of the box. You just need the necessary CUDA drivers, and then you can train / perform inference on the GPU without any changes.
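A short sketch (assuming PyTorch's CUDA check and the device argument available in later sentence-transformers releases):

import torch
from sentence_transformers import SentenceTransformer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
embedder = SentenceTransformer('bert-base-nli-mean-tokens', device=device)
corpus_embeddings = embedder.encode(['Some sentence ...'])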

Best regards Nils Reimers

duttsh commented 4 years ago

Thanks Nils


duttsh commented 4 years ago

@nreimers (Nils) one more question: when will pre-trained RoBERTa models be available? Or if they already are, please send me the name.

nreimers commented 4 years ago

@duttsh I can try to upload it, but in my experiments I didn't see any improvements from RoBERTa for sentence embeddings.

Best regards Nils Reimers

duttsh commented 4 years ago

Thanks, can you please upload it? Also, I believe RoBERTa will increase the accuracy of inference, right?


nreimers commented 4 years ago

@duttsh In my experiments, I didn't observe any differences between BERT and RoBERTa when used for different sentence embedding tasks.

duttsh commented 4 years ago

@nreimers thanks. If you could share the name of your RoBERTa model, that would be great.

pertschuk commented 4 years ago

After a couple months of research, the best approach I've found for building semantic search is to integrate with an existing BM25 search platform such as Elasticsearch, and then rerank the top n results using a neural network regression trained to score query-passage pairs on a dataset such as MS MARCO.

Per @nreimers' comment, something like BM25 produces a cleaner result set, and training a model to look at query-passage pairs at the same time, rather than training a cosine loss and comparing precomputed vectors, enables it to use attention to rank passages more accurately.

Check out this project that implements such a system: https://github.com/koursaros-ai/nboost

ghost commented 4 years ago

Took me a long time to reply but thanks so much @nreimers for your incredibly clear explanations and responses.

Thanks also to @pertschuk for sharing the results of your research, this is very helpful.

wolf-tag commented 4 years ago

Hi, I'm looking into the question of finding prior art for patents. This means that for one patent application (around 20 pages), we would like to find the closest 100 patents in a corpus of 100 million patents. The search results of patent offices could be used as training material. We thought about tf-idf, word2vec, GloVe, etc. So far, transformers like BERT have seemed too slow for such a task. Now, with SBERT and SRoBERTa and powerful AI accelerators, we ask ourselves whether we shouldn't be so quick to exclude transformers. Any advice? Has anyone applied SBERT to such an amount of data? Is anyone using AI accelerators such as Jetson?