huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

How to use BERT for finding similar sentences or similar news? #876

Closed Raghavendra15 closed 3 years ago

Raghavendra15 commented 5 years ago

I have used BERT NextSentencePredictor to find similar sentences or similar news, but it's super slow, even on a Tesla V100, currently the fastest GPU. It takes around 10 seconds for one query title against around 3,000 articles. Is there a better way to use BERT for finding similar sentences or similar news given a corpus of news articles?

pertschuk commented 4 years ago

@wolf-tag These were the two biggest issues in my research into building a transformer cosine-loss solution based on SBERT at scale (I was working with ~6 million Wikipedia articles, much smaller than all of the patents):

  1. Evaluating the solution. Rebuilding all 6 million vectors (at ~200/s to encode) and putting them into a FAISS index takes something like 6-8 hours, and then more time to actually query your test set and compute a metric like MRR. Building a good model often requires dozens of evals, tweaking, etc.
  2. Memory usage. There are various compression methods, but currently vector indexes are pretty memory-hungry; https://github.com/facebookresearch/faiss/wiki/Indexing-1G-vectors discusses possible solutions. To make this scalable, you would probably need much smaller embeddings than the 1024 dimensions of BERT-large (see the sketch after this list).
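
To give a rough sense of the memory trade-off, a sketch (random stand-in vectors; numbers assume 1024-dim float32 embeddings, and the IVF-PQ parameters are just example values):

    import numpy as np
    import faiss

    d = 1024                 # e.g. BERT-large sized embeddings
    n = 6_000_000            # ~6M articles
    print(f"Flat float32 index: ~{n * d * 4 / 1e9:.1f} GB of raw vectors")   # ~24.6 GB

    # IVF-PQ stores 64 bytes per vector instead of 4096, trading some recall for memory.
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFPQ(quantizer, d, 1024, 64, 8)   # nlist=1024, 64 sub-quantizers, 8 bits each
    train_vecs = np.random.rand(100_000, d).astype("float32")  # stand-in for real embeddings
    index.train(train_vecs)
    index.add(train_vecs)
    print(f"IVF-PQ codes: ~{n * 64 / 1e9:.1f} GB for the same 6M vectors (plus index overhead)")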

If you are well funded and have lots of GPU/TPU and memory, it's feasible, and I would look at Patent-BERT and incorporate that into sentence-transformers.

One final thought to keep in mind: I have found that almost everything out there, patents included, has a summary (abstract). At an even more micro scale, humans often tend to summarize a paragraph with its first sentence. You can leverage this to optimize your solution by looking at the summary text instead of all of it.

wolf-tag commented 4 years ago

Thank you for your quick reply.

Did I get it right: you used one vector for each wikipedia article which is the result of SBERT's pooling operation?

nreimers commented 4 years ago

Hi @wolf-tag Personally, I think tf-idf / BM25 is the best strategy for your task, for various reasons.

First, it is important to differentiate between false positive and false negative rates. False positive: a non-similar pair of docs is judged as similar. False negative: a similar pair of docs is judged as dissimilar.

TF-IDF/BM25 has a low false positive rate and a high false negative rate, i.e., if a pair is judged as similar, there is a high chance that they are actually similar.

Sentence embedding methods (avg. GloVe embeddings, InferSent, USE, SBERT, etc.) have the reverse characteristic: high false positive rates, low false negative rates. They seldom miss a similar pair, but a pair judged as similar is not necessarily similar.

For information retrieval, you have an extreme imbalance: you have 1 search query and 100 million documents, i.e., you perform 100 million pairwise comparisons.

Sentence embeddings with a high false positive rate will return many pairs where the embeddings think they are similar but they are not. Your result set of 10 documents will often be complete garbage.

TF-IDF / BM25 might miss some relevant documents, but the 10 documents you do find will be of high quality.

Second, in my experience, sentence embedding methods work best for sentences. For (longer) documents, the results are often not that great. Here, word overlap (with tf-idf / BM25) is really hard to beat.

Third: in our experiments on question answering (given a question, find the correct answer among millions of answers on StackOverflow), TF-IDF / BM25 is extremely hard to beat. It often performs much better than sentence embedding methods, and it is much quicker.

So far, our experiments with end-to-end representation learning for information retrieval have rather failed.

What works quite well is a re-ranking approach: you use BM25 to retrieve the top 100 documents. Then you take a neural approach like BERT to re-rank these 100 results, and you present the top 10 results (the 10 with the highest score according to the neural re-ranker) to the user. This often gives a nice boost over pure BM25 ranking, and the runtime is not too bad, as you only have to re-rank 100 documents.
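
A minimal sketch of that pipeline, assuming the rank_bm25 package for the BM25 step and one of the pre-trained cross-encoder re-rankers that sentence-transformers ships these days (the model name is an assumption; any trained re-ranker works):

    from rank_bm25 import BM25Okapi
    from sentence_transformers import CrossEncoder

    corpus = ["BM25 is a bag-of-words ranking function ...",
              "BERT is a transformer-based language model ...",
              "Khachapuri is a Georgian cheese bread ..."]          # toy corpus

    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

    query = "how does bm25 rank documents"
    bm25_scores = bm25.get_scores(query.lower().split())
    top_k = sorted(range(len(corpus)), key=lambda i: -bm25_scores[i])[:100]   # cheap candidate retrieval

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")           # neural re-ranking step
    rerank_scores = reranker.predict([(query, corpus[i]) for i in top_k])
    top_10 = [top_k[j] for j in sorted(range(len(top_k)), key=lambda j: -rerank_scores[j])[:10]]
    print([corpus[i] for i in top_10])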

Best regards Nils Reimers

wolf-tag commented 4 years ago

Dear Nils,

thank you for your detailed explanation.

Indeed, recent publications on the prior-art task hardly show any improvements when using word2vec, GloVe or doc2vec compared to tf-idf.

I was just curious because Google now uses BERT for its search engine, and I suppose they are more interested in high precision than in high recall, so somehow they seem to handle the high false positive rates. Maybe they do so as you recommended (re-ranking).

I hoped that one of the newer methods would somehow have a positive impact on this task. Just wishful thinking, I fear.

nreimers commented 4 years ago

Hi @wolf-tag an interesting paper could be this: https://arxiv.org/abs/1811.08008

In Table 2 you can see that BM25 outperforms untrained sentence embedding methods like avg. word2vec. If you have a lot of training data, you can tune the dual encoder so that it performs better than BM25 for the tested task (finding similar questions).

However, the task of finding similar questions involves rather short documents (often only a sentence). For longer documents, I would guess that BM25 still outperforms sentence embedding methods.

In the paper it would have been interesting to also compare the methods against neural re-ranking, to see whether the trained end-to-end retrieval is better or worse than the BM25 + re-ranking approach.

Best regards Nils Reimers

wolf-tag commented 4 years ago

Thank you for the hint.

If TF-IDF / BM25 is still the best option for long documents, there seems to be a lot of room for improvement in future research, as this method does not use context, does not follow any semantic approach such as WSD, WordNet or synsets, neither uses trained models nor exploits available training data, and does not use any language-specific resources (e.g. stemming or noun-phrase identification). Maybe some kind of challenge is needed to encourage research in this field.

Best regards,

Wolfgang

realsergii commented 4 years ago

Hey @nreimers deep thanks for all the info! (Hopefully) quick question: what would be the optimal setup to find similarities (and build a search engine) between objects defined by a combination of senses?

For instance, consider a DB:
Object 1: "pizza", "street food", "Italian cuisine"
Object 2: "khachapuri", "street food", "Georgian cuisine", "cheese", "bread"

And then, a query "cheesy street food".

I'm using USE + hnswlib now, and it works pretty well, but only if the query string is more than one word. The more words, the better.

nreimers commented 4 years ago

Hi @realsergii I'm not sure USE is the best match for that task. From the given example, I would again think that you would get quite far with BM25 and, for example, Elasticsearch. Elasticsearch is great for indexing complex objects and searching over them.

Of course you would need to tune the search a bit, e.g. so that longer n-grams give higher scores, maybe combined with stemming / lemmatization of words.
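
Roughly, that Elasticsearch route could look like this (a sketch assuming Elasticsearch 7.x and the elasticsearch-py client of that generation; the built-in "english" analyzer adds lowercasing and stemming):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    es.indices.create(index="objects", body={
        "mappings": {"properties": {"tags": {"type": "text", "analyzer": "english"}}}
    })
    es.index(index="objects", id=1, body={"tags": "pizza street food Italian cuisine"})
    es.index(index="objects", id=2, body={"tags": "khachapuri street food Georgian cuisine cheese bread"})
    es.indices.refresh(index="objects")

    # BM25 scoring over the analyzed "tags" field
    res = es.search(index="objects", body={"query": {"match": {"tags": "cheesy street food"}}})
    for hit in res["hits"]["hits"]:
        print(hit["_score"], hit["_source"]["tags"])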

Otherwise, for individual words, I think word embeddings (like word2vec / GloVe) are great. Sentence embeddings often have difficulty giving a good representation for single words or short phrases, as these systems were not trained for that.

Also this repo could be interesting, which combines Elasticsearch BM25 with BERT re-ranking: https://towardsdatascience.com/elasticsearch-meets-bert-building-search-engine-with-elasticsearch-and-bert-9e74bf5b4cf2

This could potentially also be combined with a simple average word embedding re-ranking approach.

I hope that helps.

Best Nils Reimers

realsergii commented 4 years ago

@nreimers thanks Nils! Just one more clarification: what would change if, in my DB, I replaced each word/phrase with the first sentence of the Wikipedia entry that is closest to the respective word/phrase? In that case, would USE or SBERT be a good choice?

nreimers commented 4 years ago

Hi @realsergii That sounds a bit complicated, and you would have several other issues (how to find the correct article, what about small spelling variances).

Word embeddings are quite strong at finding similar words. As the context is rather small, I don't see much benefit in using a sentence embedding method to disambiguate words. 'Cheese' in your context will most often refer to the food, and not to e.g. a company or a strategy in a computer game.

Best Nils Reimers

realsergii commented 4 years ago

Thanks @nreimers My idea is not just to find similar words/phrases, but to find similar senses. E.g. "welding" is similar to "joining" and "building", in my understanding. In order to comprehend this, a machine needs to know what all of those concepts are, described in more basic words, right? One way to teach a machine is to create a vector from a sentence where the sense is described by Wikipedia (and thus in more basic concepts). The other way is to just get a sense (as a vector) for a word/phrase from a model trained on Wikipedia and other sources. This is my understanding.

Please suggest what sounds better.

nreimers commented 4 years ago

Hi @realsergii That is exactly what word embeddings are great for: finding similar words, e.g. welding is similar to joining / building.

Mapping words to Wikipedia definitions sounds unnecessarily complicated, and I doubt you would get good results with it (compared to simple word embeddings). In the end, as you have a fixed word-to-Wikipedia-article mapping, you will get a fixed word -> vector mapping. But it is much more complicated and the quality will be much lower.

I would train word2vec / GloVe on a large amount of text from your domain, and then you can use these word embeddings for comparing word similarities.
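
A minimal gensim sketch of that (the tiny corpus here is just a placeholder for your domain text; gensim >= 4.0 uses vector_size, older versions call it size):

    from gensim.models import Word2Vec

    # Placeholder corpus: in practice, an iterable over tokenized sentences from your own domain.
    corpus = [
        ["welding", "is", "a", "process", "for", "joining", "metal", "parts"],
        ["the", "parts", "were", "joined", "by", "welding", "or", "brazing"],
        ["building", "the", "frame", "requires", "welding", "and", "bolting"],
    ]

    model = Word2Vec(sentences=corpus, vector_size=300, window=5, min_count=1, workers=4)
    print(model.wv.most_similar("welding", topn=5))   # nearest neighbours in the embedding space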

pohanchi commented 4 years ago

That's why you use word embeddings: tf-idf just finds lexical overlap, not similar semantic meaning.


realsergii commented 4 years ago

Thanks! So it feels like Elasticsearch 7.3+ with a bunch of dense_vectors from GloVe for all components of my objects (e.g. "khachapuri", "street food", "Georgian cuisine", "cheese", "bread") is the most appropriate data structure for my system (a semantic search engine).

I don't even need MySQL (for text representation storage) plus a separate index based on e.g. Faiss (for vectors and the index), and I don't need to sync them. Everything can live inside Elasticsearch (the query speed will be slower than Faiss, but I can live with that for now).
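
Something like this is what I have in mind (a sketch assuming Elasticsearch 7.3+ dense_vector with script_score, and the packaged average-GloVe model from sentence-transformers as the embedder, which is an assumption):

    from elasticsearch import Elasticsearch
    from sentence_transformers import SentenceTransformer

    es = Elasticsearch()
    embedder = SentenceTransformer("average_word_embeddings_glove.6B.300d")   # 300-dim avg. GloVe

    es.indices.create(index="objects_vec", body={
        "mappings": {"properties": {
            "tags": {"type": "text"},
            "vector": {"type": "dense_vector", "dims": 300},
        }}
    })

    doc = "khachapuri street food Georgian cuisine cheese bread"
    es.index(index="objects_vec", id=1, body={"tags": doc, "vector": embedder.encode(doc).tolist()})
    es.indices.refresh(index="objects_vec")

    query_vec = embedder.encode("cheesy street food").tolist()
    res = es.search(index="objects_vec", body={"query": {"script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "cosineSimilarity(params.query_vector, 'vector') + 1.0",   # cosine ranking
            "params": {"query_vector": query_vec},
        },
    }}})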

@nreimers just to clarify: I don't even need to look in the BERT/USE direction, right?

nreimers commented 4 years ago

@realsergii If your queries and documents are only words or short phrases, I think there is no benefit from using BERT / USE. Sentence embeddings can be helpful when you have more text (at least a sentence) and when words can be used ambiguously (like the word apple).

pertschuk commented 4 years ago

@realsergii I mentioned it earlier in this thread, but if you're using Elasticsearch as your backend, check out NBoost, which acts as a proxy on top of ES and uses BERT to re-rank the top n results.

We recently released TinyBERT-distilled versions of the base models, which are about 10x faster (critical when it comes to search). See https://arxiv.org/abs/1909.10351 for distilling custom models with the same method.

mohammedayub44 commented 4 years ago

@Raghavendra15 @nreimers Did you end up trying it out with Faiss? What were the results? I have a similar use case: I have a domain dataset (about 100k English sentences) related to fires, and I want to find synthetic multilingual sentences in different languages (Arabic, Italian, Chinese, etc.). My thought was to download the Wikipedia corpus (source) for each language and embed both Wikipedia and my fire data to find such sentences.

By following this example (Semantic Similarity):

  • I was able to download the multilingual trained model distiluse-base-multilingual-cased
  • Embed about 4.2 million Arabic sentences (took about 7 hrs on a p2.xlarge instance, 85% GPU utilization) and 100k fire sentences (took a couple of minutes)

This is where it hangs / is very slow:

  • Running the similarity using cdist seems to run forever; I had to cancel after running it for a day. I did not expect it to take this long, even though it was very straightforward. I figure there should be a more optimized way of doing this.

Is there something wrong with the steps I have taken? I appreciate any help.

Cheers! Ayub

mohammedayub44 commented 4 years ago

Update: I ran it with the faiss library using a flat index (as it gives the most accurate results). On a p2.xlarge instance it was amazingly fast: building and searching took only 30 minutes. I could only compare the results to scipy's cdist for a sample of 10,000, but I saw that >90% of the results lie in the top-5 matches found by faiss distance.
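
For reference, the setup was roughly this (a sketch with placeholder sentences; normalizing the vectors makes the inner product equal to cosine similarity):

    import numpy as np
    import faiss
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("distiluse-base-multilingual-cased")

    corpus_sentences = ["first Arabic sentence ...", "second Arabic sentence ..."]  # stand-in for 4.2M sentences
    query_sentences = ["a sentence about fires ..."]                                # stand-in for the 100k fire sentences

    corpus_emb = np.asarray(model.encode(corpus_sentences, batch_size=256, show_progress_bar=True),
                            dtype="float32")
    faiss.normalize_L2(corpus_emb)

    index = faiss.IndexFlatIP(corpus_emb.shape[1])       # exact (flat) inner-product index
    index.add(corpus_emb)

    query_emb = np.asarray(model.encode(query_sentences), dtype="float32")
    faiss.normalize_L2(query_emb)
    scores, ids = index.search(query_emb, 5)             # top-5 corpus matches per query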

peacej commented 4 years ago

@Raghavendra15 @mohammedayub44 FYI, two relevant papers that recently came out from Google and Microsoft:

waayadi commented 4 years ago

How to train BERT on LinkedIn pages?

ruthwik081 commented 4 years ago

@nreimers I am working on a use case where I need to get similar documents (2-3 pages on average) when I upload a 1-page document. For me, reducing false negatives is a priority, but at the same time I don't want too many false positives. Can I first use an embedding model to get, let's say, 200 similar documents and then apply TF-IDF/BM25 to filter out irrelevant documents?

woiza commented 4 years ago

I just recently started with NLP and "AI" and have been following this thread. Having a similar use case (fewer than 10k documents --> find similar documents and also do multi-label classification), I am very interested in your opinion on BERT-AL:

https://openreview.net/pdf?id=SklnVAEFDB

timpal0l commented 4 years ago

@mohammedayub44 Did you consider using XLM-R for your multilingual approach? (It generates language-independent embeddings for semantic similarity.)

nreimers commented 4 years ago

@timpal0l I tested XLM-R for multilingual sentence embeddings.

If used out-of-the-box (without further fine-tuning), the results are really bad, far worse than mBERT (mBERT is also really bad without fine-tuning).

The vector spaces for XLM-R are not aligned across languages, i.e. the same sentence in two different languages is mapped to completely different points in vector space.

However, when fine-tuned, you can get quite nice results with XLM-R for cross-lingual tasks. Currently I am preparing some code + paper + models, which will be released soon in the sentence-transformers repository.

Best Nils Reimers

timpal0l commented 4 years ago

@nreimers Thanks for your reply!

I see. I have an unlabelled corpus consisting of several languages on which I wish to fine-tune XLM-R (just update the language model's weights to get more domain-specific embeddings), not a downstream task like classification.

I can't seem to find any example code for doing this. Have you managed to do this with XLM-R using HuggingFace? Could you give me any pointers?

Cheers

nreimers commented 4 years ago

Hi @timpal0l I think this is the file you need https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py

I haven't tested it myself.

Best Nils Reimers

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ajitsingh98 commented 4 years ago

Hi @nreimers, I am using sentence-transformers for finding Stack Overflow duplicate questions. I want to train the model from scratch, but I am facing some issues. My training set contains only questions and their duplicates. Is it possible to train the model on this type of training data?

nreimers commented 4 years ago

Hi @sajit9285 Just positive examples won't work. You somehow need to teach the network what is similar and what is not.

But usually that is not an issue, as getting negative pairs is quite easy. The simplest strategy is to just sample two questions randomly. In 99.9999% of the cases they are non-duplicates and get the negative label.

A better strategy is to use hard negatives, as with the random strategy your negatives are too easy to spot. One option would be to sample another random question with the same Stack Overflow tag and treat it as a negative. Another is to find a similar question with Elasticsearch BM25 and assume that it is a negative example.
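
A rough sketch of building such training pairs for sentence-transformers (toy data; float labels 1.0 / 0.0 here, suitable for e.g. cosine-similarity-style training):

    import random
    from sentence_transformers import InputExample

    duplicate_pairs = [
        ("How do I reverse a list in Python?", "Reversing a list in python"),
        ("What causes a segmentation fault?", "Why am I getting a segfault?"),
    ]
    all_questions = [q for pair in duplicate_pairs for q in pair] + [
        "How do I merge two dictionaries?",
        "What is the difference between TCP and UDP?",
    ]

    train_examples = []
    for q, dup in duplicate_pairs:
        train_examples.append(InputExample(texts=[q, dup], label=1.0))        # positive pair
        neg = random.choice([x for x in all_questions if x not in (q, dup)])  # easy random negative
        train_examples.append(InputExample(texts=[q, neg], label=0.0))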

ajitsingh98 commented 4 years ago

@nreimers Thanks for your reply. I will try the methods you stated. I used a word2vec averaging method and sentence-transformers with a pretrained model ('bert-base-nli-mean-tokens') for ranking similar questions, and I found that the word2vec averaging method (for sentence embeddings) performed better. Maybe the data has lots of tech terms! That's why I am thinking of training the model from scratch.
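
For reference, the word2vec-averaging baseline I mean is basically this (a sketch using gensim's downloader; any pre-trained KeyedVectors would do):

    import numpy as np
    import gensim.downloader as api

    wv = api.load("word2vec-google-news-300")      # pre-trained word vectors

    def avg_embedding(sentence):
        # Average the vectors of all in-vocabulary tokens (crude but surprisingly strong baseline).
        tokens = [t for t in sentence.lower().split() if t in wv]
        return np.mean([wv[t] for t in tokens], axis=0) if tokens else np.zeros(wv.vector_size)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(avg_embedding("How do I sort a list in Python?"),
                 avg_embedding("Sorting a python list")))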

nreimers commented 4 years ago

@sajit9285 Yes, the NLI data sadly does not contain any computer science / programming specific examples, so it does not learn these terms. word2vec is trained on a much wider range of topics, so it has an understanding of programming terms.

ajitsingh98 commented 4 years ago

@nreimers So will it work if I train it from scratch as stated in that GitHub repo?

nreimers commented 4 years ago

@sajit9285 As always, it depends on the quality of your training data. But I have seen quite good improvements for domain-specific terms / sentences when training on appropriate data.

timpal0l commented 4 years ago

@sajit9285 Is it not better to use the existing weights as a base, rather than train something from scratch?

ajitsingh98 commented 4 years ago

@nreimers I will give a try. Thanks :)

ajitsingh98 commented 4 years ago

@timpal0l Yeah, of course, they are always better than random weights.

cabal-chan commented 4 years ago

@nreimers You are a beast! A lot of questions I had were addressed on here!

SageAgastya commented 4 years ago

@nreimers I have tried a lot of things, replacing the AllNLI files with my own dataset files in the same format. I have also changed the labels (inside the nliReader class's member function named get_labels) from 3 labels (contradiction, neutral, entailment) to two labels (true, false) for my task. But it is still printing those three labels and is unable to detect my dataset. I have tried a lot, but I need your help now. The task I am trying to perform is fine-tuning BERT with paired paragraphs/sentences as input.

njsdias commented 4 years ago

Hello @nreimers. I ran all the models available at https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/

only using

model = SentenceTransformer(model_name)

The model that gave the best results was distiluse-base-multilingual-cased. The results are very similar to USE (Universal Sentence Encoder).

My questions are:

  1. How can I improve the results of distiluse-base-multilingual-cased without cleaning all the weird text cases that I have in the dataset?
  2. Should I explore fine-tuning parameters? How can I do that?
  3. Should I add more layers after the final layer? What are your suggestions?
  4. Is there any way to use a pre-trained GloVe model with SBERT? If yes, how can I do that?

I want to understand how I can navigate your beautiful SBERT to make small/easy modifications that can bring me better results.

Thank you for your help.

desaibhargav commented 4 years ago

Hi, BERT out-of-the-box is not the best option for this task, as the run-time in your setup scales with the number of sentences in your corpus. I.e., if you have 10,000 sentences/articles in your corpus, you need to classify 10k pairs with BERT, which is rather slow.

A better option is to generate sentence embeddings: every sentence / article is mapped to a fixed-size vector. You need to map your 3k articles to vectors only once.

A new query is then also mapped to a vector. In this setup, you only need to run BERT for one sentence (at inference), independent of how large your corpus is.

Then you can use cosine similarity or Manhattan / Euclidean distance to find the sentence embeddings that are closest, i.e. the most similar.

I released today a framework which uses pytorch-transformers for exactly that purpose: https://github.com/UKPLab/sentence-transformers

I also uploaded an example for semantic search, where each sentence in a corpus is mapped to a vector and then cosine similarity is used to find the most similar sentences / vectors: https://github.com/UKPLab/sentence-transformers/blob/master/examples/application_semantic_search.py

Let me know if you have further questions.

Hi there! Phenomenal work! I just had one question: how do transformer encodings (say BERT) compare against encodings from models like Google's Universal Sentence Encoder on a textual semantic similarity task?

nreimers commented 4 years ago

Hi @algoromeo Universal Sentence Encoder (USE) spans several different architectures. USE-large is based on transformer networks like BERT, i.e., the architectures are quite comparable. A big advantage of BERT is the language-model pre-training, which induces a lot of information about language into the model. This pre-training is missing in USE. USE also has CNN variants, which are faster and whose runtime scales better with the input length, but their performance is usually worse than the transformer-based architectures. So you trade some accuracy for speed.

desaibhargav commented 4 years ago

Thank you for your timely and apt reply! Gave me the much needed clarity! Cheers!

SamALIENWARE commented 4 years ago

Hi @nreimers, thank you for your detailed explanations on many issues around Sentence-BERT and semantic textual similarity search. I am currently working on a social science project in which I am trying to measure the "cultural distinctiveness" of Reddit users (basically whether people differ from each other in how they comment) based on their comments in certain posts.

I am thinking of treating all comments of each user as one document. Hopefully, I could obtain document embeddings using sentence-transformers. Alternatively, I could use GloVe or Latent Semantic Analysis as embeddings of the documents. After that, I am also hoping to compare each individual with the collectives he/she belongs to, i.e. comparing text generated by one user against text generated by a group of pre-defined people (and doing that iteratively for every user in the dataset). Do you think Sentence-BERT is a suitable method to embed documents? Could you recommend any work related to what I am trying to do, please? Thank you!

nreimers commented 4 years ago

Hi @SamALIENWARE I am afraid that Sentence-BERT is not suitable for that.

BERT (& co.) have quadratic runtime and quadratic memory requirements with respect to the text length. I.e., for long documents you would need an extremely large amount of memory and have extremely long runtimes. This is why BERT & co. limit the input length to 512 word pieces, which is about 300 words.

For your purpose I would use avg. GloVe embeddings (which are already implemented in the sentence-transformers project) or LSA/LDA (e.g. from Gensim).
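
For the LSA route, a minimal gensim sketch would look like this (the documents are placeholders; the avg. GloVe route is a one-liner via the packaged average_word_embeddings_glove.6B.300d model, name assumed):

    from gensim import corpora, models

    documents = ["all comments of user one concatenated ...",
                 "all comments of user two concatenated ..."]       # one string per Reddit user

    tokenized = [doc.lower().split() for doc in documents]
    dictionary = corpora.Dictionary(tokenized)
    bow = [dictionary.doc2bow(doc) for doc in tokenized]

    tfidf = models.TfidfModel(bow)
    lsi = models.LsiModel(tfidf[bow], id2word=dictionary, num_topics=300)    # LSA document vectors

    doc_vecs = [lsi[tfidf[d]] for d in bow]      # sparse (topic_id, weight) lists, one per user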

Best Nils Reimers

SamALIENWARE commented 4 years ago

Thanks a million, @nreimers ! I will definitely try your suggestions out.

I tried distilled Sentence-BERT out yesterday. Perhaps because there isn't that much data (19,000+ users in my dataset), the "sentence" embeddings were computed in a relatively short time. Then I used k-means clustering on the embeddings and calculated the sum of the distances of each vector to the centroids of the clusters. I am thinking that the larger the sum, the more "distinct" the user's content is, since it's semantically far from everyone else's.
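
Concretely, what I did is roughly this (a sketch, with random stand-in vectors in place of the real SBERT embeddings):

    import numpy as np
    from sklearn.cluster import KMeans

    embeddings = np.random.rand(19000, 512).astype("float32")    # stand-in: one vector per user

    kmeans = KMeans(n_clusters=20, random_state=0).fit(embeddings)
    dist_to_centroids = kmeans.transform(embeddings)              # (n_users, n_clusters) Euclidean distances
    distinctiveness = dist_to_centroids.sum(axis=1)               # the "sum of distances" score

    most_distinct_users = np.argsort(-distinctiveness)[:10]       # indices of the 10 most "distinct" users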

So after I get embeddings using GloVe or LSA/LDA, do you think the Euclidean distance to the k-means centroids is a good representation of semantic textual similarity in a non-pairwise situation (1 vs. many)? Or is it better to stick to cosine similarity (calculate pairwise cosine similarities and then average), as the embedding models are trained with this metric?

Thank you again for your valuable time. I do appreciate it. Have a nice day!


saurabhsaxena86 commented 4 years ago

@nreimers Brilliant work!!! I just wanted to understand: for evaluation we are using the STS benchmark, but when we have domain-specific data, do we still need STS, or can we split our data into train and test sets and evaluate on those?

nreimers commented 4 years ago

Hi @saurabhsaxena86 No, in that case you don't need STS. If your domain-specific data is suitable, you can of course train and evaluate on that.
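
A minimal sketch of evaluating on a held-out split of your own domain pairs with sentence-transformers (the pair data here is made up; EmbeddingSimilarityEvaluator takes two sentence lists plus gold similarity scores):

    import random
    from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

    pairs = [("sentence a from your domain", "sentence b from your domain", 0.9),
             ("another domain sentence", "an unrelated sentence", 0.1)]   # (sent1, sent2, gold score in [0, 1])

    random.shuffle(pairs)
    split = int(0.8 * len(pairs))
    train_pairs, dev_pairs = pairs[:split], pairs[split:]

    dev_evaluator = EmbeddingSimilarityEvaluator(
        [a for a, b, s in dev_pairs],
        [b for a, b, s in dev_pairs],
        [s for a, b, s in dev_pairs],
    )
    # pass `evaluator=dev_evaluator` to model.fit(...) instead of the STS benchmark evaluator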

Shubhamsaboo commented 4 years ago

Hey @nreimers, I am a bit confused about how to go about training the model from scratch on my dataset. Is there some resource I can refer to? I am having a hard time figuring out how to create the dataloader and train the model on specific data.

Shubhamsaboo commented 4 years ago

Hi @saurabhsaxena86, can you please share the code for how you trained the model on your domain-specific data? That would be of great help!