Closed: azaismarc closed this issue 8 months ago
Hello!
I think it is indeed possible to improve your model performance by adding some domain knowledge to the base model. However, I think it'll be very difficult to do. The existing top embedding models are already trained with a significant amount of data & are quite strong - they'll be difficult to outperform. Beyond that, SetFit is primarily useful in low-data situations, otherwise finetuning e.g. a RoBERTa classifier will eventually do better. So, if you're interested in investing a lot of time and data, you might want to simply train a BERT-based classifier instead.
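For context, SetFit's sweet spot is a handful of labeled examples per class. A minimal run looks roughly like this (a sketch: the model name is just one common choice and the hotel-review examples are made up):

```python
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

# A tiny, invented few-shot dataset: a few examples per class.
train_dataset = Dataset.from_dict({
    "text": [
        "The street noise kept me awake all night",
        "Thin walls, you hear every door slam",
        "Breakfast buffet was fresh and varied",
        "Great coffee and pastries in the morning",
    ],
    "label": [0, 0, 1, 1],  # 0 = noise, 1 = food
})

# Any Sentence Transformer checkpoint can serve as the body.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = Trainer(
    model=model,
    args=TrainingArguments(batch_size=16, num_epochs=1),
    train_dataset=train_dataset,
)
trainer.train()

print(model.predict(["No double glazing, the road woke me up"]))  # expect: noise (0)
```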
My recommendation is to stick with pretrained models and simply experiment with them.

There are a lot of Sentence Transformer models that you can use: https://huggingface.co/models?library=sentence-transformers&sort=trending
And also a lot of NLI-trained ones: https://huggingface.co/models?library=sentence-transformers&sort=trending&search=nli
The "top" models are reported here: https://huggingface.co/spaces/mteb/leaderboard

But don't get too carried away with the scoring. The top models there are rarely actually the top models for your use case.
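Since leaderboard rank doesn't always transfer to a specific domain, a quick sanity check on a few of your own review sentences can help you pick between candidates. Something like this (the sentence pairs are invented; the two model names are just examples):

```python
from sentence_transformers import SentenceTransformer, util

# Invented hotel-review pairs: the first should score high (same meaning),
# the second low (unrelated topics).
pairs = [
    ("no double glazing", "there is a noise problem from outside"),
    ("great coffee at breakfast", "there is a noise problem from outside"),
]

for name in ["sentence-transformers/all-MiniLM-L6-v2",
             "sentence-transformers/all-mpnet-base-v2"]:
    model = SentenceTransformer(name)
    for a, b in pairs:
        sim = util.cos_sim(model.encode(a), model.encode(b)).item()
        print(f"{name}: {sim:.3f}  {a!r} vs {b!r}")
```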
Thank you for your response!
However, I'm just a poor French PhD student, and training a supervised classifier with a BERT-based model on a large amount of labeled data is too time-consuming for me and too expensive for my laboratory :(
Don't worry, I already have a simple unsupervised method for text classification based on word embeddings, thanks to this paper, for extracting general themes like noise, food, staff, location, etc.
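Roughly, it works like this (a simplified sketch, not the paper's exact method: the seed words are invented, and I use sentence embeddings here for brevity where the paper works with word embeddings):

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical seed words per theme.
themes = {
    "noise": ["noise", "loud", "quiet"],
    "food": ["breakfast", "meal", "restaurant"],
    "staff": ["staff", "reception", "service"],
    "location": ["location", "center", "close"],
}

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
theme_embs = {t: model.encode(words) for t, words in themes.items()}

def classify(sentence: str) -> str:
    emb = model.encode(sentence)
    # Score each theme by its best-matching seed word, keep the max.
    return max(themes, key=lambda t: util.cos_sim(emb, theme_embs[t]).max().item())

print(classify("impossible to sleep because of the traffic"))  # noise
```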
I would like to extract more complex labels like "the noise comes from the hotel", and SetFit with a Sentence Transformer fine-tuned on my "RNLI" dataset seemed like a good idea, because there are so many ways to write and interpret the same opinion: "no double glazing" and "road is a problem for sleep" both express the idea that noise from outside is a problem. Furthermore, annotators can easily mislabel these opinions.
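For the RNLI fine-tuning itself, I was thinking of the classic NLI recipe for Sentence Transformers (SoftmaxLoss over entailment/neutral/contradiction); roughly like this, where the premise/hypothesis pairs below are invented examples and the output path is hypothetical:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Usual NLI label convention: 0 = entailment, 1 = neutral, 2 = contradiction.
train_examples = [
    InputExample(texts=["no double glazing", "noise from outside is a problem"], label=0),
    InputExample(texts=["the road is a problem for sleep", "the breakfast was cold"], label=1),
    InputExample(texts=["very quiet room", "noise from outside is a problem"], label=2),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("rnli-minilm")  # hypothetical output path
```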
So, thanks again for your answer; I will run some tests!
One last question: I couldn't find any information about what data these "best" models are trained on.
Hi!
I'm interested in using SetFit to classify text extracted from hotel reviews (Booking, Tripadvisor, etc.), but I would like to add domain knowledge to my Sentence Transformers body.
For example, this paper uses a Sentence Transformers model trained on a custom NLI dataset (RNLI, for Review Natural Language Inference) to extract product features without training on labeled data. The results show that training on a domain-based NLI dataset works better than MNLI for zero-shot aspect extraction.
So, is it a good approach to train my own Sentence Transformers model (or fine-tune a pre-trained one) on a domain-based NLI dataset to improve the performance of SetFit?
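Concretely, what I have in mind is something like this (a sketch; "rnli-minilm" is a hypothetical local path to a Sentence Transformer I would first fine-tune on my RNLI pairs):

```python
from datasets import Dataset
from setfit import SetFitModel, Trainer

# Load the domain-adapted Sentence Transformer as the SetFit body.
model = SetFitModel.from_pretrained("rnli-minilm")

# Then only a handful of labeled review snippets would be needed.
train_dataset = Dataset.from_dict({
    "text": ["the street was so loud at night", "lovely breakfast buffet"],
    "label": [0, 1],
})

trainer = Trainer(model=model, train_dataset=train_dataset)
trainer.train()

print(model.predict(["no double glazing in the rooms"]))
```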
Thank you in advance