Closed: azaismarc closed this issue 8 months ago
Hello!
I think it is indeed possible to improve your model performance by adding some domain knowledge to the base model. However, I think it'll be very difficult to do. The existing top embedding models are already trained with a significant amount of data & are quite strong - they'll be difficult to outperform. Beyond that, SetFit is primarily useful in low-data situations, otherwise finetuning e.g. a RoBERTa classifier will eventually do better. So, if you're interested in investing a lot of time and data, you might want to simply train a BERT-based classifier instead.
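For context, SetFit's sweet spot is a handful of labeled examples per class. A minimal run looks roughly like this (a sketch: the model name is just one common choice and the hotel-review examples are made up):

```python
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

# A tiny, invented few-shot dataset: a few examples per class.
train_dataset = Dataset.from_dict({
    "text": [
        "The street noise kept me awake all night",
        "Thin walls, you hear every door slam",
        "Breakfast buffet was fresh and varied",
        "Great coffee and pastries in the morning",
    ],
    "label": [0, 0, 1, 1],  # 0 = noise, 1 = food
})

# Any Sentence Transformer checkpoint can serve as the body.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = Trainer(
    model=model,
    args=TrainingArguments(batch_size=16, num_epochs=1),
    train_dataset=train_dataset,
)
trainer.train()

print(model.predict(["No double glazing, the road woke me up"]))  # expect: noise (0)
```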
My recommendation is to stick with pretrained models and simply experiment with them.

There are a lot of Sentence Transformer models that you can use: https://huggingface.co/models?library=sentence-transformers&sort=trending
And also a lot of NLI-trained ones: https://huggingface.co/models?library=sentence-transformers&sort=trending&search=nli
The "top" models are reported here: https://huggingface.co/spaces/mteb/leaderboard

But don't get too carried away with the scoring. The top models there are rarely actually the top models for your use case.
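Since leaderboard rank doesn't always transfer to a specific domain, a quick sanity check on a few of your own review sentences can help you pick between candidates. Something like this (the sentence pairs are invented; the two model names are just examples):

```python
from sentence_transformers import SentenceTransformer, util

# Invented hotel-review pairs: the first should score high (same meaning),
# the second low (unrelated topics).
pairs = [
    ("no double glazing", "there is a noise problem from outside"),
    ("great coffee at breakfast", "there is a noise problem from outside"),
]

for name in ["sentence-transformers/all-MiniLM-L6-v2",
             "sentence-transformers/all-mpnet-base-v2"]:
    model = SentenceTransformer(name)
    for a, b in pairs:
        sim = util.cos_sim(model.encode(a), model.encode(b)).item()
        print(f"{name}: {sim:.3f}  {a!r} vs {b!r}")
```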
Thank you for your response!
However, I'm just a poor French PhD student, and training a supervised classifier with a BERT-based model on a large amount of labeled data is too time-consuming for me and too expensive for my laboratory :(
Don't worry, I already have a simple unsupervised method for text classification based on word embeddings, thanks to this paper, for extracting general themes like noise, food, staff, location, etc.
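Roughly, it works like this (a simplified sketch, not the paper's exact method: the seed words are invented, and I use sentence embeddings here for brevity where the paper works with word embeddings):

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical seed words per theme.
themes = {
    "noise": ["noise", "loud", "quiet"],
    "food": ["breakfast", "meal", "restaurant"],
    "staff": ["staff", "reception", "service"],
    "location": ["location", "center", "close"],
}

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
theme_embs = {t: model.encode(words) for t, words in themes.items()}

def classify(sentence: str) -> str:
    emb = model.encode(sentence)
    # Score each theme by its best-matching seed word, keep the max.
    return max(themes, key=lambda t: util.cos_sim(emb, theme_embs[t]).max().item())

print(classify("impossible to sleep because of the traffic"))  # noise
```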
I would like to extract more complex labels like "the noise comes from the hotel", and SetFit with a Sentence Transformer fine-tuned on my "RNLI" dataset seemed like a good idea, because there are so many ways to write and interpret the same opinion: "no double glazing" and "road is a problem for sleep" both express the idea that noise from outside is a problem. Furthermore, annotators can easily mislabel these opinions.
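For the RNLI fine-tuning itself, I was thinking of the classic NLI recipe for Sentence Transformers (SoftmaxLoss over entailment/neutral/contradiction); roughly like this, where the premise/hypothesis pairs below are invented examples and the output path is hypothetical:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Usual NLI label convention: 0 = entailment, 1 = neutral, 2 = contradiction.
train_examples = [
    InputExample(texts=["no double glazing", "noise from outside is a problem"], label=0),
    InputExample(texts=["the road is a problem for sleep", "the breakfast was cold"], label=1),
    InputExample(texts=["very quiet room", "noise from outside is a problem"], label=2),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("rnli-minilm")  # hypothetical output path
```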
So, thanks again for your answer; I will run some tests!
One last question: I couldn't find any information about what data these "best" models are trained on.
Hi!
I'm interested in using SetFit to classify text extracted from hotel reviews (Booking, Tripadvisor, etc.), but I would like to add domain knowledge to my Sentence Transformers body.
For example, this paper uses a Sentence Transformers model trained on a custom NLI dataset (RNLI, for Review Natural Language Inference) to extract product features without training on labeled data. The results show that training on a domain-based NLI dataset works better than MNLI for zero-shot aspect extraction.
So, is it a good approach to train my own Sentence Transformers model (or fine-tune a pre-trained one) on a domain-based NLI dataset to improve the performance of SetFit?
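Concretely, what I have in mind is something like this (a sketch; "rnli-minilm" is a hypothetical local path to a Sentence Transformer I would first fine-tune on my RNLI pairs):

```python
from datasets import Dataset
from setfit import SetFitModel, Trainer

# Load the domain-adapted Sentence Transformer as the SetFit body.
model = SetFitModel.from_pretrained("rnli-minilm")

# Then only a handful of labeled review snippets would be needed.
train_dataset = Dataset.from_dict({
    "text": ["the street was so loud at night", "lovely breakfast buffet"],
    "label": [0, 1],
})

trainer = Trainer(model=model, train_dataset=train_dataset)
trainer.train()

print(model.predict(["no double glazing in the rooms"]))
```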
Thank you in advance