joeddav / blog

https://joeddav.github.io/blog
Apache License 2.0

Zero-Shot Learning in Modern NLP | Joe Davison Blog #2

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Zero-Shot Learning in Modern NLP | Joe Davison Blog

State-of-the-art NLP models for text classification without annotated data

https://joeddav.github.io/blog/2020/05/29/ZSL.html

aced125 commented 3 years ago

Nice blog - I only had time to skim through the high level of each method. Which method does the transformers pipeline use?

joeddav commented 3 years ago

Nice blog - I only had time to skim through the high level of each method. Which method does the transformers pipeline use?

Thanks! The pipeline uses the NLI method.

dshahrokhian commented 3 years ago

This article is brilliantly written!

yurilla56 commented 3 years ago

Thank you, perfect article. Could you please suggest the most suitable way to classify a text (containing N sentences) to an expected label?

hishamkhrayzat51 commented 3 years ago

Thank you, amazing work. Can I see the code behind your online demo please?

joeddav commented 3 years ago

@hishamkhrayzat51 Yeah the repo is here.

clotildemiura commented 3 years ago

Hello, I'd like to know how many GPUs your API for zero-shot topic classification is running on. When I try to scan a 50-sentence text with 10 topics on Colab, it takes approximately 5 minutes per text. It looks like it's much faster on your web API, though.

Thank you for your answer,

Clotilde

joeddav commented 3 years ago

@clotildemiura It's slow if you're not on GPU since you have to run each text/candidate label pair through the model separately. If the web API is significantly faster, it's probably just because the results for examples you're looking at are cached. The web API is also just using CPU.

A few tips for speeding up the pipeline here.
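
To make the cost concrete, here is a minimal sketch of how an NLI-based zero-shot classifier pairs every text with every candidate label. The helper `build_nli_pairs` and the hypothesis template are illustrative assumptions, not the pipeline's actual internals; the point is only that the number of model inputs is the product of the two counts.

```python
from itertools import product

# Assumed hypothesis template, used here for illustration only.
HYPOTHESIS_TEMPLATE = "This example is {}."


def build_nli_pairs(texts, candidate_labels, template=HYPOTHESIS_TEMPLATE):
    """Return one (premise, hypothesis) pair per text/label combination."""
    return [
        (text, template.format(label))
        for text, label in product(texts, candidate_labels)
    ]


# Hypothetical stand-ins for a 50-sentence text and 10 topics.
sentences = [f"Sentence number {i}." for i in range(50)]
topics = [f"topic {j}" for j in range(10)]

pairs = build_nli_pairs(sentences, topics)
print(len(pairs))  # 500 premise/hypothesis pairs to run through the model
```

With 50 sentences and 10 topics, the model has to score 500 pairs, which is why running on CPU without batching is slow.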

clotildemiura commented 3 years ago

thank you very much @joeddav

amitness commented 3 years ago

This is very interesting.

I had read two other papers on zero-shot learning some time ago. The key ideas were:

  1. Training a binary classifier to predict if (text, label) pair match or not: (paper, summary)

  2. Training GPT-2 to generate the class given a multiple-choice question answer as prompt: (paper, summary)

gevezex commented 3 years ago

Really great article, Joe! This will especially work for English text, right? What would you advise for non-English languages that don't have MNLI datasets or NLI-trained BERT models?

joeddav commented 3 years ago

@gevezex Yep, I actually trained a model on a multilingual NLI dataset for this exact purpose! Tweet here: https://twitter.com/joeddav/status/1298997753075232772

agombert commented 3 years ago

Hey Joe, great article!

I have a silly question about this in the few-shot learning for the embedding approaches:

Take the top K most frequent words V in the vocabulary of a word2vec model

By the top K most frequent words, do you mean the top K from the corpus you are trying to classify?

Thanks for the multilingual NLI, btw!

joeddav commented 3 years ago

@agombert Glad you enjoyed it! Sorry, this was difficult to communicate. The format of word vector files typically orders the words by inverse frequency in the algorithm's train corpus. I meant the top K according to that ordering. So if you have a .vec file with 100k words (lines), just use the first K.
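
A minimal sketch of "just use the first K" is below. It assumes a plain-text vector file where each line is a word followed by its vector components, optionally preceded by a `<count> <dim>` header line; the function name `top_k_words` is hypothetical.

```python
import io


def top_k_words(lines, k):
    """Return the first k words from word-vector text lines.

    Assumes each line looks like "<word> <v1> <v2> ..." and that an
    optional first line of the form "<count> <dim>" is a header to skip.
    """
    if k <= 0:
        return []
    words = []
    for i, line in enumerate(lines):
        parts = line.rstrip("\n").split(" ")
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # skip the header line
        if not parts[0]:
            continue  # skip blank lines
        words.append(parts[0])
        if len(words) == k:
            break
    return words


# Tiny in-memory stand-in for a .vec file.
sample = io.StringIO(
    "4 3\n"
    "the 0.1 0.2 0.3\n"
    "of 0.0 0.1 0.4\n"
    "and 0.3 0.3 0.1\n"
    "zebra 0.9 0.8 0.7\n"
)
print(top_k_words(sample, 2))  # ['the', 'of']
```

For a real file, the same function can be given an open file handle, since it only iterates over lines and stops after K words.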

dlmwright commented 3 years ago

I'm wondering about using bigrams in the candidate labels, e.g. candidate_labels = ["not sustainable", "climate change", "environment pollution", "government state policy", "finance bank"]. Will these work? I think bigrams could add more context.

elderpinzon commented 3 years ago

Fantastic article!

Just a minor fix: the model name in the last code snippet should be facebook/bart-large-mnli.

kk2211 commented 3 years ago

Fascinating article, Joe. Is there any resource available on how to fine-tune such models with our own data? Thanks

sidharkal commented 3 years ago

Really great article keep it up

mtortoli commented 3 years ago

Hi Joe, thanks for your article!! Is it possible to fine-tune these models?

jackxxu commented 3 years ago

@joeddav thanks for the article. I find it very helpful.

do you happen to have the notebook/code available for mapping from s-bert to word2vec? I wonder how it is done and also how you generate the word2vec embedding for phrases such as "Science and Mathematics". 🤔

alisonreboud commented 2 years ago

Hi, thanks a lot for the article and notebook. Just a quick question: what is the default model in the pipeline? Is it BART MNLI?

Boodhayana commented 2 years ago

Can you please show me, or direct me to, a place where the fine-tuning is explained? I have about 1,000 sentences with their labels. I want to fine-tune this model on the task. During inference a subset of the labels will be used, so zero-shot learning would be the best way to go. But where you wrote "pass the sentence twice, once with correct label and once with incorrect label while optimising cross-entropy", I want to see how that is done using HuggingFace.
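
One way to read the quoted sentence is as a data-preparation step: each labeled sentence yields two NLI examples, one with its true label (entailment) and one with a randomly chosen wrong label (contradiction). The sketch below only builds those pairs; it is not the blog author's training code. The function name, the hypothesis template, and the label ids (0 for contradiction, 2 for entailment, the convention mentioned elsewhere in this thread for the MNLI checkpoint) are assumptions to check against your own model config.

```python
import random

# Assumed label ids for an MNLI-style classification head.
CONTRADICTION, ENTAILMENT = 0, 2


def make_nli_examples(dataset, all_labels, template="This example is {}.", seed=0):
    """Turn (text, label) rows into entailment/contradiction NLI examples."""
    rng = random.Random(seed)
    examples = []
    for text, true_label in dataset:
        # Pick one incorrect label at random for the negative example.
        wrong_label = rng.choice([l for l in all_labels if l != true_label])
        examples.append(
            {"premise": text, "hypothesis": template.format(true_label), "label": ENTAILMENT}
        )
        examples.append(
            {"premise": text, "hypothesis": template.format(wrong_label), "label": CONTRADICTION}
        )
    return examples


# Hypothetical toy data.
labels = ["sports", "politics", "science"]
rows = [("The team won the final.", "sports"), ("The senate passed the bill.", "politics")]
for example in make_nli_examples(rows, labels):
    print(example)
```

The resulting premise/hypothesis pairs can then be tokenized as sentence pairs and trained with an ordinary cross-entropy objective, like any other sequence-pair classification dataset.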

kurah commented 2 years ago

As @Boodhayana said, I would also love to see the actual code that carries out the fine-tuning. I also have a dataset that I want to fine-tune the bart-mnli zero-shot model on, but I can't find any examples of how to do so.

marouaghaouat commented 2 years ago

Could you please post the code you used to fine-tune bart-large-mnli on Yahoo Answers?

joeddav commented 2 years ago

Regrettably, I failed to save that code. If you need to fine-tune, I recommend first distilling a classifier using this script (https://github.com/huggingface/transformers/tree/main/examples/research_projects/zero-shot-distillation) and then fine-tuning the resulting model as you would any other classifier.

Boodhayana commented 2 years ago

@joeddav np at all. I was able to fine-tune the model successfully. Your blog and your answers in the HuggingFace forums helped me a lot. I have one concern, however. Since I am using the fine-tuned model in production, I need it to be fast (as fast as a normal text classifier). I have ~30 labels in my dataset. I am speeding up inference by running the fine-tuned HuggingFace model with "onnxruntime".

The code for 'onnx'-ing is below

python -m transformers.onnx --model=facebook/bart-large-mnli --feature=sequence-classification --atol=1e-04 dir/

Even after that, inference on one piece of text takes almost 2 seconds (it has to iterate through 30 labels).

Are there any methods to speed up inference further?

Does distillation help? Are there other methods I can use along with this? I want to match the inference time of normal text classification.

joeddav commented 2 years ago

@Boodhayana Distillation is exactly what you want. It will essentially train a student model, which is just a normal distilbert classifier, to mimic the predictions of the zero-shot teacher. You just need some example (unlabeled) data.
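
As a rough illustration of what "mimic the predictions of the teacher" means, the sketch below computes a soft-label cross-entropy between a teacher's class probabilities and a student's logits for one example. It is a plain-Python sketch under the assumption that the student is trained on the teacher's probability distribution rather than on hard labels; the function names are hypothetical, and the real script operates on batched tensors.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def distillation_loss(teacher_probs, student_logits):
    """Cross-entropy between teacher soft labels and student predictions."""
    log_q = log_softmax(student_logits)
    return -sum(p * lq for p, lq in zip(teacher_probs, log_q))


# Hypothetical teacher distribution over three classes for one text.
teacher = [0.7, 0.2, 0.1]
matching = [math.log(p) for p in teacher]  # student agrees with the teacher
mismatched = [math.log(p) for p in reversed(teacher)]  # student disagrees

print(distillation_loss(teacher, matching) < distillation_loss(teacher, mismatched))  # True
```

The loss is smallest when the student's distribution equals the teacher's, which is why no human labels are needed: the teacher's zero-shot scores on unlabeled text serve as the targets. At inference time the student needs a single forward pass per text instead of one per candidate label.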

tyatabe commented 2 years ago

@Boodhayana can you share, or direct me to a place that explains, how the fine-tuning is actually done?

tyatabe commented 2 years ago

@joeddav for distillation what should the candidate labels be? I think it should be the candidate labels you want to use for your application, regardless of what the text you're using for distillation is about. For example, if I want to train a model to classify movie summaries into genres, I could use the AG news data to distill a zero-shot model into a smaller one, using hypotheses labels like ['thriller', 'action', 'suspense', 'horror', 'comedy'], even though the AG news data has nothing to do with that. Then I could fine tune that distilled model with actual movie summary - genre data, right?

tyatabe commented 2 years ago

Hey, thank you for getting back to me. I'm very excited to see that post! In the meantime I'm actually trying my hand with pytorch, and I'm wondering how to encode my labels. As suggested in the zero-shot learning blog post, I'm only using the labels entailment and contradiction, but I'm unsure what are the actual encodings used in the model. From this kaggle competition https://www.kaggle.com/competitions/contradictory-my-dear-watson I saw they're using 0, 1, or 2 (corresponding to entailment, neutral, and contradiction). Should I set up my encodings this way also? (0 for entailment and 2 for contradiction?)

Thank you,

Tada

On Sat, May 21, 2022 at 3:16 AM boodhayana @.***> wrote:

@Boodhayana https://github.com/Boodhayana Distillation is exactly what you want. It will essentially train a student model, which is just a normal distilbert classifier, to mimic the predictions of the zero-shot teacher. You just need some example (unlabeled data).

I plan to write a blog using a public dataset. So please wait a few days since i am using a private dataset that i cant share outside

-- Tadaishi Yatabe R.

http://tadaishi.wixsite.com/tada http://tadaishi.wix.com/tada

Boodhayana commented 2 years ago

@tyatabe You can do it two ways

  1. Do it in the normal way (default id2label), which has 0 for contradiction, 2 for entailment. After training, just swap 0s with 2s and submit it.
  2. A more technically correct way is to create a config and give a new id2label and label2id dictionaries to the config with 0 for entailment and 2 for contradiction. You can do that as explained in this LINK

NOTE: According to the competition details, you should not ignore the neutral class. You should consider all three outcomes.
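
The first option amounts to a small post-processing step. A minimal sketch, assuming the model emits 0 for contradiction, 1 for neutral, and 2 for entailment while the target convention is the reverse (the dictionary and function names are hypothetical):

```python
# Assumed convention of the model's classification head.
MODEL_ID2LABEL = {0: "contradiction", 1: "neutral", 2: "entailment"}
# Target convention described for the competition.
TARGET_LABEL2ID = {"entailment": 0, "neutral": 1, "contradiction": 2}


def remap_predictions(pred_ids, id2label=MODEL_ID2LABEL, label2id=TARGET_LABEL2ID):
    """Translate predicted ids from the model's convention to the target one."""
    return [label2id[id2label[i]] for i in pred_ids]


print(remap_predictions([0, 1, 2, 2]))  # [2, 1, 0, 0]
```

Going through label names rather than swapping 0s and 2s by hand makes the mapping explicit and keeps the neutral class intact. The same two dictionaries are the kind of values the second option would place in the model config.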

Boodhayana commented 2 years ago

@joeddav I tried distillation after training my zero shot with 'bart-large-mnli'. I am using the parameters teacher_name_or_path and hypothesis_template along with classnames.txt and unlabeled_data.txt

I get the following error:

[INFO|trainer.py:1244] 2022-06-01 10:53:09,793 >> ***** Running training *****
[INFO|trainer.py:1245] 2022-06-01 10:53:09,793 >> Num examples = 1472
[INFO|trainer.py:1246] 2022-06-01 10:53:09,793 >> Num Epochs = 1
[INFO|trainer.py:1247] 2022-06-01 10:53:09,793 >> Instantaneous batch size per device = 32
[INFO|trainer.py:1248] 2022-06-01 10:53:09,793 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1249] 2022-06-01 10:53:09,793 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1250] 2022-06-01 10:53:09,793 >> Total optimization steps = 46
0%| | 0/46 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/boodhayana/ps2sem2/huggingface/distillatino/distill_classifier.py", line 338, in <module>
    main()
  File "/Users/boodhayana/ps2sem2/huggingface/distillatino/distill_classifier.py", line 328, in main
    trainer.train()
  File "/Users/boodhayana/.local/share/virtualenvs/huggingface-NdJ_jAKm/lib/python3.9/site-packages/transformers/trainer.py", line 1365, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/Users/boodhayana/.local/share/virtualenvs/huggingface-NdJ_jAKm/lib/python3.9/site-packages/transformers/trainer.py", line 1940, in training_step
    loss = self.compute_loss(model, inputs)
  File "/Users/boodhayana/ps2sem2/huggingface/distillatino/distill_classifier.py", line 119, in compute_loss
    target_p = inputs["labels"]
  File "/Users/boodhayana/.local/share/virtualenvs/huggingface-NdJ_jAKm/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 239, in __getitem__
    return self.data[item]
KeyError: 'labels'
0%| | 0/46 [00:00<?, ?it/s]

This error comes after tokenizer tokenizes the entire thing. I thought that the model uses label in config, so we should change labels to label, but since there is a custom compute_loss function, I'm not so sure anymore. Can you please tell me what i can do now?

jsoutherland commented 1 year ago

@Boodhayana I avoided this by following the suggestion at the bottom of this thread to downgrade to:

transformers==4.4.0
datasets==1.6.1

jamesjohnson1025 commented 1 year ago

Hi, below is the query I raised on Hugging Face. If you can answer it, I would love to hear from you. Thank you. The code is taken from https://huggingface.co/facebook/bart-large-mnli

Correct me if I am wrong, please. I took both versions, i.e. the code under the zero-shot classification pipeline and the code under the manual PyTorch version, and ran them against the labels ['Positive','Neutral','Negative'] for the sequence "one day I will see the world". Below are the results.

Results (from zero-shot classification pipeline) {'sequence': 'one day I will see the world', 'labels': ['Positive', 'Negative', 'Neutral'], 'scores': [0.48784172534942627, 0.26007547974586487, 0.25208279490470886]}

Results (from the manual PyTorch version, for the label 'Positive'): tensor([0.2946], grad_fn=)

If you compare the two results for the label Positive, there is a huge difference. I ran the exact same code given on the model page in order to test it. Am I doing anything wrong? Please help me. Thank you.

Extra information: the logit values from the manual PyTorch method after applying softmax: tensor([[0.0874, 0.8761, 0.0365]], grad_fn=)
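
A likely explanation is that the two numbers are normalized differently rather than one being wrong. The sketch below reproduces the manual score from the three-way softmax values quoted above, assuming they are ordered as (contradiction, neutral, entailment): dropping neutral and renormalizing over the remaining two classes is equivalent to a softmax over just those two logits. The pipeline scores, by contrast, sum to 1 across the three candidate labels, which is consistent with a softmax over the entailment logits of all labels in single-label mode. The logits in the second part are hypothetical, since the thread does not report them.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Three-way probabilities reported in the comment for the label "Positive",
# assumed to be ordered as (contradiction, neutral, entailment).
p_contra, p_neutral, p_entail = 0.0874, 0.8761, 0.0365

# Manual version: ignore neutral, compare entailment against contradiction.
manual_score = p_entail / (p_entail + p_contra)
print(round(manual_score, 4))  # 0.2946


def single_label_scores(entailment_logits):
    """Softmax of the entailment logits across all candidate labels."""
    return softmax(entailment_logits)


# Hypothetical entailment logits for Positive, Negative, Neutral.
print(single_label_scores([1.0, 0.4, 0.35]))
```

Under these assumptions the manual score answers "entailment or contradiction for this one label?", while the pipeline score answers "which of the candidate labels fits best?", so the two are not expected to match.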

km5ar commented 1 year ago

Hi,

Could you share the py file for the streamlit demo?

dangnguyenngochai commented 1 year ago

Hello,

Could you provide more references on the technique you mentioned of learning a projection matrix from one embedding space to another? Is it a separate model, or would the weights between these models be shared and updated jointly in an end-to-end setting?

bwbate commented 1 year ago

Your article has been quite helpful. But what originally caught my attention (and motivated me to incorporate Zero-Shot in my own project) was your live demo. I wanted to share it with a friend. Unfortunately, the site is currently throwing an error--"OSError: [Errno 28] No space left on device".

joeddav commented 1 year ago

@bwbate fixed, and moved to a space at https://huggingface.co/spaces/joeddav/zero-shot-demo

gattaloukik123 commented 1 year ago

Take the top K most frequent words V in the vocabulary of a word2vec model

I am trying to implement the latent embedding approach using an SBERT model for my phrases/documents and I want to use the word2vec projections for my class names. Do you think it is better to use a pretrained word2vec model? Or train it with a custom corpus or something?

joeddav commented 1 year ago

@gattaloukik123 I would almost always recommend using pretrained word embeddings (unless you have really weird data that doesn't look like normal text or something)

BrunoGomesCoelho commented 1 year ago

@joeddav your hugging face space currently fails with:

OSError: joeddav/xlm-roberta-large-xnli is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models' If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.