google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Fine-Tune encodings on unsupervised data? #448

Open datistiquo opened 5 years ago

datistiquo commented 5 years ago

When I first heard about BERT I was thrilled. But you still need labeled data when fine-tuning the model, right? Even though this should be better, and you need far less data than with most algorithms before BERT, you still need labeled data to get fine-tuned text encodings? I want to use (maybe fine-tuned) encodings for unsupervised document clustering or IR.

hsm207 commented 5 years ago

Yes, you need labeled data when fine-tuning the model.

If by fine-tuned text encodings you mean fine-tuned embeddings, then have a look at this script to extract the embeddings of your input (which are the outputs of each layer in the encoder).
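For reference, here is a rough sketch of how one might turn that script's output into fixed-length sentence vectors. The command in the comment follows the repo README; the JSONL field names ("features", "layers", "values") are what I recall the script writing, so double-check them against your own output file:

```python
# First produce the per-token embeddings with the repo's script, e.g.:
#   python extract_features.py --input_file=/tmp/input.txt --output_file=/tmp/output.jsonl \
#     --vocab_file=$BERT_BASE_DIR/vocab.txt --bert_config_file=$BERT_BASE_DIR/bert_config.json \
#     --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt --layers=-1 --max_seq_length=128
import json
import numpy as np

def sentence_vectors(jsonl_path):
    """Average each line's last-layer token embeddings into a single vector."""
    vectors = []
    with open(jsonl_path) as f:
        for line in f:
            example = json.loads(line)
            # each entry in "features" is one token with its per-layer activations
            token_vecs = [feat["layers"][0]["values"] for feat in example["features"]]
            vectors.append(np.mean(token_vecs, axis=0))
    return np.stack(vectors)

# embeddings = sentence_vectors("/tmp/output.jsonl")  # shape: (num_lines, hidden_size)
```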

datistiquo commented 5 years ago

I have a similar task to QA, but I just want to associate the answer with several questions (so maybe no labels, but I can create artificial labels). So I want to feed in my questions and answers to learn encodings. What do I need to change (within run_classifier.py)? I have no answer start and end positions, just the whole text of the answer.

So this would be a similar task to sentence similarity? Except I want to train document similarity.

hsm207 commented 5 years ago

It does sound like your task is similar to sentence similarity. If that is the case, have a look at the MNLI Processor to see how you should create your dataset. If you want one embedding for the question and another embedding for the answer, so that you can compute the similarity between them, then BERT is not the right model, since BERT concatenates the question and answer and produces only one embedding. In this case, you need something like a siamese network.
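For concreteness, a processor ultimately just turns each row of your data into an InputExample (defined in run_classifier.py) holding the two texts and a label; the texts and label below are made-up placeholders:

```python
from run_classifier import InputExample

# One MNLI-style training example: two pieces of text plus a label
# (for MNLI the labels are "entailment", "contradiction", "neutral").
example = InputExample(
    guid="train-1",
    text_a="How do I reset my password?",                 # sentence 1 / question
    text_b="Click 'Forgot password' on the login page.",  # sentence 2 / answer
    label="entailment")
```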

datistiquo commented 5 years ago

Thanks.

If you want one embedding for the question and another embedding for the answer, so that you can compute the similarity between them, then BERT is not the right model, since BERT concatenates the question and answer and produces only one embedding

I have never read this before. You mean in the case of sentence similarity the two are concatenated? Does this happen inside the MNLI Processor?

Could you use the sentence similarity task for longer text (more than two sentences)?

But it should actually be possible with BERT to build a classifier where you feed in questions and answers and the corresponding label, like in the SQuAD task but without positions?

I thought that, as in the SQuAD task, BERT learns how to associate the learned question and answer encodings?

hsm207 commented 5 years ago

I have never read this before. You mean in the case of sentence similarity the two are concatenated? Does this happen inside the MNLI Processor?

Yes, the two are concatenated, and it happens in the MNLI Processor. Check out the sentence pair classification diagram in the paper.
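Very roughly, the feature conversion in run_classifier.py packs the pair into one sequence along these lines (a simplified sketch that omits truncation and padding; `tokenizer` would be the repo's FullTokenizer):

```python
# Simplified sketch of how a text pair becomes a single BERT input.
tokens_a = tokenizer.tokenize(text_a)   # e.g. the question / sentence 1
tokens_b = tokenizer.tokenize(text_b)   # e.g. the answer / sentence 2

tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# Self-attention then runs over the whole concatenated sequence,
# so the two texts are encoded jointly rather than separately.
```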

Could you use the sentence similarity task for longer text (more than two sentences)?

Yes. In the paper the term 'sentence' actually refers to a piece of text, so it can be a paragraph, a document, etc., provided that its tokenized input fits into BERT (max 512 tokens).

But it should actually be possible with BERT to build a classifier where you feed in questions and answers and the corresponding label, like in the SQuAD task but without positions?

I thought that, as in the SQuAD task, BERT learns how to associate the learned question and answer encodings?

In the SQuAD task, BERT learns to associate the question encodings with the input paragraph encodings in order to extract the answer's start and end positions. Check out the SQuAD fine-tuning diagram in the paper.

If you want to learn to associate question encodings with answer encodings, then you probably want to treat this as a sentence pair classification task. In this case, I'm not sure whether you could treat the encodings from [CLS] to the first [SEP] as the question encoding and the rest as the answer encoding.

datistiquo commented 5 years ago

Thank you very much!

I just wonder why I was so focused on feeding in the question and answer together with a label. I thought that in this case the classification would be better, since I supposed that an encoding is learned from both the question and the answer, rather than just a single classification with the question plus labels. Am I right that in standard SQuAD training, attention is learned between question and answer (such that the importance of question keywords is learned for the specific answer/label)?

Is attention also learned in the sentence pair classification task (the importance of words signalling the right class)?

That is why I wanted to apply this to my case, similar to SQuAD but without the positions. So it should be possible to replace the start and end positions with just one label and add a dense output layer?

EDIT: Just wondering, is it possible that the SQuAD task and sentence pair classification actually reduce to the same thing when ignoring the position of the answer in SQuAD?

hsm207 commented 5 years ago

Am I right that in standard SQuAD training, attention is learned between question and answer (such that the importance of question keywords is learned for the specific answer/label)?

I think that's wrong. Look at the diagram for SQuAD training: during training, attention is applied between the question and the input paragraph only. Based on the association between the question and the input paragraph, the model figures out which part of the input paragraph is the answer to the question.
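For intuition, the SQuAD head in run_squad.py is essentially one linear layer over every token's final hidden state, giving each paragraph token a start logit and an end logit. A simplified sketch (not the exact repo code):

```python
import tensorflow as tf

def squad_span_logits(final_hidden, hidden_size):
    """final_hidden: [batch, seq_len, hidden]. Returns start/end logits per token."""
    output_weights = tf.get_variable(
        "squad_output_weights", [2, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    output_bias = tf.get_variable(
        "squad_output_bias", [2], initializer=tf.zeros_initializer())

    # Project every token's hidden state to two numbers: (start logit, end logit).
    logits = tf.einsum("bsh,oh->bso", final_hidden, output_weights) + output_bias
    start_logits, end_logits = tf.unstack(logits, axis=-1)  # each [batch, seq_len]
    return start_logits, end_logits
```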

Is attention also learned in the sentence pair classification task (the importance of words signalling the right class)?

Yes, this is correct.

So it should be possible to replace the start and end positions with just one label and add a dense output layer?

You mean to pair up your question and answer and then have the model classify whether the answer matches the question? If yes, then you can definitely do this, as this is basically sentence pair classification.

Just wondering, is it possible that the SQuAD task and sentence pair classification actually reduce to the same thing when ignoring the position of the answer in SQuAD?

If you ignore the answer position in SQuAD, then your input is a pair of question and input paragraph. If you want to treat this as sentence pair classification, then what are you trying to classify? If you are trying to determine whether the input paragraph answers the question, then the answer is always going to be yes, because by design the answer to any question in SQuAD is always contained in the input paragraph. Therefore, in my opinion, Q&A problems like SQuAD are very different from sentence pair classification.

datistiquo commented 5 years ago

Thanks. OK, there might be some confusion. For me the answer is equal to the input paragraph. I have a simple QA problem: just associating a text with a question.

I think that's wrong. Look at the diagram for SQuAD training: during training, attention is applied between the question and the input paragraph only. Based on the association between the question and the input paragraph, the model figures out which part of the input paragraph is the answer to the question.

If the input paragraph is the whole answer text, then you have the association I supposed (between answer and question)?

If you ignore the answer position in SQuAD, then your input is a pair of question and input paragraph. If you want to treat this as sentence pair classification, then what are you trying to classify?

As I saw, in sentence pair classification you also have a label like 'neutral'... So I would do classification with Q, A and a label. I also find it confusing why you need a label for sentence pair classification. I want it like you stated, but that is wrong since you need labels!

hsm207 commented 5 years ago

If the input paragraph is the whole answer text, then you have the association I supposed (between answer and question)?

Yes, that's correct.

As I saw, in sentence pair classification you also have a label like 'neutral'... So I would do classification with Q, A and a label. I also find it confusing why you need a label for sentence pair classification. I want it like you stated, but that is wrong since you need labels!

In the sentence pair example with 'neutral', the problem was natural language inference. More specifically, given sentence 1 and sentence 2, you want to know whether sentence 2 logically follows from sentence 1 (entailment), the pair of sentences contradict each other (contradiction), or nothing can be inferred from the sentence pair (neutral). So you need labels so that the model can learn from them. Also, notice that the classification is done using the [CLS] token, which attends to every token in the sentence pair, so there is no separate embedding for sentence 1 and sentence 2.
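That [CLS]-based setup is essentially what run_classifier.py does: the pooled [CLS] output goes through a single softmax layer. A simplified sketch:

```python
import tensorflow as tf

def pair_classification_logits(pooled_output, hidden_size, num_labels):
    """pooled_output: [batch, hidden], the pooled [CLS] representation."""
    output_weights = tf.get_variable(
        "output_weights", [num_labels, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    output_bias = tf.get_variable(
        "output_bias", [num_labels], initializer=tf.zeros_initializer())

    logits = tf.nn.bias_add(
        tf.matmul(pooled_output, output_weights, transpose_b=True), output_bias)
    return logits, tf.nn.softmax(logits, axis=-1)
```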

In your case, why not just do sentence pair classification with label 1 (the pair is an answer) and 0 (the pair is not an answer)? Why do you want separate embeddings for question and answer?
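A minimal sketch of such a processor, following the DataProcessor pattern in run_classifier.py (the question/answer/label TSV layout is just an assumption, so adapt it to your files):

```python
import os
from run_classifier import DataProcessor, InputExample

class QaPairProcessor(DataProcessor):
    """Binary QA matching: '1' = the answer fits the question, '0' = it does not."""

    def get_train_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        examples = []
        for i, line in enumerate(lines):
            # assumed TSV columns: question <tab> answer <tab> label
            question, answer, label = line[0], line[1], line[2]
            examples.append(InputExample(
                guid="%s-%d" % (set_type, i),
                text_a=question, text_b=answer, label=label))
        return examples
```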

datistiquo commented 5 years ago

So you need labels so that the model can learn from them.

I assume sentence pair classification classifies the pair into labels like the three above. But why can't you label a sentence pair with a corresponding intent and have intent1, ..., intentN as labels? Technically there is no difference, right? In the NLI case it learns when the pair contradicts or is neutral, so it should also learn to discriminate different intents for pairs?

In your case, why not just do sentence pair classification with label 1 (the pair is an answer) and 0 (the pair is not an answer)?

That sounds like another good option. But how many examples would I need for each pair such that other pairs are not an answer? Let's say I have 100 Q and A pairs. Would it be a good idea to have one example for the pair that is the answer and then, for the Q of this pair, label the remaining 99 pairs (with the answers from all other Qs) as "not the answer"?

hsm207 commented 5 years ago

But why can't you label a sentence pair with a corresponding intent and have intent1, ..., intentN as labels? Technically there is no difference, right?

Yes, there is no difference. In sentence pair classification, the number of labels can be any integer >= 2. It all depends on the problem you are trying to solve.

In the NLI case it learns when the pair contradicts or is neutral, so it should also learn to discriminate different intents for pairs?

I don't understand your question. Could you provide an example to illustrate what you mean?

But how many examples would I need for each pair such that other pairs are not an answer?

As many as possible :)

You could try this approach. Let's say you have 100 pairs of Qs and As. Then for each question, randomly sample, say, 5 answers from the other 99 answers (excluding the correct one). Now you will have 5 "not answer" pairs per question. You can experiment with the number of samples; you don't want to sample too many, since that would make your data imbalanced and the classifier would probably just classify everything as "not answer". Please let me know how this works for you if you decide to follow this approach.
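A small sketch of that sampling scheme in plain Python (the qa_pairs structure is just an assumption for illustration):

```python
import random

def build_training_pairs(qa_pairs, num_negatives=5, seed=13):
    """qa_pairs: list of (question, answer) tuples. Returns (question, answer, label) triples."""
    rng = random.Random(seed)
    answers = [a for _, a in qa_pairs]
    examples = []
    for i, (question, answer) in enumerate(qa_pairs):
        examples.append((question, answer, "1"))        # the true pair
        other_answers = answers[:i] + answers[i + 1:]   # exclude the correct answer
        for neg in rng.sample(other_answers, min(num_negatives, len(other_answers))):
            examples.append((question, neg, "0"))       # a sampled "not the answer" pair
    return examples
```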

datistiquo commented 5 years ago

But why can't you label a sentence pair with a corresponding intent and have intent1, ..., intentN as labels? Technically there is no difference, right?

Yes, there is no difference. In sentence pair classification, the number of labels can be any integer >= 2. It all depends on the problem you are trying to solve.

In the NLI case it learns when the pair contradicts or is neutral, so it should also learn to discriminate different intents for pairs?

I don't understand your question. Could you provide an example to illustrate what you mean?

I basically meant the same thing. In NLI it learns to distinguish pairs belonging to three different labels like 'neutral'. So why not have intent1, ..., intentN for the different intents of the question as the labels? This should work too, as you said above.

hsm207 commented 5 years ago

I basically meant the same thing. In NLI it learns to distinguish pairs belonging to three different labels like 'neutral'. So why not have intent1, ..., intentN for the different intents of the question as the labels? This should work too, as you said above.

Thanks for the clarification. So you want to label a question based on its intent and then, given a question, predict the intent? That is just single sentence classification.
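In run_classifier.py terms, that just means leaving text_b empty, e.g. (the intent label here is a made-up placeholder):

```python
from run_classifier import InputExample

# Single sentence classification: only the question, no second segment.
example = InputExample(guid="train-1",
                       text_a="How do I reset my password?",
                       text_b=None,
                       label="intent_password_reset")
```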

Or are you trying to classify intent based on a pair of question and answer?

datistiquo commented 5 years ago

1.) I think I will try both ways. With the second way, I suppose it is the same as what we discussed above in comparison to NLI? There you also have a label plus the Q and A!

2.) Basically I want to do this here:

BERT is able to solve NLP tasks that involve text classification given a pair of input texts. An example of such a problem is classifying whether two pieces of text are semantically similar

from here: https://medium.com/@_init_/why-bert-has-3-embedding-layers-and-their-implementation-details-9c261108e28a

You have Q, A pairs and the model should learn where to pay attention to learn similarity. How would I do that? Is it pair classification, where you again need labels?

hsm207 commented 5 years ago

1.) I think I will try both ways. With the second way, I suppose it is the same as what we discussed above in comparison to NLI? There you also have a label plus the Q and A!

Yes, you are right.

You have Q, A pairs and the model should learn where to pay attention to learn similarity. How would I do that? Is it pair classification, where you again need labels?

Yes, it is pair classification, and you need labels to indicate which pairs are "similar".

aabirouch-pi commented 3 years ago

Hi everyone, I guess my question relates to the same issue: I am working on an NLP project for informal languages. One of the steps in this project is to generate word embeddings for a specific informal language using a pretrained BERT model trained on Standard Arabic. The idea is to take the pretrained model (trained on Arabic) and fine-tune it on a corpus of the informal language I have (which is close to Arabic in terms of morphology). I've looked over many tutorials and all I can find are examples for classification problems, not for word embedding training and generation. If you know any tutorial or blog post that could help, please share it. Thank you.