google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

[Clarification] Feature vectors : Creating the input file #34

Closed astariul closed 5 years ago

astariul commented 5 years ago

As I understand it, we need to give extract_features.py the dataset that will be used for the model built on top of the BERT embeddings. This allows the model to do supplementary training on data specific to that dataset. Two sentences are used (separated by '|||') in order to train the next sentence prediction feature. Right?


From the paper :

To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can be shorter also)

If I want to create my input file from a dataset where each example is a document, should I take the same approach (splitting in the middle, even if there are more than 2 sentences), or strictly split at every sentence? Which approach will give the best accuracy?


For example, let's say I have this data row:

doc1 = "Sentence 1. Sentence 2. Sentence 3."
doc2 = "Sentence 4. Sentence 5."
label = X

Then should I split like this:

Sentence 1. Sentence 2. ||| Sentence 3.
Sentence 4. ||| Sentence 5.

Or like this:

Sentence 1. ||| Sentence 2.
Sentence 2. ||| Sentence 3.
Sentence 4. ||| Sentence 5.

Or any other way I didn't think of? (I should not link Sentence 3 and Sentence 4 together, right? Since they potentially do not follow each other.)


Thanks again for the brilliant work.

jacobdevlin-google commented 5 years ago

This is actually two different things. If you want to run pre-training for additional steps on an in-domain text corpus, you should use create_pretraining_data.py and run_pretraining.py, as specified in Pre-training with BERT. In this case, you should put one sentence per line, and the create_pretraining_data.py script will pack them to the max sequence length.
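The input file for create_pretraining_data.py is plain text with one sentence per line and a blank line between documents, e.g. (placeholder sentences):

This is the first sentence of document one.
This is the second sentence of document one.

Document two starts here.
This is the second sentence of document two.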

If you're running a pairwise classification task like MultiNLI or MRPC (or even SQuAD) and you want to do feature-based training rather than fine-tuning, then you should pack it as Sentence A ||| Sentence B and call extract_features.py. The purpose of the ||| symbol is so that the script puts the sentence embedding tokens in the right place. But it's not being used to train the next sentence prediction, it's being used to be consistent with how the model was already pre-trained. E.g., for SQuAD, Sentence A is the question and Sentence B is the paragraph, so it would be like:

who was the 16th president of the united states ? ||| abraham lincoln was the 16th president of the united states ...
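For the feature-based route, extract_features.py writes one JSON object per input line with per-token vectors for the requested layers. A rough sketch of reading that output back (assuming the JSONL fields the script writes: "features", "token", "layers" with "index"/"values"; the helper name and the single-layer choice are just for illustration):

import json
import numpy as np

def load_token_vectors(jsonl_path, layer_index=-1):
    # Returns, for each input line, a list of (token, vector) pairs
    # taken from the requested layer (-1 = last layer by default).
    examples = []
    with open(jsonl_path) as f:
        for line in f:
            example = json.loads(line)
            tokens = []
            for feat in example["features"]:
                layer = next(l for l in feat["layers"] if l["index"] == layer_index)
                tokens.append((feat["token"], np.array(layer["values"], dtype=np.float32)))
            examples.append(tokens)
    return examples

Each returned entry corresponds to one line of the input file (a single sentence, or a Sentence A ||| Sentence B pair packed into one sequence), with [CLS] and [SEP] among the tokens.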

astariul commented 5 years ago

@jacobdevlin-google

The purpose of the ||| symbol is so that the script puts the sentence embedding tokens in the right place

So there might or might not be a link between the 2 sentences? In the given example, the 2 sentences are related. But in the case of MultiNLI, for example, if we have contradictory sentences, we still pack them together with |||, so they get placed correctly, no matter how related they are?

Like, doing this is okay? (contradiction)

A man inspects the uniform of a figure in some East Asian country ||| The man is sleeping


More generally, what is the effect of doing this in the input file:

A man inspects the uniform of a figure in some East Asian country ||| The man is sleeping

instead of this:

A man inspects the uniform of a figure in some East Asian country
The man is sleeping

jacobdevlin-google commented 5 years ago

Doing this:

A man inspects the uniform of a figure in some East Asian country ||| The man is sleeping

Will create a single sequence with Sentence A embeddings for the first part and Sentence B embeddings for the second part. In this case the feature vector of each word in both sentences will be conditioned on both sentences. For something like MultiNLI, this will be better, because this is how the next sentence prediction was trained and it's also how we fine-tune.

Doing this:

A man inspects the uniform of a figure in some East Asian country
The man is sleeping

Will extract out two independent representations. In this case the feature vector for each word will only be conditioned on the words from that sentence. This will probably still do OK because you will condition the representations on one another in your fine-tuning network, but it will probably not be as good as the first way for sentence pair tasks.

Of course, for single sentence tasks you should do it the second way. Also if you want to train a "dual encoder" style network where you generate a single vector for each sentence and then take the dot product, you should also do it the second way.
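A minimal sketch of that dual encoder scoring (not from this repo; the mean/[CLS] pooling choice is an assumption you'd want to tune), assuming you already have one list of (token, vector) pairs per sentence, extracted independently with one sentence per input line and no |||:

import numpy as np

def pool(token_vectors, strategy="mean"):
    # Collapse one sentence's token vectors into a single sentence vector.
    # "mean" averages all tokens; "cls" takes the first token ([CLS]).
    vecs = np.stack(token_vectors)
    return vecs[0] if strategy == "cls" else vecs.mean(axis=0)

def dual_encoder_score(tokens_a, tokens_b, strategy="mean"):
    # tokens_a / tokens_b: lists of (token, vector) pairs for two sentences
    # whose representations were not conditioned on each other.
    a = pool([v for _, v in tokens_a], strategy)
    b = pool([v for _, v in tokens_b], strategy)
    return float(np.dot(a, b))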

astariul commented 5 years ago

Thank you for your clarification! Your advice on single sentence tasks is really helpful ^^

I have 2 last questions:

In the given example, the 2 sentences are a contradiction. You said:

In this case the feature vector of each word in both sentences will be conditioned on both sentences

Obviously we want this behavior for entailed sentences. But what about contradictory sentences? Don't we want to link them as little as possible?


I want to do a "dual-encoder" network, so as you said, I will use the second way to get the feature vectors. But in my case, I am working with documents (made of several sentences). Should I treat each document as one big sentence, or should I separate each sentence of the document and link them in the input file with ||| (still keeping documents separate)?

jacobdevlin-google commented 5 years ago

If your task is MultiNLI, then you're trying to predict whether the sentences entail or contradict, so your input representation can't depend on whether the sentences are entailed or contradictory.

You can only have at most a single ||| per example because only sentence pairs were used for pre-training. I'm not sure exactly what your task is, but for example, if you're doing a QA-type task, then the sentence pair representation would be like this:

question ||| document

So if you were doing a dual encoder, then you should get representations of the question and document that are not conditioned on one another:

question
document

So your document should be on its own line with no |||. Of course if your document is longer than 512 tokens you'll need to split it up. The extract_features.py script will truncate it so you might need to modify that.
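A rough sketch of that pre-splitting (not part of the repo; the naive ". " sentence split and the greedy packing are just illustrations, and a real sentence splitter would be better) using the repo's tokenization.FullTokenizer to count WordPiece tokens:

import tokenization  # tokenization.py from this repo

def split_document(text, vocab_file, max_seq_length=512, do_lower_case=True):
    # Greedily pack sentences into chunks whose WordPiece length stays under
    # max_seq_length, reserving 2 positions for [CLS] and [SEP].
    tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file,
                                           do_lower_case=do_lower_case)
    budget = max_seq_length - 2
    sentences = text.split(". ")  # naive sentence split, just for illustration
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(tokenizer.tokenize(sent))
        if current and current_len + n > budget:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks  # write each chunk on its own line in the input file

Each chunk then goes on its own line (still without |||), and you aggregate the per-chunk vectors however suits your dual encoder.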

astariul commented 5 years ago

you're trying to predict whether the sentences entail or contradict, so your input representation can't depend on whether the sentences are entailed or contradictory.

I feel dumb now, but this sentence makes it very clear, thanks.

So your document should be on its own line with no |||. Of course if your document is longer than 512 tokens you'll need to split it up.

Thank you very much for your time and answers! I finally got it ^^