Closed astariul closed 5 years ago
This is actually two different things. If you want to run pre-training for additional steps on an in-domain text corpus, you should use create_pretraining_data.py and run_pretraining.py, as specified in Pre-training with BERT. In this case, you should put one sentence per line, and the create_pretraining_data.py script will pack them to the max sequence length.
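The packing step can be illustrated with a toy sketch (using whitespace tokens as a stand-in; the real script works on WordPiece tokens and also handles masking, document boundaries, and so on):

```python
def pack_sentences(lines, max_seq_length=128):
    """Greedily pack whitespace-tokenized sentences into sequences
    of at most max_seq_length tokens -- a toy stand-in for what
    create_pretraining_data.py does with WordPiece tokens."""
    sequences, current = [], []
    for line in lines:
        tokens = line.split()
        if current and len(current) + len(tokens) > max_seq_length:
            sequences.append(current)
            current = []
        # a single over-long sentence is truncated here
        current.extend(tokens[:max_seq_length])
    if current:
        sequences.append(current)
    return sequences

packed = pack_sentences(
    ["the cat sat", "on the mat", "dogs bark loudly at night"],
    max_seq_length=8)
# each packed sequence holds at most 8 tokens
```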
If you're running a pairwise classification task like MultiNLI or MRPC (or even SQuAD) and you want to do feature-based training rather than fine-tuning, then you should pack it as SentenceA ||| SentenceB and call extract_features.py. The purpose of the ||| symbol is so that the script puts the sentence embedding tokens in the right place. But it's not being used to train next sentence prediction; it's being used to be consistent with how the model was already pre-trained. E.g., for SQuAD, SentenceA is the question and SentenceB is the paragraph, so it would be like:
who was the 16th president of the united states ? ||| abraham lincoln was the 16th president of the united states ...
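For illustration, building such an input file is just one ||| -separated pair per line (the helper name here is made up; extract_features.py simply splits each input line on the ||| delimiter):

```python
def format_pair(sentence_a, sentence_b):
    """Format one input line for extract_features.py: the script
    splits on ' ||| ' to assign sentence A vs. sentence B
    segment embeddings. (Hypothetical helper, not part of BERT.)"""
    return "{} ||| {}".format(sentence_a, sentence_b)

line = format_pair(
    "who was the 16th president of the united states ?",
    "abraham lincoln was the 16th president of the united states ...")
```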
@jacobdevlin-google
The purpose of the ||| symbol is so that the script puts the sentence embedding tokens in the right place
So there might or might not be a link between the two sentences? In the given example, the two sentences are related. But in the case of MultiNLI, for example, if we have contradictory sentences, we still pack them together with |||, so the embeddings end up in the right place, no matter how related the sentences are?
So doing this is okay? (a contradiction pair)
A man inspects the uniform of a figure in some East Asian country ||| The man is sleeping
More generally, what is the effect of doing this in the input file:
A man inspects the uniform of a figure in some East Asian country ||| The man is sleeping
instead of this :
A man inspects the uniform of a figure in some East Asian country
The man is sleeping
Doing this:
A man inspects the uniform of a figure in some East Asian country ||| The man is sleeping
Will create a single sequence with SentenceA embeddings for the first part and SentenceB embeddings for the second part. In this case the feature vector of each word in both sentences will be conditioned on both sentences. For something like MultiNLI, this will be better, because this is how the next sentence prediction was trained and it's also how we fine-tune.
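As a rough sketch of what "SentenceA embeddings for the first part and SentenceB embeddings for the second part" means, assuming the standard BERT [CLS]/[SEP] packing convention:

```python
def build_segments(tokens_a, tokens_b=None):
    """Assemble BERT-style input tokens and segment ids:
    segment 0 for [CLS] + sentence A + [SEP], segment 1 for
    sentence B + [SEP] (standard BERT convention)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segs = build_segments(
    ["the", "man", "inspects"], ["the", "man", "sleeps"])
# tokens: [CLS] the man inspects [SEP] the man sleeps [SEP]
# segs:    0    0   0    0        0     1   1    1     1
```

Because both halves live in one sequence, self-attention lets every token attend to both sentences, which is what "conditioned on both sentences" means above.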
Doing this:
A man inspects the uniform of a figure in some East Asian country
The man is sleeping
Will extract out two independent representations. In this case the feature vector for each word will only be conditioned on the words from that sentence. This will probably still do OK because you will condition the representations on one another in your fine-tuning network, but it will probably not be as good as the first way for sentence pair tasks.
Of course, for single sentence tasks you should do it the second way. Also if you want to train a "dual encoder" style network where you generate a single vector for each sentence and then take the dot product, you should also do it the second way.
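A minimal sketch of the dual-encoder scoring, assuming you already have one pooled vector per sentence (the function name is made up; it computes a normalized dot product, i.e. cosine similarity):

```python
import math

def dual_encoder_score(vec_a, vec_b):
    """Dual-encoder similarity: each sentence is encoded
    independently, then the two vectors are compared with a
    length-normalized dot product (cosine similarity)."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(x * x for x in vec_b))
    return dot / (norm_a * norm_b)
```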
Thank you for your clarification! Your advice on single-sentence tasks is really helpful ^^
I have two last questions:
In the given example, the two sentences are a contradiction. You said:
In this case the feature vector of each word in both sentences will be conditioned on both sentences
Obviously we want this behavior for entailed sentences. But for contradictory sentences? Don't we want to link them as little as possible?
I want to build a "dual-encoder" network, so as you said, I will use the second way to get the feature vectors.
But in my case, I am working with documents (each made of several sentences).
Should I treat each document as one big sentence, or should I separate each sentence of the document and link them in the input file with ||| (still separating each document)?
If your task is MultiNLI, then you're trying to predict whether the sentences entail or contradict each other, so your input representation can't depend on whether they are entailing or contradictory.
You can only have at most a single ||| per example because only sentence pairs were used for pre-training. I'm not sure exactly what your task is, but for example, if you're doing a QA-type task, then the sentence pair representation would be like this:
question ||| document
So if you were doing a dual encoder then you should get representations of the question and document that are not conditioned on one another:
question
document
So your document should be on its own line with no |||. Of course, if your document is longer than 512 tokens you'll need to split it up. The extract_features.py script will truncate it, so you might need to modify that.
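A rough sketch of splitting a long document, using whitespace tokens as a stand-in for WordPiece (the real 512-position limit applies after WordPiece tokenization, and [CLS]/[SEP] consume two positions, so stay conservative):

```python
def split_document(text, max_tokens=510):
    """Split a long document into chunks of at most max_tokens
    whitespace tokens, leaving room for [CLS] and [SEP]. A toy
    approximation: the real limit counts WordPiece tokens."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

chunks = split_document("some very long document text ...", max_tokens=510)
# each chunk then goes on its own input line for extract_features.py
```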
you're trying to predict whether the sentences entail or contradict each other, so your input representation can't depend on whether they are entailing or contradictory.
I feel dumb now, but this sentence makes it very clear, thanks.
So your document should be on its own line with no |||. Of course, if your document is longer than 512 tokens you'll need to split it up.
Thank you very much for your time and answer ! I got it finally ^^
As I understand it, we need to give the extract_features.py script the dataset we will use for the model built on top of BERT embeddings. This allows the model to do supplementary training on data specific to the dataset. Two sentences are used (separated by '|||') in order to train the next sentence prediction feature. Right? From the paper:
If I want to create my input file from a dataset where each data point is a document, should I take the same approach (splitting in the middle, even when there are more than two sentences), or strictly split every sentence? Which approach will give the best accuracy?
For example, let's say I have this data row:
Then should I split like this:
Sentence 1. Sentence 2. ||| Sentence 3.
Sentence 4. ||| Sentence 5.
Or like this :
Sentence 1. ||| Sentence 2.
Sentence 2. ||| Sentence 3.
Sentence 4. ||| Sentence 5.
Or any other way I didn't think of? (I should not link Sentence 3 and Sentence 4 together, right? Since they potentially don't follow each other.)
Thanks again for the brilliant work.