google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Extracting features for long sequences / SQuAD #66

Closed ethanjperez closed 5 years ago

ethanjperez commented 5 years ago

Would it be difficult for you to release a feature extractor that works for sequences longer than 512, in the same way long SQuAD passages are handled? That would be quite helpful for training small models on longer sequences (e.g., some SQuAD passages) and especially for tasks (e.g., summarization) whose inputs were previously prohibitively long for those without much compute.

Also, for SQuAD specifically, do you know how many passages are omitted for being longer than 512 tokens?

Thanks for your help and for releasing the pre-trained weights and code. This is a great service to the community 👍

jacobdevlin-google commented 5 years ago

For SQuAD we actually used 384 and about 5% of passages were longer than that. My recommendation would be to modify extract_features.py to take a sliding window approach. So if you have:

the man went to the store and bought a gallon of milk.

With a maximum length of 6, you don't want to do:

the man went to the store
and bought a gallon of milk

Because then the words at the edges won't have context. Instead, what you want to do is use overlapping windows:

the man went to the store
to the store and bought a
and bought a gallon of milk

And then take the representation of each word from the window where it has maximal context. (E.g., the version of "store" in the second window has maximal context.)

You can see run_squad.py for how we do it in the fine-tuning method, and then apply the same technique in extract_features.py.

We will probably not release a feature extractor that does this specifically, because there are too many use cases and details to get right in every case (e.g., SQuAD is different from other scenarios because it needs to include the question, but other tasks might not).
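A minimal sketch of the overlapping-window split and "maximal context" selection described above; the helper names and the stride value are illustrative rather than taken from the repository, and the scoring only loosely mirrors the doc_stride / max-context logic in run_squad.py:

```python
def sliding_windows(tokens, max_len, stride):
    """Split a token list into overlapping windows of at most max_len tokens."""
    windows = []
    start = 0
    while start < len(tokens):
        windows.append((start, tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows


def window_with_max_context(windows, token_index):
    """Return the index of the window in which token_index has the most
    surrounding context, scored as min(left context, right context) --
    similar in spirit to the max-context check in run_squad.py."""
    best_window, best_score = None, -1
    for window_id, (start, window_tokens) in enumerate(windows):
        end = start + len(window_tokens) - 1
        if not start <= token_index <= end:
            continue
        score = min(token_index - start, end - token_index)
        if score > best_score:
            best_window, best_score = window_id, score
    return best_window


tokens = "the man went to the store and bought a gallon of milk".split()
windows = sliding_windows(tokens, max_len=6, stride=3)
# "store" is token 5; it appears in the first two windows, and the second
# window gives it context on both sides, so that is the copy you would keep.
print(window_with_max_context(windows, token_index=5))  # -> 1
```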

ethanjperez commented 5 years ago

Okay got it, that makes sense. Thanks for your help!

wayfarerjing commented 5 years ago

I understand that the sliding-window technique can be applied in the SQuAD task. But what if it is a classification task? How should the ground-truth label be assigned to the sub-texts? I don't think assigning each sub-text the same label as its original long text is very harmful, but I don't think it's entirely reasonable either.

Have you been able to figure this out? Thank you.

billiebaechan commented 5 years ago

Hi there,

Does anyone have advice on applying the sliding window technique for classification? As @wayfarerjing commented above, assigning the same label to all the windows produced from the same long body of text might be detrimental to the performance since it produces an unbalanced distribution of labels in the training set.

I appreciate any advice you might have. Thank you.

MatthijsRijlaarsdam commented 5 years ago

Hi there,

We've made a sliding-window approach that somewhat works for classification, at least for non-training purposes (we only use it at prediction time).

We have made our own dataset of (long) scientific papers and multiple-choice questions that we want BERT to answer. Because of the labelling issue @billiebaechan mentions, we train on the RACE dataset, which has sequences shorter than 512. For our own dataset, we split all the input texts into snippets of 350 characters with a stride of 35, keeping the rest of the input length for our questions and answers. The labels are kept the same for all snippets. We then print the questions + given answers + logits to a file, and keep as the final prediction the answer with the highest logit. This improves our results quite a bit (from 32% to 39% correct), although our dataset might still be a bit too hard for BERT right now (the best result we got is around 50%).

Hope this helps at least somewhat.

jind11 commented 5 years ago

@MatthijsRijlaarsdam Hi, thanks for your answer. Did you average the logits obtained from each snippet and then predict the answer with the highest logit? In that case, is it possible that the improvement comes from ensemble effects?

MatthijsRijlaarsdam commented 5 years ago

No. Our original dataset had input texts and multiple questions per input text. We fed every (question, input text) pair as a separate input to the network. In the split-up dataset, we created snippets from every input text and paired the questions for that input text with every snippet.

We then picked as the final answer for a given question the answer with the highest logit over all snippets from the original input text. We didn't take averages. This improves performance significantly: in the original setup we were throwing away over 99% of our data, and now we are at least considering the entire text (though not as a whole).
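A rough sketch of that max-over-snippets aggregation, assuming the per-snippet logits have already been collected as (question_id, answer_choice, logit) records; the field layout and the toy numbers are made up for illustration:

```python
import collections

# Hypothetical per-snippet predictions: each entry is
# (question_id, answer_choice, logit) produced by running BERT on one
# snippet of the original document paired with that question and answer.
snippet_predictions = [
    ("q1", "A", 0.3), ("q1", "B", 1.7), ("q1", "A", 2.1),  # from different snippets
    ("q2", "C", -0.4), ("q2", "D", 0.9),
]

# For every question, keep the single highest logit seen for each answer
# across all snippets, then pick the answer with the overall maximum.
best_logit = collections.defaultdict(dict)
for question_id, answer, logit in snippet_predictions:
    prev = best_logit[question_id].get(answer, float("-inf"))
    best_logit[question_id][answer] = max(prev, logit)

final_answers = {
    question_id: max(answer_logits, key=answer_logits.get)
    for question_id, answer_logits in best_logit.items()
}
print(final_answers)  # {'q1': 'A', 'q2': 'D'}
```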

candalfigomoro commented 5 years ago

@billiebaechan I have some ideas...

1) Let's start from the simplest one. Do you think that creating a document embedding by averaging all the word embeddings is going to work? Then you could train a classifier using the document embeddings (see the sketch after this list).

2) Train a Doc2Vec PV-DM model to create document embeddings (instead of averaging), using non-trainable BERT embeddings for the word vectors (you just train the document matrix).

3) Train a stateful recurrent autoencoder (e.g. a stateful LSTM autoencoder) to create document embeddings. Reset the state after every different document. Then you could train a classifier using the document embeddings.

4) Use the same label for the windows (of the sliding window) with a stateful RNN (e.g. a stateful LSTM) classifier. Reset the state after every different document.

5) Take all the word embeddings for all the documents. Keep track of the document class for every word embedding. Apply dimensionality reduction if necessary... it's probably necessary (e.g. UMAP, which also lets you fit on some points and project new points into the same space). Apply a clustering algorithm (e.g. HDBSCAN) to the embeddings. You will get some clusters. Because you tracked the document class for every word embedding, you can now see that some clusters contain words mostly belonging to a specific document class (hopefully). Take the most discriminating clusters (clusters made of class-unbalanced points). Use the number (count) of words belonging to discriminating clusters as features for a classifier (e.g. one document could contain 100 words belonging to a cluster, another document only 5). If you use HDBSCAN, you can use approximate_predict() to assign a cluster to new words.

Just some brainstorming...
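A minimal sketch of idea 1 above (mean-pooling BERT word embeddings into a document embedding and training a classifier on top). The random arrays stand in for real BERT features, and the use of scikit-learn's LogisticRegression is just one convenient choice:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def document_embedding(token_embeddings):
    """Mean-pool a (num_tokens, hidden_size) array of BERT token embeddings
    into a single fixed-size document vector (idea 1 above)."""
    return np.asarray(token_embeddings).mean(axis=0)


# Toy stand-ins: in practice these would come from running extract_features.py
# (or any BERT encoder) over each document, window by window if it is long.
rng = np.random.default_rng(0)
docs = [rng.normal(size=(n_tokens, 768)) for n_tokens in (120, 340, 90, 510)]
labels = [0, 1, 0, 1]

X = np.stack([document_embedding(d) for d in docs])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X[:2]))
```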

TianrenWang commented 5 years ago

@MatthijsRijlaarsdam

> we train it on the RACE dataset, which has sequences that are shorter than 512.

What do you mean by the sequences being shorter than 512? I just took the longest article from that dataset's training set, tokenized it, and the resulting list has a length of roughly 1300. Even if you don't tokenize it (which wouldn't make any sense), the article is at least 800 words long.
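For what it's worth, tokenized lengths can be checked with this repository's own tokenizer along these lines (the vocab path is a placeholder and the article string is a stand-in):

```python
import tokenization  # tokenization.py from this repository

# Placeholder path: point this at the vocab.txt shipped with the checkpoint you use.
tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt", do_lower_case=True)

article = "..."  # one RACE article (stand-in)
tokens = tokenizer.tokenize(article)
# WordPiece usually produces more tokens than there are whitespace-separated words.
print(len(tokens))
```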

Adherer commented 5 years ago

mark