Closed: Superhzf closed this issue 3 years ago.
We assume that sentences are sequences of space-delimited words. For any word with a subword segmentation, we aggregate the representations of the subword units into one representation. We took the representation of the last subword unit, but other aggregations are possible. These representations should be in HDF5 format.
We ignore special tokens, that is, they are not part of the input.
We had code for doing this for the different models in another branch; see `get_transformer_representations.py`. But keep in mind it's from an older version of the HuggingFace repo and might be obsolete. Also, there might be better ways to align subword tokenizations these days.
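For reference, a minimal sketch of that extraction step (not the script above; it assumes a HuggingFace fast tokenizer, whose `word_ids()` gives the subword-to-word alignment, plus `h5py` for the output file) could look like this:

```python
import h5py
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # assumption: any HuggingFace checkpoint works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModel.from_pretrained(model_name)
model.eval()

sentences = ["The quick brown fox jumps over the lazy dog ."]  # space-delimited words

with h5py.File("representations.hdf5", "w") as fout:
    for idx, sentence in enumerate(sentences):
        words = sentence.split(" ")
        # is_split_into_words=True keeps the word <-> subword alignment available
        enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]  # (num_subwords, hidden_dim)

        # One vector per word: take the *last* subword unit of each word.
        # Special tokens ([CLS], [SEP], ...) have word_id None and are skipped.
        last_subword = {}
        for pos, word_id in enumerate(enc.word_ids(0)):
            if word_id is not None:
                last_subword[word_id] = pos  # later subwords overwrite earlier ones

        word_reprs = torch.stack([hidden[last_subword[i]] for i in range(len(words))])
        # Dataset naming is an assumption; adjust it to whatever the downstream code expects.
        fout.create_dataset(str(idx), data=word_reprs.numpy())  # (num_words, hidden_dim)
```

The last-subword choice mirrors what the comment above describes; averaging the subword vectors is another common aggregation.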
@boknilev That works! Thank you so much!
Hi,
I'm sorry for the many questions. My advisor asked me to implement this paper.
For the sake of matrix multiplication, I am wondering how you guarantee that the sequence length is the same for the same sentence across different models?
For example, if the sentence is as simple as `'s`, then `bert` segments it into two tokens (`segmented_tokens=["'", 's']`), whereas `gpt2` segments it into one token (`segmented_tokens=["'s"]`).
Another possible issue is that `bert` has special tokens like `[SEP]` and `[CLS]`, but `gpt2` does not. Do you remove those special tokens?
I sincerely appreciate your reply!