johnmwu / contextual-corr-analysis


Would you mind sharing how to keep sequence length aligned between models? #4

Closed Superhzf closed 3 years ago

Superhzf commented 3 years ago

Hi,

I'm sorry for the many questions. My advisor asked me to implement this paper.

Since the analysis involves matrix multiplication, I am wondering how you guarantee that the sequence length is the same for the same sentence across different models.

For example, if the sentence is as simple as 's, then BERT treats it as two pieces, segmented_tokens=["'", 's'], while GPT-2 treats it as a single piece, segmented_tokens=["'s"].

Another possible issue is that BERT uses special tokens like [CLS] and [SEP] while GPT-2 does not. Do you remove those special tokens?
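
For concreteness, here is a tiny illustration of the mismatch using the HuggingFace tokenizers (the exact splits may depend on the tokenizer version):

```python
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-cased")
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

sent = "'s"

# BERT splits the apostrophe off, while GPT-2 keeps "'s" as a single piece,
# so the raw token sequences already have different lengths.
print(bert_tok.tokenize(sent))   # e.g. ["'", "s"]
print(gpt2_tok.tokenize(sent))   # e.g. ["'s"]

# On top of that, BERT's encoding adds [CLS]/[SEP] while GPT-2's does not.
print(bert_tok(sent)["input_ids"])  # two extra ids for the special tokens
print(gpt2_tok(sent)["input_ids"])
```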

I sincerely appreciate your reply!

boknilev commented 3 years ago

We assume that sentences are sequences of space-delimited words. For any word with a subword segmentation, we aggregate the representations of its subword units into one representation. We took the representation of the last subword unit, but other aggregations are possible. These representations should be in HDF5 format.

We ignore special tokens, that is, they are not part of the input.

We had code for doing this for the different models in another branch. See this file: get_transformer_representations.py. But keep in mind it was written against an older version of the HuggingFace library and might be obsolete. Also, there might be better ways to align subword tokenizations these days.
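
For example, with a recent HuggingFace fast tokenizer you could sketch the alignment roughly like this (not the repo's exact code; the model name, the choice of the last layer, and the HDF5 layout below are placeholders and may need adjusting to what the analysis scripts expect):

```python
import h5py
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)

def last_subword_representations(sentence):
    """Return one vector per space-delimited word, taking the last subword unit."""
    words = sentence.split()
    encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoding).hidden_states[-1][0]  # (num_subwords, dim)
    word_ids = encoding.word_ids()  # None for special tokens such as [CLS]/[SEP]
    last_index = {}
    for pos, wid in enumerate(word_ids):
        if wid is not None:          # special tokens are simply dropped
            last_index[wid] = pos    # keep overwriting, so the last subword wins
    return torch.stack([hidden[last_index[i]] for i in range(len(words))])

# One dataset per sentence, keyed by sentence index; the exact HDF5 layout the
# analysis code expects (e.g. storing all layers) may differ from this sketch.
with h5py.File("bert_representations.hdf5", "w") as f:
    for idx, sent in enumerate(["The cat 's toy .", "Another sentence ."]):
        f.create_dataset(str(idx), data=last_subword_representations(sent).numpy())
```

The word_ids() bookkeeping is what makes each model produce exactly one vector per space-delimited word, so sequence lengths line up across models regardless of how each tokenizer segments the words.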

Superhzf commented 3 years ago

@boknilev That works! Thank you so much!