zkokaja closed this issue 1 year ago
There are multiple options to handle this, but first we should probably compute some stats on how many words get tokenized into multiple tokens:
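As a starting point, here is a minimal sketch of such a count. It assumes a HuggingFace tokenizer and a plain word list standing in for the datum; the model name and word source are placeholders.

```python
# Sketch only: counts how many datum words split into multiple tokens.
from collections import Counter

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
words = ["hello", "neuroscience", "tokenization"]  # stand-in for datum words

# Note: for GPT-2 a leading space changes tokenization, so real stats
# should tokenize words as they appear in context.
counts = Counter(len(tokenizer.tokenize(" " + w)) for w in words)
n_multi = sum(c for n_tokens, c in counts.items() if n_tokens > 1)
print(f"tokens-per-word distribution: {dict(counts)}")
print(f"{n_multi}/{len(words)} words tokenize into multiple tokens")
```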
Waiting on https://github.com/hassonlab/247-pickling/issues/141. For now we can only test `token_is_root`.
Ken's results show no significant difference (the differences are very small) between the 4 methods above. I suggest we use mean, first, or last, since each gives us one embedding per word and makes it easier to align models.
Let's go with option 2, averaging.
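For reference, a minimal sketch of the three one-embedding-per-word options, where `token_embs` is a hypothetical array of subword embeddings for a single word:

```python
import numpy as np

# Hypothetical (n_tokens, emb_dim) array of subword embeddings for one word.
token_embs = np.random.randn(3, 768)

word_emb_mean = token_embs.mean(axis=0)  # option 2: average over subwords
word_emb_first = token_embs[0]           # first subword token
word_emb_last = token_embs[-1]           # last subword token
```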
We currently do not filter out repeated tokens of the same word from the encoding. This means that if a word with one onset is tokenized into 3 tokens, the encoding model will see the same signal paired with 3 different embeddings. This departs from how we did encoding before, so we need to discuss, and maybe test, how it works with and without the repetitions. Note that if you choose to align with GloVe, this won't happen.
https://github.com/hassonlab/247-encoding/blob/03e73d281600e34e8a584025f0895b0e2aa93d69/scripts/tfsenc_read_datum.py#L214-L215
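If we do decide to collapse the repetitions, here is a hedged sketch of what that could look like. The column names (`adjusted_onset`, `embeddings`) are assumptions and may not match the actual datum schema in `tfsenc_read_datum.py`:

```python
# Hedged sketch: collapse repeated subword rows of a word into one row whose
# embedding is the mean over its subword embeddings. The column names
# ("adjusted_onset", "embeddings") are assumptions, not the actual schema.
import numpy as np
import pandas as pd

def collapse_subword_rows(datum: pd.DataFrame) -> pd.DataFrame:
    # Average the subword embeddings of each word (grouped by its onset,
    # assuming one onset per word).
    mean_embs = {
        onset: np.mean(np.stack(group["embeddings"].to_list()), axis=0)
        for onset, group in datum.groupby("adjusted_onset", sort=False)
    }
    # Keep the first subword row's metadata as the word's single row.
    collapsed = datum.loc[~datum["adjusted_onset"].duplicated()].copy()
    collapsed["embeddings"] = collapsed["adjusted_onset"].map(mean_embs)
    return collapsed
```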