hassonlab / 247-encoding

Contains python scripts for performing encoding on 247 data.

Choose words or embeddings by default #60

Closed. zkokaja closed this issue 1 year ago.

zkokaja commented 1 year ago

We currently do not filter out multiple tokens of the same word before encoding. This means that if a word with a single onset tokenizes into, say, 3 tokens, the encoding model sees the same neural signal paired with 3 different embeddings. This departs from how we did encoding previously, so we need to discuss and perhaps test how the model behaves with and without the repetitions (see the toy example below). Note that if you choose to align with GloVe, this won't happen.

https://github.com/hassonlab/247-encoding/blob/03e73d281600e34e8a584025f0895b0e2aa93d69/scripts/tfsenc_read_datum.py#L214-L215
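
For concreteness, a toy illustration of the splitting (this assumes a Hugging Face GPT-2 tokenizer; the actual tokenizer depends on which embedding model was used in the pickling step):

```python
from transformers import AutoTokenizer

# Hypothetical example: a long word splits into several sub-word tokens.
# In the current datum, each of those tokens keeps the same onset (same
# neural signal) but carries its own contextual embedding.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.tokenize("uncharacteristically")
print(len(tokens), tokens)  # multiple tokens for a single word onset
```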

zkokaja commented 1 year ago

There are multiple options to handle this, but first we should probably compute some stats on how many words get tokenized into multiple tokens (a sketch follows the list):

  1. include all tokens
  2. average token embeddings per word
  3. take first, or last embedding per word
  4. remove any word that gets multiple tokens
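
A minimal sketch of those stats, assuming the datum is a pandas DataFrame with one row per token and hypothetical `word` and `adjusted_onset` columns (the real column names in `tfsenc_read_datum.py` may differ):

```python
import pandas as pd

def multi_token_stats(datum: pd.DataFrame) -> pd.Series:
    """Count how many word occurrences split into 1, 2, 3, ... tokens.

    Assumes one row per token and that (word, adjusted_onset) uniquely
    identifies a word occurrence; both column names are assumptions.
    """
    tokens_per_word = datum.groupby(["word", "adjusted_onset"]).size()
    return tokens_per_word.value_counts().sort_index()

# Index = number of tokens a word produced, value = how many word
# occurrences produced that many tokens, e.g.:
# print(multi_token_stats(datum))
```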
VeritasJoker commented 1 year ago

Waiting on https://github.com/hassonlab/247-pickling/issues/141. We can only test `token_is_root` for now.

zkokaja commented 1 year ago

Ken's results show no significant difference (the differences are very small) between the 4 methods above. I suggest we use either mean, first, or last, since each gives us one embedding per word and makes it easier to align models.
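
A sketch of options 2 and 3 under the same assumptions as above (hypothetical `word`, `adjusted_onset`, and `embeddings` columns, with one embedding vector per token row):

```python
import numpy as np
import pandas as pd

def collapse_tokens(datum: pd.DataFrame, method: str = "mean") -> pd.DataFrame:
    """Collapse multi-token words into one row per word occurrence.

    method: "mean" averages the token embeddings (option 2);
            "first" / "last" keep a single token's embedding (option 3).
    Column names are assumptions, not necessarily the repo's schema.
    """
    rows = []
    for _, group in datum.groupby(["word", "adjusted_onset"], sort=False):
        stacked = np.vstack(group["embeddings"].to_list())
        if method == "mean":
            embedding = stacked.mean(axis=0)
        elif method == "first":
            embedding = stacked[0]
        else:  # "last"
            embedding = stacked[-1]
        row = group.iloc[0].copy()     # keep the first token's metadata
        row["embeddings"] = embedding  # one embedding per word occurrence
        rows.append(row)
    return pd.DataFrame(rows).reset_index(drop=True)
```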

zkokaja commented 1 year ago

Let's go with option 2, averaging.
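
In terms of the sketch above, that would mean collapsing the datum with something like the hypothetical `collapse_tokens(datum, method="mean")` before it is passed to the encoding model.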