hassonlab / 247-encoding

Contains python scripts for performing encoding on 247 data.

Embedding manipulation should be before filtering #76

Open VeritasJoker opened 1 year ago

VeritasJoker commented 1 year ago

Inside tfsenc_read_datum.py, we do embedding manipulations, such as shifting or concatenating embeddings, at the end of read_datum. This might be a problem because by that point we have already filtered out non_words and aligned with other models, which for instance reduces 90k gpt2-xl tokens to 60k (aligning with glove) for 625. If we shift embeddings after that, a shifted embedding may come from a token that was not adjacent in the original sequence, which might harm encoding performance.

Consider moving all embedding manipulation before the filtering step?
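
To illustrate the concern, here is a minimal toy sketch of how the order of shifting and filtering can change which embedding a word ends up with. It does not use the actual code in tfsenc_read_datum.py: the datum rows, the integer "embeddings", and the helpers `shift_embeddings`, `filter_rows`, and `pairs` are all hypothetical and exist only for this example.

```python
# Toy datum: each row is (token, embedding id, kept_after_filtering).
# All names and values are hypothetical, not taken from tfsenc_read_datum.py.
datum = [
    ("the", 0, True),
    ("uh", 1, False),   # non-word, dropped by filtering
    ("cat", 2, True),
    ("##s", 3, False),  # dropped when aligning with another model
    ("sat", 4, True),
]


def shift_embeddings(rows, n=1):
    """Give each row the embedding of the row n positions earlier (None if absent)."""
    embs = [emb for _, emb, _ in rows]
    shifted = [None] * min(n, len(embs)) + embs[: max(len(embs) - n, 0)]
    return [(tok, new_emb, keep) for (tok, _, keep), new_emb in zip(rows, shifted)]


def filter_rows(rows):
    """Keep only rows that survive non-word filtering / model alignment."""
    return [row for row in rows if row[2]]


def pairs(rows):
    """Drop the keep flag so the two orderings are easy to compare."""
    return [(tok, emb) for tok, emb, _ in rows]


# Order 1: shift on the full token sequence, then filter.
shift_then_filter = pairs(filter_rows(shift_embeddings(datum)))
# Order 2: filter first, then shift on the reduced sequence.
filter_then_shift = pairs(shift_embeddings(filter_rows(datum)))

print(shift_then_filter)  # [('the', None), ('cat', 1), ('sat', 3)]
print(filter_then_shift)  # [('the', None), ('cat', 0), ('sat', 2)]
```

In this toy case, shifting first gives "cat" the embedding of the token that actually preceded it ("uh"), while filtering first gives it the embedding of "the", which was two tokens back in the original sequence. Whether the real pipeline should behave like the first ordering is the open question in this issue.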