Inside `tfsenc_read_datum.py`, we perform embedding manipulations, such as shifting embeddings or concatenating them, but only at the end of `read_datum`. This could be a problem when we filter out `non_words` and align with other models, which, for instance, reduces ~90k gpt2-xl tokens to ~60k (aligning with glove) for 625. Shifting embeddings after that filtering might harm encoding performance, since rows that are adjacent in the filtered datum are not necessarily adjacent in the original transcript.

Consider moving all embedding manipulation in front of the filtering/alignment steps?
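For illustration, a minimal sketch of the misalignment (assuming the datum is a pandas DataFrame; the `embeddings` and `is_nonword` column names here are hypothetical and may differ from what `read_datum` actually uses):

```python
import numpy as np
import pandas as pd

def shift_emb(df: pd.DataFrame, shift: int = 1) -> pd.DataFrame:
    """Give each row the embedding of the row `shift` positions earlier,
    then drop rows left without an embedding."""
    df = df.copy()
    df["embeddings"] = df["embeddings"].shift(shift)
    return df.dropna(subset=["embeddings"])

# Hypothetical datum: five tokens, one non-word filler.
datum = pd.DataFrame({
    "word": ["the", "uh", "cat", "sat", "down"],
    "is_nonword": [False, True, False, False, False],
    "embeddings": [np.full(3, float(i)) for i in range(5)],
})

# Current order (filter, then shift): after "uh" is dropped,
# "cat" inherits the embedding of "the", which is two tokens
# away in the original transcript.
filter_then_shift = shift_emb(datum[~datum["is_nonword"]])

# Proposed order (shift, then filter): the shift happens on the
# full token sequence, so every surviving word keeps the embedding
# of its true neighbor in the transcript.
shift_then_filter = shift_emb(datum)
shift_then_filter = shift_then_filter[~shift_then_filter["is_nonword"]]
```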