hassonlab / 247-encoding

Contains python scripts for performing encoding on 247 data.

clean up filter_datum function in tfsenc_read_datum.py #44

Closed hvgazula closed 1 year ago

hvgazula commented 1 year ago

Replace this by filtering on the in_glove50 column. @VeritasJoker, do you agree? I'll take care of this cleanup, but I wanted your thoughts first.

hvgazula commented 1 year ago

Is the snippet in this stack link any better to use?

VeritasJoker commented 1 year ago

We are already filtering on glove if glove is included in the align_with argument. For any LLM encoding, we also want model_token_is_root to hold when we are aligning with glove, which is why I added the condition there. Also, I think the current column is in_glove and not in_glove50.
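For readers without the code in front of them, the filtering described above can be sketched roughly as follows. This is a minimal illustration, not the actual filter_datum from tfsenc_read_datum.py: the function name `filter_datum_sketch`, the `is_llm` flag, the `static_name` parameter, and the toy datum are all assumptions made for the example. Only the in_glove and model_token_is_root column names and the align_with argument come from this thread.

```python
import pandas as pd


def filter_datum_sketch(df, align_with, is_llm, static_name="glove"):
    """Hypothetical stand-in for the filtering discussed in this issue."""
    # Start by keeping every row, then narrow the mask down.
    mask = pd.Series(True, index=df.index)

    # Keep only tokens present in every embedding listed in align_with,
    # assuming each one has a boolean column named in_<name>.
    for name in align_with:
        col = f"in_{name}"
        if col in df.columns:
            mask = mask & df[col].astype(bool)

    # For LLM encodings aligned with the static embedding, also require
    # that the model token is the root token of its word.
    if is_llm and static_name in align_with and "model_token_is_root" in df.columns:
        mask = mask & df["model_token_is_root"].astype(bool)

    return df[mask]


# Toy datum (made up for illustration only).
datum = pd.DataFrame(
    {
        "word": ["the", "running", "##ning", "xyzzy"],
        "in_glove": [True, True, True, False],
        "model_token_is_root": [True, True, False, True],
    }
)

filtered = filter_datum_sketch(datum, align_with=["glove"], is_llm=True)
print(filtered["word"].tolist())
```

Passing `static_name="glove50"` (with a matching in_glove50 column) would cover the renamed column discussed in this issue without touching the rest of the logic.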

VeritasJoker commented 1 year ago

But yeah, you can take care of it since it relates to whatever you are doing on the pickling side. Just note that I changed the glove50 arguments to glove in the parser / config file to fit it here, so we will need to undo all those changes if we want glove50 here.

zkokaja commented 1 year ago

If using HF for static embedding, then we'll go with that. Otherwise do glove50.

zkokaja commented 1 year ago

Change this line to glove50 as well: https://github.com/hassonlab/247-pickling/blob/main/scripts/tfsemb_LMBase.py#L57

zkokaja commented 1 year ago

Revert these to use glove50: https://github.com/hassonlab/247-encoding/blob/main/scripts/tfsenc_read_datum.py#L179 and https://github.com/hassonlab/247-encoding/blob/main/scripts/tfsenc_config.py#L62

VeritasJoker commented 1 year ago

Changed, but this still needs testing with the newest pickles.