hassonlab / 247-encoding

Contains python scripts for performing encoding on 247 data.
0 stars 9 forks source link

Misalignment when concatenating base and embedding pickles #50

Closed VeritasJoker closed 1 year ago

VeritasJoker commented 1 year ago

So there is a problem brought by pickle hierarchy that we should solve. It never came up because it only happens for 676 blenderbot (and now bert models).

In bbot/bert, we are limiting utterance length to 512, which means removing two conversations totally in 676 (convo 38 and 39, each containing a single utterance longer than 512). Thus, we won't have embeddings for these two conversations (The concatenation script accommodates by specifying 76 total convos instead of 78) but we would still have those two conversations in the base df.

Since now pickle hierarchy stores base df and embedding df separately, we need to address this problem when concatenating base df and embedding df in encoding. There are three different ways I could think of:

  1. In pickling, we store index for embedding df as well. So when we concatenate in encoding we can merge by the index.
  2. In pickling, we delete these two conversations in the base df.
  3. In encoding, we delete these two conversations in the base df before concatenating.

I would prefer 1 because it seems cleaner but I would be fine with 2 as well (although we need some script to regenerate or modify base_df after generate-embedding fails). For now, I can do 3 to get the results but 3 feels more like a hack.

VeritasJoker commented 1 year ago

Actually 2 and 3 won't work at all because we are removing some other utterances in other convos as well. So it has to be some sort of indexing stored in embedding df.

hvgazula commented 1 year ago

@VeritasJoker Very well written 👍. I like such detailed explanations.

zkokaja commented 1 year ago

So for 1, we need to NOT reset the base and embedding dataframe index in pickling.

zkokaja commented 1 year ago

In pickling, this can be changed here: https://github.com/hassonlab/247-pickling/blob/main/scripts/tfsemb_main.py#L546 by reusing the same df that is passed into the function or by copy its index into the new df.

zkokaja commented 1 year ago

And this line in encoding to pd.merge: https://github.com/hassonlab/247-encoding/blob/main/scripts/tfsenc_read_datum.py#L495

zkokaja commented 1 year ago

Waiting on pickling dev to be merged with main before making this change in encoding so we can regenerate all embeddings

zkokaja commented 1 year ago

Regenerate embeddings so they have the new index and then test this in encoding