Misalignment when concatenating base and embedding pickles

VeritasJoker commented 1 year ago

So there is a problem brought by pickle hierarchy that we should solve. It never came up because it only happens for 676 blenderbot (and now bert models).

In bbot/bert, we are limiting utterance length to 512, which means removing two conversations totally in 676 (convo 38 and 39, each containing a single utterance longer than 512). Thus, we won't have embeddings for these two conversations (The concatenation script accommodates by specifying 76 total convos instead of 78) but we would still have those two conversations in the base df.

Since now pickle hierarchy stores base df and embedding df separately, we need to address this problem when concatenating base df and embedding df in encoding. There are three different ways I could think of:

In pickling, we store index for embedding df as well. So when we concatenate in encoding we can merge by the index.
In pickling, we delete these two conversations in the base df.
In encoding, we delete these two conversations in the base df before concatenating.

I would prefer 1 because it seems cleaner but I would be fine with 2 as well (although we need some script to regenerate or modify base_df after generate-embedding fails). For now, I can do 3 to get the results but 3 feels more like a hack.

VeritasJoker commented 1 year ago

Actually 2 and 3 won't work at all because we are removing some other utterances in other convos as well. So it has to be some sort of indexing stored in embedding df.