hassonlab / 247-pickling

Contains code to create pickles from raw/processed data

Revisit pickle hierarchy for whisper #161

Open VeritasJoker opened 1 year ago

VeritasJoker commented 1 year ago

Right now we have a pickle hierarchy structure like this:

[patient]/pickles/embeddings/[model-name]/full/base_df.pkl
[patient]/pickles/embeddings/[model-name]/full/[cnxt_len]/layer#.pkl
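For reference, here is a minimal sketch of how paths in the current hierarchy could be built. The function names, the example patient ID and model name, and the exact `layer{N}.pkl` formatting are assumptions made for illustration; they are not taken from the repo.

```python
from pathlib import PurePosixPath


def embedding_root(patient: str, model_name: str) -> PurePosixPath:
    """Directory that holds base_df.pkl and the per-context-length folders."""
    return PurePosixPath(patient, "pickles", "embeddings", model_name, "full")


def base_df_path(patient: str, model_name: str) -> PurePosixPath:
    """Path to the shared base dataframe for one patient/model pair."""
    return embedding_root(patient, model_name) / "base_df.pkl"


def layer_path(patient: str, model_name: str, cnxt_len: int, layer: int) -> PurePosixPath:
    """Path to one layer's embeddings at a given context length.

    Assumption: "layer#.pkl" is rendered as layer<N>.pkl with no padding.
    """
    return embedding_root(patient, model_name) / str(cnxt_len) / f"layer{layer}.pkl"


# Hypothetical patient ID and model name, for illustration only.
print(base_df_path("patient01", "some-model"))
print(layer_path("patient01", "some-model", 1024, 12))
```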

However, Whisper and MLM models produce several different types of extracted embeddings. For Whisper, we have encoder-only, decoder-only, input shuffling, etc. For MLMs, we have masked/unmasked and several input types (left context, right context, per utterance/sentence). Where should we differentiate these? The different embedding types should theoretically all share the same base_df.pkl, so maybe we differentiate at the cnxt_len level? Instead of a number for cnxt_len, that directory name would be a string such as embedding_type. But it feels very messy this way.
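To make the option concrete for discussion, here is a rough sketch of the idea floated above: the directory that currently holds cnxt_len accepts either a number or a string label, while base_df.pkl stays shared. The class name `EmbeddingSpec`, the `variant` field, the labels "encoder-only" and "decoder-only", and the `layer{N}.pkl` formatting are all hypothetical; this is one possible layout, not a decided design.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Union


@dataclass(frozen=True)
class EmbeddingSpec:
    patient: str
    model_name: str
    # Today this level is a numeric cnxt_len; under the idea discussed here
    # it could instead be a string embedding_type (hypothetical labels below).
    variant: Union[int, str]

    @property
    def root(self) -> PurePosixPath:
        return PurePosixPath(self.patient, "pickles", "embeddings", self.model_name, "full")

    @property
    def base_df(self) -> PurePosixPath:
        # Does not depend on the variant, so all embedding types share it.
        return self.root / "base_df.pkl"

    def layer(self, layer: int) -> PurePosixPath:
        # Assumption: "layer#.pkl" is rendered as layer<N>.pkl.
        return self.root / str(self.variant) / f"layer{layer}.pkl"


enc = EmbeddingSpec("patient01", "whisper", "encoder-only")
dec = EmbeddingSpec("patient01", "whisper", "decoder-only")
print(enc.layer(4))
print(dec.layer(4))
print(enc.base_df == dec.base_df)
```

One thing the sketch makes visible: once the level holds a string, a numeric context length for the same embedding type has nowhere to go unless it is folded into the label, which may be part of why this feels messy.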

Just writing this down so we remember to discuss it at the next meeting. All the Whisper code and the old MLM code are in the dev branch now. I will do the Whisper replication and start cleaning up the code a bit.