Right now we have a pickle hierarchy structure like this:
[patient]/pickles/embeddings/[model-name]/full/base_df.pkl
and
[patient]/pickles/embeddings/[model-name]/full/[cnxt_len]/layer#.pkl
However, with Whisper and MLM models, we have different types of extracted embeddings. For Whisper, we have encoder only, decoder only, input shuffling, etc. For MLM, we have masked/unmasked and different input types (left contexts, right contexts, per utterance/sentence). Where should we differentiate these? The different types of embeddings should theoretically all share the same base_df.pkl, so maybe we do it at the cnxt_len level? So instead of a number for cnxt_len, it would be more like a string for embedding_type or something? But it feels very messy this way.
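To make the option above concrete, here is a minimal sketch (all function and variant names hypothetical, just for discussion): the [cnxt_len] directory slot holds either a context length as before, or an embedding-type string, while base_df.pkl stays shared one level up.

```python
from pathlib import Path

def layer_pickle_path(patient, model_name, variant, layer):
    """Build a layer pickle path where the [cnxt_len] level is overloaded:
    `variant` is either a context length (e.g. 512) or an embedding-type
    string (e.g. "encoder", "masked-left"). Names here are hypothetical."""
    return Path(patient, "pickles", "embeddings", model_name, "full",
                str(variant), f"layer{layer}.pkl")

def base_df_path(patient, model_name):
    """base_df.pkl is shared across all variants, one level above them."""
    return Path(patient, "pickles", "embeddings", model_name, "full",
                "base_df.pkl")
```

This keeps the current layout backward compatible for numeric cnxt_len, but the mixed meaning of that directory level is exactly the messiness worth discussing.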
Just writing it down here so we remember to talk about it next meeting. I have all the code for Whisper and the old code for MLMs in the dev branch now. Will do the Whisper replication and start cleaning the code a bit.