hassonlab / 247-pickling

Contains code to create pickles from raw/processed data

Create map for conversation count #135

Open hvgazula opened 1 year ago

hvgazula commented 1 year ago

@VeritasJoker How do you propose we handle this condition https://github.com/hassonlab/247-pickling/blob/f36f764f252db99b4dfb225b0931c231e3e8892f/scripts/tfsemb_concat.py#L30 in

CONVERSATIONS_MAP = {
    "podcast": dict.fromkeys(
        SUBJECTS["podcast"],
        1,
    ),
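    # dict.fromkeys would assign the whole list to every subject; zip pairs each subject with its own count
    "tfs": dict(zip(["625", "676", "7170", "798"], [54, 78, 24, 15])),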
}

if we go the route of https://github.com/hassonlab/247-pickling/blob/main/scripts/tfspkl_config.py#L11-L30?

VeritasJoker commented 1 year ago

Huh, good question. Is it possible to store two values for one patient key in the dict? I think 2 convos for 676 only fail for specific groups of models (seq-2-seq and per-utterance MLM models), so we can set that up specifically?

Another, simpler method I can think of is to just not assert the number of conversations: if the conversation count does not match what is expected for the patient, print a warning but still do the concatenation anyway.
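
A minimal sketch of the first option (two values for one patient key), assuming each subject maps to a tuple of accepted conversation counts. The names `ALLOWED_CONVERSATIONS` and `check_conversation_count` are hypothetical, the counts assume the subjects and numbers in the snippet above pair up by position, and the second value for 676 is a made-up placeholder rather than a confirmed number:

```python
# Hypothetical mapping: subject -> tuple of accepted conversation counts.
# Counts assume positional pairing of the subjects and numbers quoted above.
# The second value for "676" (76) is an assumed placeholder, not a confirmed count.
ALLOWED_CONVERSATIONS = {
    "625": (54,),
    "676": (78, 76),
    "7170": (24,),
    "798": (15,),
}


def check_conversation_count(subject: str, num_found: int) -> bool:
    """Return True if num_found is one of the accepted counts for subject."""
    # Unknown subjects get an empty tuple, so the check fails for them.
    return num_found in ALLOWED_CONVERSATIONS.get(subject, ())
```

The assert in the concat script could then test membership instead of equality, so an unexpected count still fails loudly while both known counts for 676 pass.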

hvgazula commented 1 year ago

I'm not too fond of the latter as it will not tell us if, say, some conversations (jobs) were still pending (worst-case).

zkokaja commented 1 year ago

Seems like we can do away with the num_convs check now that we are indexing. But in addition, we can perform a "merge test" on concatenated embeddings and the base_df to ensure that all embeddings align with something in the base_df, and then we can print out a warning if conversations don't have embeddings.
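
A rough sketch of what such a merge test could look like with pandas, assuming both frames share key columns that identify a row. The column names `conversation_id` and `word_idx`, the helper `find_unmatched`, and the toy data are all hypothetical and not taken from the repo:

```python
import pandas as pd


def find_unmatched(base_df: pd.DataFrame, emb_df: pd.DataFrame, keys: list):
    """Outer-merge on the key columns and report rows present on only one side."""
    merged = base_df[keys].drop_duplicates().merge(
        emb_df[keys].drop_duplicates(), on=keys, how="outer", indicator=True
    )
    # Embedding rows that align with nothing in base_df (expected to be empty).
    orphans = merged[merged["_merge"] == "right_only"]
    # Base rows without embeddings, e.g. conversations whose jobs are still pending.
    missing = merged[merged["_merge"] == "left_only"]
    return orphans, missing


# Toy data (assumed layout): conversation 3 has no embeddings yet.
base_df = pd.DataFrame({"conversation_id": [1, 1, 2, 3], "word_idx": [0, 1, 0, 0]})
emb_df = pd.DataFrame({"conversation_id": [1, 1, 2], "word_idx": [0, 1, 0]})

orphans, missing = find_unmatched(base_df, emb_df, ["conversation_id", "word_idx"])
assert orphans.empty, "some embeddings do not align with base_df"
if not missing.empty:
    missing_convs = sorted(missing["conversation_id"].unique().tolist())
    print(f"Warning: conversations without embeddings: {missing_convs}")
```

The hard failure is reserved for embeddings that match nothing in the base frame, while missing conversations only produce a warning, which still surfaces pending jobs without blocking the concatenation.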