Open hvgazula opened 1 year ago
Huh, good question. Is it possible to have two values for one patient key in the dict? I think 2 convos for 676 only fail for specific groups of models (seq-2-seq and per-utterance MLM models), so we could set that up specifically.
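A rough sketch of what "two values for one patient key" could look like, assuming the dict maps a patient ID to its expected conversation count. The name `EXPECTED_CONVS`, the helper `has_expected_convs`, and all of the counts below are placeholders made up for illustration, not the real values from the repo:

```python
# Hypothetical mapping: patient ID -> every conversation count we accept.
# The numbers are placeholders, NOT the real counts for these patients.
EXPECTED_CONVS = {
    "625": (10,),
    "676": (19, 20),  # two acceptable counts for one patient key
    "7170": (5,),
}


def has_expected_convs(sid: str, num_found: int) -> bool:
    """Return True if num_found is an accepted conversation count for sid."""
    return num_found in EXPECTED_CONVS[sid]
```

With this shape, the existing check could become something like `assert has_expected_convs(sid, len(files))`, and the model-specific case is just a second entry in the tuple.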
Another, simpler method I can think of is to just not assert the number of conversations: if the conversation count does not fit the patient, print out a message but still do the concatenation anyway.
I'm not too fond of the latter as it will not tell us if, say, some conversations (jobs) were still pending (worst-case).
Seems like we can do away with the `num_convs` check now that we are indexing. In addition, we can perform a "merge test" on the concatenated embeddings and the base df to ensure that all embeddings align with something in the base_df, and then print out a warning if any conversations don't have embeddings.
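Something along these lines could serve as the "merge test". This is only a sketch: the function name `merge_test` and the key columns `conversation_id` and `index` are assumptions about how the two dataframes line up, so adjust them to whatever the real columns are.

```python
import warnings

import pandas as pd


def merge_test(base_df, emb_df, keys=("conversation_id", "index")):
    """Check concatenated embeddings against the base dataframe.

    Raises if any embedding row has no matching row in base_df, and warns
    about conversations in base_df that have no embeddings at all.
    Column names are assumed for illustration.
    """
    keys = list(keys)
    # Left-merge embeddings onto the base keys; `_merge` records the match.
    merged = emb_df[keys].merge(
        base_df[keys].drop_duplicates(), on=keys, how="left", indicator=True
    )
    unmatched = merged[merged["_merge"] == "left_only"]
    if len(unmatched) > 0:
        raise ValueError(f"{len(unmatched)} embedding rows not found in base_df")

    # Conversations present in base_df but absent from the embeddings,
    # e.g. jobs that were still pending when we concatenated.
    missing = sorted(
        set(base_df["conversation_id"]) - set(emb_df["conversation_id"])
    )
    if missing:
        warnings.warn(f"Conversations without embeddings: {missing}")
    return missing
```

This keeps the hard failure for embeddings that don't belong anywhere, while a missing conversation only produces a warning, so pending jobs are still visible without blocking the concatenation.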
@VeritasJoker How do you propose we handle this condition https://github.com/hassonlab/247-pickling/blob/f36f764f252db99b4dfb225b0931c231e3e8892f/scripts/tfsemb_concat.py#L30 if we go the route of https://github.com/hassonlab/247-pickling/blob/main/scripts/tfspkl_config.py#L11-L30?