embedding pickle hierarchy

hassonlab / 247-pickling

Contains code to create pickles from raw/processed data


embedding pickle hierarchy #65

Closed hvgazula closed 1 year ago

hvgazula commented 2 years ago

added jobname in makefile
delete embeddings after concat

hvgazula commented 2 years ago

@VeritasJoker Here's what the hierarchy will look like. Let me know what you think. Will get the base df in a bit.

hvgazula commented 2 years ago

To be honest, I really don't like the hierarchy being pickles --> embeddings --> full/trimmed --> model --> cnxt --> layer. I would prefer to swap model and full/trimmed, giving pickles --> embeddings --> model --> full/trimmed --> cnxt --> layer. I may revert to that structure as soon as I have a clearer idea of Zaid's trimming in concat.py.
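For reference, the two orderings being compared can be sketched with `pathlib`. This is only an illustration: the function names, the `cnxt_` and `layer_` prefixes, and the example model name are assumptions, not the naming the repo actually uses.

```python
from pathlib import Path


def path_trim_first(root, model, trim, cnxt_len, layer):
    """pickles --> embeddings --> full/trimmed --> model --> cnxt --> layer."""
    return (
        Path(root) / "embeddings" / trim / model
        / f"cnxt_{cnxt_len}" / f"layer_{layer}.pkl"
    )


def path_model_first(root, model, trim, cnxt_len, layer):
    """pickles --> embeddings --> model --> full/trimmed --> cnxt --> layer."""
    return (
        Path(root) / "embeddings" / model / trim
        / f"cnxt_{cnxt_len}" / f"layer_{layer}.pkl"
    )


# Hypothetical values, used only to show where each component lands.
trim_first = path_trim_first("pickles", "gpt2-xl", "full", 1024, 48)
model_first = path_model_first("pickles", "gpt2-xl", "full", 1024, 48)
```

With the model-first ordering, everything belonging to one model sits under a single directory, which makes it simpler to copy or delete one model's embeddings as a unit.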

hvgazula commented 2 years ago

Please review https://github.com/hassonlab/247-pickling/pull/68 first before reviewing this PR

hvgazula commented 2 years ago

I am now wondering what the merit is in separating the base dataframe (the input to the embedding generation functions) from the embeddings of each layer (stored hierarchically). How is this decoupling advantageous over having the embeddings from all layers in one file? If saving space was the motivation, the latter (which I wrote in the past) would suffice. If memory use when loading files is the issue, we might as well investigate other storage formats.
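To make the trade-off concrete, here is a minimal sketch of the two storage options using only the standard library. The toy data, file names, and helper functions are all hypothetical, and plain lists stand in for the real embedding arrays.

```python
import pickle
import tempfile
from pathlib import Path

# Toy stand-in for real embeddings: layer index -> one vector per token.
embeddings = {layer: [[float(layer), 0.0], [float(layer), 1.0]] for layer in range(3)}


def save_all_layers(path, layer_to_emb):
    """Option A: a single pickle that holds every layer."""
    with open(path, "wb") as f:
        pickle.dump(layer_to_emb, f)


def save_per_layer(directory, layer_to_emb):
    """Option B: one pickle per layer."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for layer, emb in layer_to_emb.items():
        with open(directory / f"layer_{layer:02d}.pkl", "wb") as f:
            pickle.dump(emb, f)


def load_layer_from_all(path, layer):
    """Option A read: the whole file is unpickled to reach one layer."""
    with open(path, "rb") as f:
        return pickle.load(f)[layer]


def load_layer_from_dir(directory, layer):
    """Option B read: only the requested layer's file is unpickled."""
    with open(Path(directory) / f"layer_{layer:02d}.pkl", "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    save_all_layers(Path(tmp) / "all_layers.pkl", embeddings)
    save_per_layer(Path(tmp) / "layers", embeddings)
    from_single = load_layer_from_all(Path(tmp) / "all_layers.pkl", 1)
    from_split = load_layer_from_dir(Path(tmp) / "layers", 1)
```

Both options return the same data. The difference is that option A has to unpickle every layer to read one, while option B leaves more files to keep track of.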

hvgazula commented 2 years ago

This looks like a good article to start with: https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d

VeritasJoker commented 2 years ago

Yeah, that is a good point. The advantage I can think of is that smaller, separate files give us more flexibility. If we already have some pickles and want to add new ones or modify existing ones, we won't need to spend time loading and saving one huge file. It would also be easier to pass files around. For example, we don't need to pass the whole file to Mariano if he just needs one type of embedding. Changing the hierarchy could be complicated, but modifying columns is probably just as complicated.

For the hierarchy, I like pickles --> embeddings --> models --> full/trimmed --> cnxt_len --> layer files. I don't think we need a folder for layer files though.

hvgazula commented 2 years ago

I wasn't suggesting having all embedding "types" in one file. Rather, having all "layers" embeddings in one file.

hvgazula commented 2 years ago

Moving this to the discussion board for the next codebase meeting.

VeritasJoker commented 2 years ago

Oh yeah, I like that. Then we don't need context length and layer folders anymore. Under full/trimmed, there will be one file per context length, with all layers inside.
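A rough sketch of that layout, assuming the model-first ordering and one pickle per context length that maps layer index to embeddings. The helper names, the `cnxt_` file prefix, the model name, and the toy values are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def cnxt_file(root, model, trim, cnxt_len):
    """One file per context length, directly under model/full-or-trimmed."""
    return Path(root) / "embeddings" / model / trim / f"cnxt_{cnxt_len:04d}.pkl"


def save_cnxt(root, model, trim, cnxt_len, layer_to_emb):
    """Write all layers for one context length into a single pickle."""
    path = cnxt_file(root, model, trim, cnxt_len)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(layer_to_emb, f)
    return path


def load_layer(root, model, trim, cnxt_len, layer):
    """Read the context-length file and pick out one layer."""
    with open(cnxt_file(root, model, trim, cnxt_len), "rb") as f:
        return pickle.load(f)[layer]


# A temporary directory stands in for the top-level pickles folder.
with tempfile.TemporaryDirectory() as root:
    saved = save_cnxt(root, "gpt2-xl", "full", 1024, {0: [0.1, 0.2], 1: [0.3, 0.4]})
    relative = saved.relative_to(root).as_posix()
    layer_one = load_layer(root, "gpt2-xl", "full", 1024, 1)
```

The context-length and layer folders disappear, and the file name alone identifies the context length.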