Closed hvgazula closed 1 year ago
@VeritasJoker Here's what the hierarchy will look like. Let me know what you think. Will get the base df in a bit.
To be honest, I really don't like the hierarchy to be pickles --> embeddings --> full/trimmed --> model --> cnxt --> layer. Instead, I prefer swapping model and full/trimmed. I may revert to the latter structure as soon as I have an idea about Zaid's trimming in concat.py.
Please review https://github.com/hassonlab/247-pickling/pull/68 first before reviewing this PR
I am now wondering what's the merit in separating the base dataframe (input to the embedding generation functions) from the embeddings of each layer (in a hierarchical fashion). How is it (the decoupling) advantageous over having the embeddings from all layers in one file? If saving space was the motivation, the latter (which I wrote in the past) would suffice. If memory (loading files) is an issue, we might as well investigate other storage formats.
This looks like a good article to start with https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d
Yeah, that is a good point. The advantage I can think of is that smaller, separate files give us more flexibility. If we already have some pickles and want to add new ones or modify existing ones, we won't need to spend time loading and saving one huge file. It would also be easier to pass files around: we don't need to send the whole file to Mariano if he just needs one type of embedding. Changing the hierarchy could be complicated, but modifying columns is probably just as complicated.
For the hierarchy, I like pickles --> embeddings --> models --> full/trimmed --> cnxt_len --> layer files. I don't think we need a folder for layer files though.
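To make the proposed layout concrete, here is a rough sketch of how the paths could be built. The function names, the `cnxt_1024` / `layer_48.pkl` naming, and the `gpt2-xl` model name in the usage note are illustrative assumptions, not what the repo currently does.

```python
from pathlib import Path


def embedding_dir(root, model, trimmed, cnxt_len):
    """Directory for one model / trim setting / context length.

    Hypothetical layout: pickles/embeddings/<model>/<full|trimmed>/cnxt_<len>
    """
    subset = "trimmed" if trimmed else "full"
    return (
        Path(root) / "pickles" / "embeddings" / model / subset / f"cnxt_{cnxt_len:04d}"
    )


def layer_file(root, model, trimmed, cnxt_len, layer):
    # Layer files sit directly in the cnxt_len directory (no per-layer folder).
    return embedding_dir(root, model, trimmed, cnxt_len) / f"layer_{layer:02d}.pkl"
```

For example, `layer_file("results", "gpt2-xl", False, 1024, 48)` would point at `results/pickles/embeddings/gpt2-xl/full/cnxt_1024/layer_48.pkl`.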
I wasn't suggesting having all embedding "types" in one file. Rather, having all "layers" embeddings in one file.
Moving this to the discussion board for the next codebase meeting.
Oh yeah, I like that. Then we don't need context length and layer folders anymore. Under full/trimmed, there will be one file per context length, with all layers inside.
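A minimal sketch of that one-file-per-context-length idea, assuming the embeddings are kept in a dict keyed by layer index and written with the standard `pickle` module. The helper names and the `cnxt_<len>.pkl` file name are made up for illustration.

```python
import pickle
from pathlib import Path


def save_all_layers(directory, cnxt_len, layer_embeddings):
    """Write every layer's embeddings for one context length into a single pickle.

    `layer_embeddings` is assumed to be a dict mapping layer index -> embeddings.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"cnxt_{cnxt_len:04d}.pkl"
    with open(path, "wb") as fh:
        pickle.dump(layer_embeddings, fh)
    return path


def load_layer(path, layer):
    """Return one layer's embeddings; note the whole file is still read."""
    with open(path, "rb") as fh:
        return pickle.load(fh)[layer]
```

The trade-off is the one raised earlier in the thread: there are fewer files and folders to manage, but pulling out a single layer still means loading the whole file into memory.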
added jobname in makefile
delete embeddings after concat