odissei-lifecourse / life-sequencing-dutch

MIT License

Separate code and data as much as possible #27

Open f-hafner opened 4 months ago

f-hafner commented 4 months ago

I suggest that we completely separate the directories with code from the directories with data. This will make it easier to transfer files between the RA and the OSSC. Because we cannot pull code from the RA to the OSSC with git, we have to do this either manually or with a WinSCP script. This would best be done in a single step, i.e., by deleting all existing files in a directory and transferring the new ones. Currently, I'm doing this file by file.
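To illustrate the "single step" idea, here is a minimal Python sketch that wipes a code directory and replaces it with a fresh copy. It only works between two paths on the same file system, so it is not a substitute for the actual RA-to-OSSC transfer. The function name `replace_directory` and both path arguments are hypothetical and not part of the repository. The point is that this is only safe once the code directory holds no data.

```python
import shutil
from pathlib import Path


def replace_directory(src, dst):
    """Replace dst with a fresh copy of src (hypothetical helper).

    Everything currently in dst is deleted first, so dst must contain
    code only; any data stored there would be lost.
    """
    src = Path(src)
    dst = Path(dst)
    if not src.is_dir():
        raise NotADirectoryError(f"source is not a directory: {src}")
    # Remove stale files so that nothing from an older version lingers.
    if dst.exists():
        shutil.rmtree(dst)
    # Copy the whole tree in one call instead of file by file.
    shutil.copytree(src, dst)
```

If code and data shared a directory, the `shutil.rmtree` call would also remove the data, which is the main reason to keep them apart.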

I think this concerns both the network and the language models.

Let me know what you think, @dakota0064 and @tanzir5

tanzir5 commented 4 months ago

I agree. I will get to this as soon as I can.

f-hafner commented 3 weeks ago

I started on this on Snellius, and I suggest we develop a useful structure for the data there, write code that fits it, and then move some data around on the OSSC to match. Specifically, the directory looks like this:

|_cbs_data 
|    |_InkomenBestedingen
|_evaluation
|    |_*.pkl # various pickle files for the evaluation
|_graph
|    |_processed
|    |_walks
|    |_embeddings
|_liss_panel
|_llm
|    |_raw # files created by Lucas
|    |_processed # files after preprocessing by Tanzir
|    |_embeddings
|    |_models # checkpoints etc
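
As a rough sketch, the layout above could be recreated under any data root with a few lines of Python, so that Snellius and the OSSC end up with the same structure. The names `DATA_LAYOUT` and `create_data_layout` are made up for illustration, and the root path is whatever the caller passes in; only the directory names come from the tree above.

```python
from pathlib import Path
from typing import List

# Directory names taken from the tree above; the dict itself is illustrative.
DATA_LAYOUT = {
    "cbs_data": ["InkomenBestedingen"],
    "evaluation": [],
    "graph": ["processed", "walks", "embeddings"],
    "liss_panel": [],
    "llm": ["raw", "processed", "embeddings", "models"],
}


def create_data_layout(root) -> List[Path]:
    """Create the data directories under root and return their paths."""
    root = Path(root)
    created = []
    for top, subdirs in DATA_LAYOUT.items():
        for rel in [Path(top)] + [Path(top) / sub for sub in subdirs]:
            path = root / rel
            # exist_ok=True makes the call safe to repeat on an existing tree.
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created
```

Code could then take the data root as a single argument or setting and build all paths relative to it, which keeps the code directory free of data.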