Code to generate fake data that we use in our embedding evaluation
Main changes
create pickle files with fake person background data
extend create_fake_data.py to handle summary statistics from original datasets for which we only have the summary statistics of a single year
create pickle files with fake marriage data
create fake embeddings in hdf5 format
create a new class holding metadata information from our data summaries. I think this makes the code more understandable. (Illustrative sketches of the fake pickles, the hdf5 embeddings, and this metadata class follow after this list.)
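A minimal sketch of how the fake person background and marriage pickles could be produced. The column names, person counts, and file names are illustrative assumptions, not the exact schema used by create_fake_data.py.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n_persons = 1_000

# Fake person background data: one row per fake person id.
background = pd.DataFrame({
    "person_id": np.arange(n_persons),                       # hypothetical id column
    "birth_year": rng.integers(1940, 2005, size=n_persons),
    "gender": rng.integers(0, 2, size=n_persons),
    "municipality": rng.integers(1, 400, size=n_persons),
})
background.to_pickle("fake_background.pkl")

# Fake marriage data: random distinct pairs of person ids with a marriage year.
pairs = rng.choice(n_persons, size=(n_persons // 4, 2), replace=False)
marriages = pd.DataFrame({
    "partner_1": pairs[:, 0],
    "partner_2": pairs[:, 1],
    "marriage_year": rng.integers(1960, 2023, size=n_persons // 4),
})
marriages.to_pickle("fake_marriages.pkl")
```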
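A minimal sketch of writing fake embeddings to hdf5 with h5py. The dataset names ("ids", "embeddings") and file name are assumptions; the dimension of 8 mirrors the note below about the fake embeddings. The exact layout expected by the evaluation pipeline may differ.

```python
import h5py
import numpy as np

rng = np.random.default_rng(seed=0)
n_persons, dim = 1_000, 8

with h5py.File("fake_embeddings.h5", "w") as f:
    # One id per fake person, plus an (n_persons, dim) float32 embedding matrix.
    f.create_dataset("ids", data=np.arange(n_persons))
    f.create_dataset("embeddings", data=rng.normal(size=(n_persons, dim)).astype("float32"))
```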
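A rough sketch of what the new metadata class for the data summaries could look like; the class and field names are hypothetical, chosen only to illustrate the idea of bundling summary information in one place.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SummaryMetadata:
    """Metadata describing one summary-statistics file (illustrative fields)."""
    source_file: str
    year: int                           # the single year the summary covers
    variables: List[str] = field(default_factory=list)
    n_records: Optional[int] = None     # total records summarized, if known
```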
This creates all files necessary to run parts of the pop2vec/evaluation/slurm_scripts/isolate_evaluation_subsets.sh. Specifically, all embeddings are stored in hdf5 format already, so convert_pickle_embeddings.py and convert_embeddings_to_hdf5.py are not necessary for the fake data.
After this, one should be able to run the testbed.
Notes
I have not yet tried using the fake data in the evaluation pipeline.
All the embeddings I created have dimension 8; generating larger ones was taking too long for an unclear benefit. I leave it like this for now, but anything we run with these fake embeddings will not be indicative of the code's performance.
No embeddings were created for the very first LLM models that Tanzir had trained