Code to generate fake data that we use in our embedding evaluation
Main changes
create pickle files with fake person background data
extend create_fake_data.py to handle summary statistics from original datasets for which we only have the summary statistics of a single year
create pickle files with fake marriage data
create fake embeddings in hdf5 format
create a new class holding metadata information from our data summaries. I think this makes the code more understandable. (Illustrative sketches of the fake pickles, the hdf5 embeddings, and this metadata class follow after this list.)
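A minimal sketch of how the fake person background and marriage pickles could be produced. The column names, person counts, and file names are illustrative assumptions, not the exact schema used by create_fake_data.py.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n_persons = 1_000

# Fake person background data: one row per fake person id.
background = pd.DataFrame({
    "person_id": np.arange(n_persons),                       # hypothetical id column
    "birth_year": rng.integers(1940, 2005, size=n_persons),
    "gender": rng.integers(0, 2, size=n_persons),
    "municipality": rng.integers(1, 400, size=n_persons),
})
background.to_pickle("fake_background.pkl")

# Fake marriage data: random distinct pairs of person ids with a marriage year.
pairs = rng.choice(n_persons, size=(n_persons // 4, 2), replace=False)
marriages = pd.DataFrame({
    "partner_1": pairs[:, 0],
    "partner_2": pairs[:, 1],
    "marriage_year": rng.integers(1960, 2023, size=n_persons // 4),
})
marriages.to_pickle("fake_marriages.pkl")
```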
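A minimal sketch of writing fake embeddings to hdf5 with h5py. The dataset names ("ids", "embeddings") and file name are assumptions; the dimension of 8 mirrors the note below about the fake embeddings. The exact layout expected by the evaluation pipeline may differ.

```python
import h5py
import numpy as np

rng = np.random.default_rng(seed=0)
n_persons, dim = 1_000, 8

with h5py.File("fake_embeddings.h5", "w") as f:
    # One id per fake person, plus an (n_persons, dim) float32 embedding matrix.
    f.create_dataset("ids", data=np.arange(n_persons))
    f.create_dataset("embeddings", data=rng.normal(size=(n_persons, dim)).astype("float32"))
```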
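A rough sketch of what the new metadata class for the data summaries could look like; the class and field names are hypothetical, chosen only to illustrate the idea of bundling summary information in one place.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SummaryMetadata:
    """Metadata describing one summary-statistics file (illustrative fields)."""
    source_file: str
    year: int                           # the single year the summary covers
    variables: List[str] = field(default_factory=list)
    n_records: Optional[int] = None     # total records summarized, if known
```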
This creates all files necessary to run parts of the pop2vec/evaluation/slurm_scripts/isolate_evaluation_subsets.sh. Specifically, all embeddings are stored in hdf5 format already, so convert_pickle_embeddings.py and convert_embeddings_to_hdf5.py are not necessary for the fake data.
After this, one should be able to run the testbed.
Notes
I have not yet tried using the fake data in the evaluation pipeline.
All the embeddings I created have dimension 8; generating larger ones was taking too long for an unclear benefit. I leave it like this for now, but anything we run with these fake embeddings will not be indicative of the code's performance.
No embeddings were created for the very first LLM models that Tanzir had trained