We need a single ground truth for the data. For instance, (intermediate) output produced by Lucas should be available to query for the evaluation. This will help us avoid the kind of bugs we're currently trying to track down, and will make it easier to add more experiments to the evaluation.
What I think we need is a relational database with several tables.
- table `income`: reports the income for each person and year. has columns `personid`, `year`, `income`. has a unique index on `(personid, year)`.
- table `background`: has all variables that are fixed over time for a person. has columns `personid`, `gender`, `year of birth`, `municipality of birth`, etc. has a unique index on `personid`.
- table `marriages`: has columns `person1`, `partner`, and the `date` of the marriage. if a marries b, should there be a column for both of them? what's best here?
- table `deaths`: has columns `personid`, `date`. has a unique index on `personid`.
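For concreteness, here is a minimal sketch of these four tables, assuming SQLite as the backend; the column types (and the backend itself) are placeholders, not a decision:

```python
import sqlite3

# Sketch only: table/column names follow the list above, types are assumptions.
conn = sqlite3.connect("ground_truth.db")
conn.executescript(
    """
    CREATE TABLE IF NOT EXISTS income (
        personid INTEGER NOT NULL,
        year     INTEGER NOT NULL,
        income   REAL,
        UNIQUE (personid, year)
    );

    CREATE TABLE IF NOT EXISTS background (
        personid           INTEGER NOT NULL UNIQUE,
        gender             TEXT,
        birth_year         INTEGER,
        birth_municipality TEXT
        -- further time-invariant variables go here
    );

    CREATE TABLE IF NOT EXISTS marriages (
        person1 INTEGER NOT NULL,
        partner INTEGER NOT NULL,
        date    TEXT NOT NULL
        -- open question: one row per marriage, or one row per spouse?
    );

    CREATE TABLE IF NOT EXISTS deaths (
        personid INTEGER NOT NULL UNIQUE,
        date     TEXT
    );
    """
)
conn.commit()
conn.close()
```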
I think we also need tables for the following, but I'm not sure what the best structure for these tables is:
- any further tables with more information about the labor market. could we have a table `job_spells`? @Lsage ?
- a table with background data on real estate. it should have columns `object_id`, `year`, `value`, `neighborhood_id`. and then a table with characteristics of the neighborhood, i.e. with columns `neighborhood_id`, `year`, `var1`, `var2`, ...
- table(s) that report where a person lived and when. this may require more than one table, and I'm not sure what the best structure is. an easy solution could be to report the place of residence on a fixed day each year (see the sketch after this list).
- also include the LISS data. I need to check what they look like.
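If it helps the discussion, here is one possible (not definitive) structure for the real estate, neighborhood, and residence tables, in the same SQLite sketch; everything not named in the list above (`reference_date`, the link from `residence` to `real_estate`, the extra unique indexes) is an assumption:

```python
import sqlite3

conn = sqlite3.connect("ground_truth.db")
conn.executescript(
    """
    -- one row per person per year: the address on a fixed reference day of that year
    CREATE TABLE IF NOT EXISTS residence (
        personid       INTEGER NOT NULL,
        year           INTEGER NOT NULL,
        reference_date TEXT,              -- assumed: the fixed day used for that year
        object_id      INTEGER,           -- assumed link to real_estate.object_id
        UNIQUE (personid, year)
    );

    CREATE TABLE IF NOT EXISTS real_estate (
        object_id       INTEGER NOT NULL,
        year            INTEGER NOT NULL,
        value           REAL,
        neighborhood_id INTEGER,
        UNIQUE (object_id, year)          -- assumed: one valuation per object per year
    );

    CREATE TABLE IF NOT EXISTS neighborhood (
        neighborhood_id INTEGER NOT NULL,
        year            INTEGER NOT NULL,
        var1            REAL,
        var2            REAL,
        UNIQUE (neighborhood_id, year)
    );
    """
)
conn.commit()
conn.close()
```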
later, we can extend this if necessary with more tables. we could also consider adding variables such as the json sequences and the embeddings. see #71
I suggest Lucas/Ana prepare the data in a csv file. I would then upload the file to the OSSC and load it into a database. I think most of this is already done in https://github.com/odissei-lifecourse/life-sequencing-dutch/tree/main/pop2vec/evaluation/domain?
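A minimal sketch of that loading step, assuming one csv per table and SQLite again; the file names are placeholders:

```python
import sqlite3

import pandas as pd

# Placeholder file names; one csv per table is an assumption.
conn = sqlite3.connect("ground_truth.db")
for csv_path, table in [
    ("income.csv", "income"),
    ("background.csv", "background"),
    ("marriages.csv", "marriages"),
    ("deaths.csv", "deaths"),
]:
    df = pd.read_csv(csv_path)
    # append so the unique indexes defined above are enforced on insert
    df.to_sql(table, conn, if_exists="append", index=False)
conn.close()
```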
Am I missing anything here? is something not feasible?