odissei-lifecourse / life-sequencing-dutch

MIT License

Enriching Life-course Sequences with a Data-driven Approach #98

Open tanzir5 opened 1 month ago

tanzir5 commented 1 month ago

The raw data files contain a lot of columns. So far, Lucas has handpicked some columns and created sequences out of them, which is what we have been using. But the median length of these sequences is 30 tokens, with the max being around 70 tokens.

In terms of computational resources, we are limited by the total length of a person's sequence (512 tokens) and by the vocabulary size (# of unique tokens, which should be around 10k-20k [needs to be checked]). With a median of 30 tokens we are far below the 512-token limit, so there is room to add more columns.
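To check the numbers above against the two limits, something like the following could be run on the existing sequences. This is only a sketch: `sequence_stats` is a hypothetical helper, and it assumes the sequences are available as a non-empty dict mapping a person id to a list of string tokens, which may not match how they are actually stored.

```python
from statistics import median
from typing import Dict, List

MAX_SEQUENCE_LENGTH = 512  # per-person limit mentioned above


def sequence_stats(sequences: Dict[str, List[str]]) -> Dict[str, float]:
    """Summarise sequence lengths and vocabulary size (hypothetical helper)."""
    lengths = [len(tokens) for tokens in sequences.values()]
    # The vocabulary is the set of distinct tokens over all persons.
    vocabulary = {token for tokens in sequences.values() for token in tokens}
    return {
        "median_length": median(lengths),
        "max_length": max(lengths),
        "vocabulary_size": len(vocabulary),
        "n_over_limit": sum(length > MAX_SEQUENCE_LENGTH for length in lengths),
    }
```

Running this before and after adding columns would show how much of the 512-token budget and of the 10k-20k vocabulary the new columns use up.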

The plan is that I write code that picks out the most important columns from a given data file. Flavio or Ana can then write the following function:

def create_sequence(path: str, imp_columns: List[str], save_path: str)

This function should write a csv to save_path that satisfies the following:

  1. Every row should represent an event.
  2. Every row must have values for the following columns: RINPERSOON, daysSinceFirstEvent, Age.
     - daysSinceFirstEvent -> # of days that have passed since the oldest date we have. Currently this is 1971-12-30. We should figure out a place where this should be stored so that there is a single source of truth.
     - Age -> # of years that have passed between this person's birthdate and this event's date.
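A minimal sketch of what such a function could look like, using only the standard library. Several things here are assumptions and not settled in this issue: the `date_column` and `birth_column` parameters and their default names are hypothetical, dates are assumed to be ISO formatted (YYYY-MM-DD), every raw row is assumed to be one event, and Age is counted in whole years.

```python
import csv
from datetime import date
from typing import List

# Oldest date we have, as stated above; defined once so there is a single source of truth.
REFERENCE_DATE = date(1971, 12, 30)


def age_in_years(birth: date, event: date) -> int:
    """Whole years elapsed between the birthdate and the event date."""
    had_birthday = (event.month, event.day) >= (birth.month, birth.day)
    return event.year - birth.year - (0 if had_birthday else 1)


def create_sequence(
    path: str,
    imp_columns: List[str],
    save_path: str,
    date_column: str = "event_date",   # hypothetical column name
    birth_column: str = "birth_date",  # hypothetical column name
) -> None:
    """Write one row per event: RINPERSOON, daysSinceFirstEvent, Age, then imp_columns."""
    fieldnames = ["RINPERSOON", "daysSinceFirstEvent", "Age", *imp_columns]
    with open(path, newline="") as src, open(save_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            event = date.fromisoformat(row[date_column])
            birth = date.fromisoformat(row[birth_column])
            out = {
                "RINPERSOON": row["RINPERSOON"],
                "daysSinceFirstEvent": (event - REFERENCE_DATE).days,
                "Age": age_in_years(birth, event),
            }
            # Copy the selected columns through unchanged.
            out.update({col: row[col] for col in imp_columns})
            writer.writerow(out)
```

In practice the birthdate probably lives in a different file than the events, so the real function would need a lookup by RINPERSOON instead of a `birth_column` in the same row.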

If the final csv has 10 columns, each event will have a token for each of those 10 columns in the life-sequence, even if some of them are NaNs. So the final csv should really contain events of only one type. If necessary, multiple csvs can be written from the same source, one per event type.
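As an illustration of writing multiple csvs from one source, here is a rough sketch. It assumes the raw file has a column that identifies the event type, which we have not confirmed; `split_by_event_type` and `type_column` are hypothetical names. It reads the whole file into memory, so it is not meant for the full-size files as is.

```python
import csv
import os
from collections import defaultdict
from typing import Dict, List


def split_by_event_type(path: str, type_column: str, out_dir: str) -> Dict[str, str]:
    """Write one csv per event type, keeping only the columns that type actually uses."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            groups[row[type_column]].append(row)

    written = {}
    for event_type, rows in groups.items():
        # Drop columns that are empty for every row of this event type,
        # so they do not turn into NaN tokens in the sequence.
        used = [col for col in rows[0] if any(r[col] != "" for r in rows)]
        out_path = os.path.join(out_dir, f"{event_type}.csv")
        with open(out_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=used, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(rows)
        written[event_type] = out_path
    return written
```

Each resulting file could then be passed to create_sequence separately.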

We need to figure out whether every row in the raw data represents one event. If so, we can essentially pick out the columns with "date" as a substring of the name and use them for calculating time. But first, we need a better idea of how the data is stored in the raw files from CBS. Do they really contain one event per row?
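Picking out the date columns by name could look roughly like this. `find_date_columns` is a hypothetical helper; the substring is a parameter because we do not yet know how the date columns are actually named in the raw files.

```python
import csv
from typing import List


def find_date_columns(path: str, substring: str = "date") -> List[str]:
    """Return the header names that contain `substring`, ignoring case."""
    with open(path, newline="") as f:
        # Only the header row is read; an empty file yields no columns.
        header = next(csv.reader(f), [])
    return [name for name in header if substring.lower() in name.lower()]
```

If a file turns out to have more than one date column (for example a start and an end date), that would be a hint that a row describes a spell and not a single event, and it would need to be looked at by hand.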

tanzir5 commented 1 month ago

For reference, this is Lucas' code for SPOLISBUS:

https://github.com/odissei-lifecourse/life-sequencing-dutch/blob/0f29c3524149517a7c299472f4b53f04679bf362/pop2vec/evaluation/domain/spolisbus%20cleaning%20V2%20-%20sept%202023.R#L175

We may need to write a little bit of hardcoded stuff for individual files.

f-hafner commented 1 month ago