tanzir5 opened 1 month ago
For reference, this is Lucas' code for SPOLISBUS:
We may need to write a little bit of hardcoded stuff for individual files.
Two derived columns we want per event:

- `age` — we should be able to join this from the demographic files of people.
- `daysSinceFirstEvent` — whether we can compute this from the raw files themselves depends on what they report about calendar time.
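The two derived columns above could be joined on like this. This is only a sketch: the column names (`person_id`, `birth_year`, `event_date`) are guesses and need to be checked against the actual demographic and event file schemas.

```python
import pandas as pd

def add_person_features(events: pd.DataFrame, demo: pd.DataFrame) -> pd.DataFrame:
    """Join `age` and `daysSinceFirstEvent` onto an event table.

    Column names here are assumptions, not the real CBS schema.
    """
    out = events.merge(demo[["person_id", "birth_year"]], on="person_id", how="left")
    out["event_date"] = pd.to_datetime(out["event_date"])
    # age at the time of the event (year granularity only)
    out["age"] = out["event_date"].dt.year - out["birth_year"]
    # offset from each person's earliest recorded event
    first = out.groupby("person_id")["event_date"].transform("min")
    out["daysSinceFirstEvent"] = (out["event_date"] - first).dt.days
    return out
```

Note `daysSinceFirstEvent` only needs event dates relative to each other, so it works even if the raw files report calendar time in an unusual way, as long as dates are comparable.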
The raw data files we have contain a lot of columns. So far Lucas has handpicked some columns and created sequences from them, which is what we have used. The median sequence length is 30 tokens, with a max around 70 tokens.
From a computational-resources perspective, we are limited by the total length of a person's sequence (512 tokens) and by the vocabulary size (the number of unique tokens, which should be around 10k-20k [needs to be checked]).
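Both limits are easy to sanity-check once sequences exist. A minimal sketch (the function name and the dict it returns are illustrative, not an existing utility):

```python
MAX_SEQ_LEN = 512  # per-person context limit from the discussion above

def sequence_budget(sequences, vocab_limit=20_000):
    """Check token sequences against the two limits discussed:
    per-person sequence length (512) and vocabulary size (~10k-20k).

    `sequences` is an iterable of token lists.
    """
    sequences = list(sequences)
    vocab = {tok for seq in sequences for tok in seq}
    too_long = [i for i, seq in enumerate(sequences) if len(seq) > MAX_SEQ_LEN]
    return {
        "vocab_size": len(vocab),
        "vocab_ok": len(vocab) <= vocab_limit,
        "too_long_sequences": too_long,
    }
```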
The plan is that I can write code that filters out the most important columns from a given data file. Flavio or Ana can write the following function:

```python
def create_sequence(path: str, imp_columns: list[str], save_path: str) -> None:
```
This function should write a csv to `save_path` with the following property: if the final csv has 10 columns, each event will have a token for each of those 10 columns in the life-sequence, even if some of them are NaNs. So each final csv should really contain events of only one type. If necessary, multiple csvs can be written from the same source file, split by event type.
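A possible sketch of that function, under two assumptions that need checking against the real CBS files: the raw file is a csv, and it has an `event_type` column to split on.

```python
from pathlib import Path
import pandas as pd

def create_sequence(path: str, imp_columns: list[str], save_path: str) -> None:
    """Keep only the important columns and write one csv per event type,
    so every row in an output csv has the same token slots.

    Assumes a csv input with an "event_type" column (both are guesses).
    """
    df = pd.read_csv(path, usecols=imp_columns + ["event_type"])
    save = Path(save_path)
    for etype, group in df.groupby("event_type"):
        # one csv per event type; NaNs are kept so each event still
        # fills a token slot for every column
        out = save.with_name(f"{save.stem}_{etype}{save.suffix}")
        group[imp_columns].to_csv(out, index=False)
```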
We need to figure out whether every row in the raw data represents one event. If it does, we can essentially pick out the columns with "date" as a substring of the name and use them for calculating time. But first we need a better idea of how the data is stored in the raw files from CBS. Does each file really contain one event per row?
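Both checks above are cheap to automate. A sketch, where `person_id` is a guessed identifier column name:

```python
import pandas as pd

def find_date_columns(columns) -> list[str]:
    """Pick out the columns with "date" in the name (case-insensitive)."""
    return [c for c in columns if "date" in c.lower()]

def looks_like_one_event_per_row(df: pd.DataFrame, id_col: str = "person_id") -> bool:
    """Heuristic: if (person id, date columns) uniquely identify rows,
    each row plausibly describes one event. `id_col` is an assumed name.
    """
    key = [id_col] + find_date_columns(df.columns)
    return not df.duplicated(subset=key).any()
```

This is only a heuristic: a file could still store multiple events per row (e.g. wide-format spells), so a manual look at each raw file's documentation is still needed.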