Closed f-hafner closed 1 day ago
I will try to run these things on snellius and fix the problem there:
This problem goes deep. While we have a sqlite database with income and background, I think we need to create it differently. `convert_data_to_sqlite.py` uses `person_gender.pkl`, `person_birth_city.pkl` and `person_birth_year.pkl`; I guess these files were created sometime in the past off tables provided by Lucas. `background.csv`, used in his LLM, could perhaps be the source for the database? `income_baseline_by_year.pkl` comes from `generate_income_baseline.py` -- on line 42, person IDs are converted into int. I guess this conversion is to comply with the data types used in the background data.

What I would also like is an overview of how the data are processed by Lucas and how they are reused for model training and evaluation. I know Tanzir has some docs on this; need to check.
Also, what I don't understand: some raw files (`background.csv` and, for instance, `job_yearly/paycheck_2014.csv`) have RINPERSOON as ints, while for instance `mlm_encoded_upto_2017.h5` has RINPERSOON as strings. The raw files were not changed since Nov 2023. There is a `paycheck_2013.csv` that I created in July 2024 that does have the RINPERSOON ids as strings; in particular, the strings are of length 9 and left-padded with 0s. I created this file when investigating #58. So, does Tanzir's code convert ints to strings somewhere? And why was this not done on the raw files that were actually used to train the model?

I talked to a colleague working with the same data. He has been using integers for the IDs; this works, and it is faster for querying/processing data than using strings. Maybe a reason to keep (most) data as they are, but it's still unclear why the LLM code produces string identifiers, and why our joins don't work.
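To make the two ID representations concrete, here is a minimal sketch (with made-up ID values, not from the real data) of converting between integer RINPERSOON IDs and the 9-character, left-zero-padded string form seen in `paycheck_2013.csv`:

```python
import pandas as pd

# Hypothetical RINPERSOON values; real IDs have at most 9 digits.
ids_int = pd.Series([123456, 987654321], name="RINPERSOON")

# int -> 9-char string, left-padded with zeros
ids_str = ids_int.astype(str).str.zfill(9)
print(ids_str.tolist())  # ['000123456', '987654321']

# string -> int round-trips losslessly, since the padding is only leading zeros
assert ids_str.astype(int).equals(ids_int)
```

The round-trip works because `zfill` only adds leading zeros, which `astype(int)` discards again; no information is lost in either direction.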
Moreover, the overlap in IDs is also 0 when loading the embeddings from the network model. That's strange, because we can match the income data that Dakota's evaluation uses with the network embedding data.
Another issue is that the merge in line 439 of `baseline_evaluation_v1.py` joins on different data types in the real data: the RINPERSOON id is an integer in the `background_df` (as per the description of the `background.csv` data above), but a string in the `income_df`. I could fix this with `background_df['RINPERSOON'] = background_df['RINPERSOON'].astype(str).str.zfill(9)`. This led to a `merged_df` of about 13.4 mio rows instead of 12.06 mio. The fix still needs to be put into the code.

I guess we could do the same padding on the identifiers of the embeddings, or convert the income/background data that is joined to the embeddings to integer before joining with the embedding data. I probably won't have time to do this before Friday though.
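The two options above can be sketched side by side; a toy example with hypothetical frame and column contents (only the `RINPERSOON` column name is from the real data):

```python
import pandas as pd

# Hypothetical stand-ins: background uses int IDs, embeddings use padded strings.
background_df = pd.DataFrame({"RINPERSOON": [123456, 987654321], "bg": ["a", "b"]})
embeddings_df = pd.DataFrame({"RINPERSOON": ["000123456", "987654321"], "emb": [0.1, 0.2]})

# Option 1: pad the integer side to 9-char strings before joining
opt1 = background_df.assign(
    RINPERSOON=background_df["RINPERSOON"].astype(str).str.zfill(9)
).merge(embeddings_df, on="RINPERSOON")

# Option 2: convert the string side to int before joining
opt2 = background_df.merge(
    embeddings_df.assign(RINPERSOON=embeddings_df["RINPERSOON"].astype(int)),
    on="RINPERSOON",
)

assert len(opt1) == len(opt2) == 2  # both directions recover the same matches
```

Either direction works as long as it is applied consistently; the colleague's point about integers being faster would favour option 2.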
@f-hafner When I was creating the embeddings, I deliberately made all the RINPERSOON IDs strings because that seemed more intuitive. But we can move to keeping all RINPERSOON IDs as integers.
This creates a small issue though. A plain csv file does not contain metadata about the dtypes of its columns, so unless we say otherwise, pandas will always load a csv column containing only numeric values as a numeric dtype column. But `.sav` files contain a lot of metadata; that's why I think the RINPERSOON IDs in the income file are strings. The fix would be to convert the RINPERSOON ID to int while loading the income data.
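The csv dtype-inference point can be demonstrated in a few lines (toy data, hypothetical column contents):

```python
import io
import pandas as pd

csv_text = "RINPERSOON,income\n000123456,100\n987654321,200\n"

# Without a dtype hint, pandas infers a numeric column and drops the leading zeros
inferred = pd.read_csv(io.StringIO(csv_text))
assert inferred["RINPERSOON"].dtype.kind == "i"  # integer dtype, 123456 not 000123456

# Forcing dtype=str preserves the zero-padding, mimicking what metadata-rich
# formats like .sav keep automatically
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"RINPERSOON": str})
assert as_str["RINPERSOON"].tolist() == ["000123456", "987654321"]
```

The proposed fix is the reverse of the second read: load the income data and immediately apply `.astype(int)` to the ID column.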
But again, I do not completely understand why `background_df['RINPERSOON'] = background_df['RINPERSOON'].astype(str).str.zfill(9)` was necessary. Did the sav file have leading zeros?
Also, it's not correct that line 439 of `baseline_evaluation_v1.py` joins on different data types: while loading the background data, the RINPERSOON ID is specified to be "object" in the code. Otherwise the code would not have run.
@f-hafner Please verify whether lines 71 and 58, where the data are loaded, are okay.
`check_rinpersoon_int.py` passes.

The income data in SAV have leading 0s; that's why the `zfill(9)` was necessary.
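I don't know the exact contents of `check_rinpersoon_int.py`, but a check of this kind could look roughly like the following sketch: verify that string IDs are 9-digit, zero-padded, and round-trip to int losslessly (the function name and toy values are my own):

```python
import pandas as pd

def rinpersoon_roundtrips(ids: pd.Series) -> bool:
    """Check that string IDs are 9-char digit strings whose int form round-trips."""
    ok_format = ids.str.fullmatch(r"\d{9}").all()
    roundtrip = (ids.astype(int).astype(str).str.zfill(9) == ids).all()
    return bool(ok_format and roundtrip)

assert rinpersoon_roundtrips(pd.Series(["000123456", "987654321"]))
assert not rinpersoon_roundtrips(pd.Series(["123456"]))  # unpadded: wrong length
```

A check like this, run once per file, would have surfaced the int/string mismatch much earlier.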
With the latest fixes: an inner join of the embeddings and `merged_df` gives around 4k rows; I think this is because 50% of people in the `embeddings_df` don't have a record in `merged_df`. Need to check.

These two together should explain the issue here, I believe.
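To quantify how many embedding rows lack a counterpart in `merged_df`, pandas' merge indicator is handy; a toy example with hypothetical frames and IDs:

```python
import pandas as pd

# Hypothetical stand-ins: half the embedding IDs have no counterpart in merged_df
embeddings_df = pd.DataFrame({"RINPERSOON": [1, 2, 3, 4], "emb": [0.1, 0.2, 0.3, 0.4]})
merged_df = pd.DataFrame({"RINPERSOON": [3, 4, 5], "income": [10, 20, 30]})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
diag = embeddings_df.merge(merged_df, on="RINPERSOON", how="left", indicator=True)
unmatched_share = (diag["_merge"] == "left_only").mean()
print(unmatched_share)  # 0.5 in this toy example
```

Running this on the real frames would confirm (or refute) the 50%-missing hypothesis directly, instead of inferring it from row counts.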
Tanzir and I had a problem when running his evaluation with the embeddings: we could not join the embeddings to the background data. The reason is that the background data are dtype `object`, while the embedding data are `int`.

I checked, and this is only in the embedding subsets, and not in the embeddings that are written during model inference. The transformation happens in line 67 of `extract_embedding_subsets.py`. I believe we do this because we need to filter the embedding data against the set of identifiers, i.e., `income_eval_subset.pkl`, written in Dakota's pipeline. These identifiers are stored as ints.

It also seems that the transformation does something weird: the minimum value is much lower than the lowest value of identifiers in the background data. I don't know why this is.
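A quick way to investigate the suspicious minimum would be to compare the ID ranges across datasets; a sketch with a hypothetical helper and made-up values (only `RINPERSOON` is from the real data):

```python
import pandas as pd

def id_range(df: pd.DataFrame, col: str = "RINPERSOON") -> tuple:
    """Min/max of an ID column after coercing to int, for cross-dataset comparison."""
    ids = pd.to_numeric(df[col], errors="raise").astype("int64")
    return int(ids.min()), int(ids.max())

# Toy frames: one with a suspiciously small ID, one with padded-string IDs
emb = pd.DataFrame({"RINPERSOON": [5, 123456, 987654321]})
bg = pd.DataFrame({"RINPERSOON": ["000123456", "987654321"]})

print(id_range(emb), id_range(bg))
# If the minimum in the embedding subset is far below the background minimum,
# the transformation may (hypothetically) be emitting something like positional
# or categorical codes rather than the identifiers themselves.
```

Comparing the ranges on the real data would quickly show whether the low minimum is a few stray values or a wholesale replacement of the IDs.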
I think we should be careful in how we set up our data, and in particular how we store our identifiers. We could tidy this up through #90 and make sure we all pull the data in the same way.