Closed f-hafner closed 1 day ago
I will try to run these things on snellius and fix the problem there:
This problem goes deep. While we have a sqlite database with income and background, I think we need to create it differently. `convert_data_to_sqlite.py` uses `person_gender.pkl`, `person_birth_city.pkl` and `person_birth_year.pkl`; I guess these files were created sometime in the past off tables provided by Lucas. `background.csv`, used in his LLM, could perhaps be the source for the database? `income_baseline_by_year.pkl` comes from `generate_income_baseline.py` -- on line 42, person IDs are converted into int. I guess this conversion is to comply with the data types used in the background data.

What I would also like is an overview of how the data are processed by Lucas and how they are reused for model training and evaluation. I know Tanzir has some docs on this; need to check.
Also, what I don't understand: some raw files (`background.csv` and, for instance, `job_yearly/paycheck_2014.csv`) have RINPERSOON as ints, while for instance `mlm_encoded_upto_2017.h5` has RINPERSOON as strings. The raw files were not changed since Nov 2023. There is a `paycheck_2013.csv` that I created in July 2024 that does have the RINPERSOON ids as strings; in particular, the strings are of length 9 and left-padded with 0s. I created this file when investigating #58. So, does Tanzir's code convert ints to strings somewhere? And why was this not done on the raw files that were actually used to train the model?

I talked to a colleague working with the same data. He has been using integers for the IDs; this works, and it is faster for querying/processing data than using strings. Maybe a reason to keep (most) data as they are, but it's still unclear why the LLM code produces string identifiers, and why our joins don't work.
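To make the two ID representations concrete, here is a minimal sketch (with made-up ID values, not from the real data) of converting between integer RINPERSOON IDs and the 9-character, left-zero-padded string form seen in `paycheck_2013.csv`:

```python
import pandas as pd

# Hypothetical RINPERSOON values; real IDs have at most 9 digits.
ids_int = pd.Series([123456, 987654321], name="RINPERSOON")

# int -> 9-char string, left-padded with zeros
ids_str = ids_int.astype(str).str.zfill(9)
print(ids_str.tolist())  # ['000123456', '987654321']

# string -> int round-trips losslessly, since the padding is only leading zeros
assert ids_str.astype(int).equals(ids_int)
```

The round-trip works because `zfill` only adds leading zeros, which `astype(int)` discards again; no information is lost in either direction.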
Moreover, the overlap in IDs is also 0 when loading the embeddings from the network model. That's strange, because we can match the income data that Dakota's evaluation uses with the network embedding data.
Another issue is that the merge in line 439 of `baseline_evaluation_v1.py` joins on different data types in the real data: the RINPERSOON id is an integer in the `background_df` (as per the description of the `background.csv` data above), but a string in the `income_df`. I could fix this with `background_df['RINPERSOON'] = background_df['RINPERSOON'].astype(str).str.zfill(9)`. This led to a `merged_df` of about 13.4 mio rows instead of 12.06 mio. The fix still needs to be put into the code.

I guess we could do the same padding on the identifiers of the embeddings, or convert the income/background data that is joined to the embeddings to integer before joining with the embedding data. I probably won't have time to do this before Friday though.
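The two options above can be sketched side by side; a toy example with hypothetical frame and column contents (only the `RINPERSOON` column name is from the real data):

```python
import pandas as pd

# Hypothetical stand-ins: background uses int IDs, embeddings use padded strings.
background_df = pd.DataFrame({"RINPERSOON": [123456, 987654321], "bg": ["a", "b"]})
embeddings_df = pd.DataFrame({"RINPERSOON": ["000123456", "987654321"], "emb": [0.1, 0.2]})

# Option 1: pad the integer side to 9-char strings before joining
opt1 = background_df.assign(
    RINPERSOON=background_df["RINPERSOON"].astype(str).str.zfill(9)
).merge(embeddings_df, on="RINPERSOON")

# Option 2: convert the string side to int before joining
opt2 = background_df.merge(
    embeddings_df.assign(RINPERSOON=embeddings_df["RINPERSOON"].astype(int)),
    on="RINPERSOON",
)

assert len(opt1) == len(opt2) == 2  # both directions recover the same matches
```

Either direction works as long as it is applied consistently; the colleague's point about integers being faster would favour option 2.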
@f-hafner When I was creating the embeddings, I deliberately made all the RINPERSOON IDs strings because that seemed more intuitive. But we can move to keeping all RINPERSOON IDs as integers.
This creates a small issue though. A plain csv file does not contain metadata about the dtypes of its columns, so unless we say otherwise, pandas will always load a csv column containing only numeric values as a numeric dtype column. But `.sav` files contain a lot of metadata; that's why I think the RINPERSOON IDs in the income file are strings. The fix would be to convert the RINPERSOON ID to int while loading the income data.
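The csv dtype-inference point can be demonstrated in a few lines (toy data, hypothetical column contents):

```python
import io
import pandas as pd

csv_text = "RINPERSOON,income\n000123456,100\n987654321,200\n"

# Without a dtype hint, pandas infers a numeric column and drops the leading zeros
inferred = pd.read_csv(io.StringIO(csv_text))
assert inferred["RINPERSOON"].dtype.kind == "i"  # integer dtype, 123456 not 000123456

# Forcing dtype=str preserves the zero-padding, mimicking what metadata-rich
# formats like .sav keep automatically
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"RINPERSOON": str})
assert as_str["RINPERSOON"].tolist() == ["000123456", "987654321"]
```

The proposed fix is the reverse of the second read: load the income data and immediately apply `.astype(int)` to the ID column.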
But again, I do not completely understand why `background_df['RINPERSOON'] = background_df['RINPERSOON'].astype(str).str.zfill(9)` was necessary. Did the sav file have leading zeros?
Also, it's not correct that line 439 of `baseline_evaluation_v1.py` joins on different data types: while loading the background data, the RINPERSOON ID is specified to be "object" in the code. Otherwise the code would not have run.
@f-hafner Please verify whether lines 71 and 58, where the data are loaded, are okay.
`check_rinpersoon_int.py` passes.

The income data in SAV have leading 0s; that's why the `zfill(9)` was necessary.
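I don't know the exact contents of `check_rinpersoon_int.py`, but a check of this kind could look roughly like the following sketch: verify that string IDs are 9-digit, zero-padded, and round-trip to int losslessly (the function name and toy values are my own):

```python
import pandas as pd

def rinpersoon_roundtrips(ids: pd.Series) -> bool:
    """Check that string IDs are 9-char digit strings whose int form round-trips."""
    ok_format = ids.str.fullmatch(r"\d{9}").all()
    roundtrip = (ids.astype(int).astype(str).str.zfill(9) == ids).all()
    return bool(ok_format and roundtrip)

assert rinpersoon_roundtrips(pd.Series(["000123456", "987654321"]))
assert not rinpersoon_roundtrips(pd.Series(["123456"]))  # unpadded: wrong length
```

A check like this, run once per file, would have surfaced the int/string mismatch much earlier.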
With the latest fixes: an inner join of the embeddings and `merged_df` gives around 4k rows; I think this is because 50% of people in the `embeddings_df` don't have a record in `merged_df`. Need to check.

These two together should explain the issue here, I believe.
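To quantify how many embedding rows lack a counterpart in `merged_df`, pandas' merge indicator is handy; a toy example with hypothetical frames and IDs:

```python
import pandas as pd

# Hypothetical stand-ins: half the embedding IDs have no counterpart in merged_df
embeddings_df = pd.DataFrame({"RINPERSOON": [1, 2, 3, 4], "emb": [0.1, 0.2, 0.3, 0.4]})
merged_df = pd.DataFrame({"RINPERSOON": [3, 4, 5], "income": [10, 20, 30]})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
diag = embeddings_df.merge(merged_df, on="RINPERSOON", how="left", indicator=True)
unmatched_share = (diag["_merge"] == "left_only").mean()
print(unmatched_share)  # 0.5 in this toy example
```

Running this on the real frames would confirm (or refute) the 50%-missing hypothesis directly, instead of inferring it from row counts.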
Tanzir and I had a problem when running his evaluation with the embeddings: we could not join the embeddings to the background data. The reason is that the background data are dtype `object`, while the embedding data are `int`.

I checked, and this is only in the embedding subsets, and not in the embeddings that are written during model inference. The transformation happens in line 67 of `extract_embedding_subsets.py`. I believe we do this because we need to filter the embedding data against the set of identifiers, i.e., `income_eval_subset.pkl`, written in Dakota's pipeline. These identifiers are stored as ints.

It also seems that the transformation does something weird: the minimum value is much lower than the lowest value of identifiers in the background data. I don't know why this is.
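A quick way to investigate the suspicious minimum would be to compare the ID ranges across datasets; a sketch with a hypothetical helper and made-up values (only `RINPERSOON` is from the real data):

```python
import pandas as pd

def id_range(df: pd.DataFrame, col: str = "RINPERSOON") -> tuple:
    """Min/max of an ID column after coercing to int, for cross-dataset comparison."""
    ids = pd.to_numeric(df[col], errors="raise").astype("int64")
    return int(ids.min()), int(ids.max())

# Toy frames: one with a suspiciously small ID, one with padded-string IDs
emb = pd.DataFrame({"RINPERSOON": [5, 123456, 987654321]})
bg = pd.DataFrame({"RINPERSOON": ["000123456", "987654321"]})

print(id_range(emb), id_range(bg))
# If the minimum in the embedding subset is far below the background minimum,
# the transformation may (hypothetically) be emitting something like positional
# or categorical codes rather than the identifiers themselves.
```

Comparing the ranges on the real data would quickly show whether the low minimum is a few stray values or a wholesale replacement of the IDs.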
I think we should be careful in how we set up our data, and in particular how we store our identifiers. We could tidy this up through #90 and make sure we all pull the data in the same way.