odissei-lifecourse / life-sequencing-dutch

person ID data types for evaluation #92

Closed f-hafner closed 1 day ago

f-hafner commented 1 week ago

Tanzir and I had a problem when running his evaluation with the embeddings: we could not join the embeddings to the background data. The reason is that the identifiers in the background data have dtype object, while the identifiers in the embedding data are int.
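A minimal sketch of how such a mismatch can be spotted before joining. The two frames, the extra columns and the helper key_dtypes are toy stand-ins invented for illustration; only the column name RINPERSOON is taken from this thread.

```python
import pandas as pd

# Toy stand-ins: identifiers stored as strings on one side and as integers on the other.
background_df = pd.DataFrame({"RINPERSOON": ["000000012", "000000345"], "gender": [1, 2]})
embedding_df = pd.DataFrame({"RINPERSOON": [12, 345], "emb_0": [0.1, 0.2]})


def key_dtypes(left, right, key):
    """Return the dtype of the join key in each frame (hypothetical helper)."""
    return left[key].dtype, right[key].dtype


left_dtype, right_dtype = key_dtypes(background_df, embedding_df, "RINPERSOON")
dtypes_match = left_dtype == right_dtype
print(left_dtype, right_dtype, dtypes_match)
```

Running a check like this before each join makes the problem visible early instead of surfacing as a failed or empty merge.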

I checked, and this is only the case in the embedding subsets, not in the embeddings that are written during model inference. The transformation happens in line 67 of extract_embedding_subsets.py. I believe we do this because we need to filter the embedding data against the set of identifiers, i.e., income_eval_subset.pkl, written in Dakota's pipeline. These identifiers are stored as ints.

It also seems that the transformation does something weird: the minimum value is much lower than the lowest value of identifiers in the background data. I don't know why this is.

I think we should be careful in how we set up our data, and in particular how we store our identifiers. We could tidy this through #90 , and make sure we all pull the data in the same way.

f-hafner commented 1 week ago

I will try to run these things on Snellius and fix the problem there.

f-hafner commented 1 week ago

This problem goes deep. While we have a sqlite database with income and background, I think we need to create it differently

From looking at convert_data_to_sqlite.py:

the origin of the background data in there is not traceable

the income data written to the database come from income_baseline_by_year.pkl

I would also like to have an overview of how the data are processed by Lucas and how they are reused for model training and evaluation. I know Tanzir has some docs on this. Need to check.

Also, what I don't understand

f-hafner commented 1 week ago

I talked to a colleague working with the same data. He has been using integers for the IDs, and this works and it's faster for querying/processing data than using strings. Maybe a reason to keep (most) data as they are, but it's still unclear why LLM code produces string identifiers, and why our joins don't work.

f-hafner commented 1 week ago

Moreover, the overlap in IDs is also 0 when loading the embeddings from the network model. That's strange, because we can match the income data that Dakota's evaluation uses with the network embedding data.
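One way an overlap of exactly 0 can arise is comparing integer IDs against string IDs. Whether that is the cause here is not established in this thread; the sketch below only illustrates the effect on made-up values.

```python
import pandas as pd

# Made-up identifiers: integers on one side, zero-padded strings on the other.
embedding_ids = pd.Series([12, 345, 6789])
income_ids = pd.Series(["000000012", "000000345"])

# Comparing raw values: an int never equals a string, so the overlap is empty.
raw_overlap = set(embedding_ids) & set(income_ids)

# Comparing after converting both sides to integers.
int_overlap = set(embedding_ids.astype("int64")) & set(income_ids.astype("int64"))

print(len(raw_overlap), len(int_overlap))
```

If the overlap stays at 0 even after both sides are converted to the same type, the mismatch is in the values themselves and not only in the dtype.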

f-hafner commented 1 week ago

Another issue is that the merge in line 439 of baseline_evaluation_v1.py joins on different data types in the real data: the RINPERSOON ID is an integer in background_df (as per the description of the background.csv data above), but a string in income_df. I could fix this with background_df['RINPERSOON'] = background_df['RINPERSOON'].astype(str).str.zfill(9). This led to a merged_df of about 13.4 million rows instead of 12.06 million. The fix still needs to be put into the code.

I guess we could do the same padding on the identifiers of the embeddings, or convert the income/background data that is joined to the embeddings to integer before joining with the embedding data. I probably won't have time to do this before Friday though.
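The two options mentioned above (padding to strings, or converting to integers) can be sketched as follows. The frames and the income and gender columns are toy data made up for illustration; the validate argument is an optional safeguard that fits this toy data, where each ID appears once per frame, and may not fit the real tables.

```python
import pandas as pd

# Toy stand-ins: integer IDs in the background data, zero-padded string IDs in the income data.
background_df = pd.DataFrame({"RINPERSOON": [12, 345], "gender": [1, 2]})
income_df = pd.DataFrame({"RINPERSOON": ["000000012", "000000345"], "income": [30000, 45000]})

# Option 1: pad the integer IDs to 9-character strings, then join on strings.
bg_str = background_df.assign(
    RINPERSOON=background_df["RINPERSOON"].astype(str).str.zfill(9)
)
merged_str = bg_str.merge(income_df, on="RINPERSOON", how="inner", validate="one_to_one")

# Option 2: convert the string IDs to integers, then join on integers.
inc_int = income_df.assign(RINPERSOON=income_df["RINPERSOON"].astype("int64"))
merged_int = background_df.merge(inc_int, on="RINPERSOON", how="inner", validate="one_to_one")

print(len(merged_str), len(merged_int))
```

Both options give the same matches on this toy data; the choice mainly determines which representation the rest of the pipeline has to follow.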

tanzir5 commented 6 days ago

@f-hafner When I was creating the embeddings, I consciously stored all the RINPERSOON IDs as strings because that seemed more intuitive. But we can move on to keeping all RINPERSOON IDs as integers.

This creates a small issue though. A plain csv file does not contain metadata about the dtypes of its columns. So, unless we specify otherwise, pandas is always going to load a csv column that contains only numeric values as a numeric dtype. But .sav files contain a lot of metadata. That's why I think the RINPERSOON IDs in the income file are strings. The fix would be to convert the RINPERSOON ID to int while loading the income data.
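A small sketch of the csv behaviour described here, using an in-memory csv with made-up values. It shows the default type inference, an explicit string dtype, and a conversion to int after loading.

```python
import io

import pandas as pd

# Made-up csv content with zero-padded identifiers.
csv_text = "RINPERSOON,income\n000000012,30000\n000000345,45000\n"

# Default: pandas infers a numeric dtype, so the leading zeros are lost.
default_df = pd.read_csv(io.StringIO(csv_text))

# Explicit dtype: the identifiers are kept as strings, including the padding.
string_df = pd.read_csv(io.StringIO(csv_text), dtype={"RINPERSOON": str})

# Converting string identifiers to integers after loading.
converted = string_df["RINPERSOON"].astype("int64")

print(default_df["RINPERSOON"].tolist())
print(string_df["RINPERSOON"].tolist())
```

Passing the dtype explicitly at load time keeps the choice of representation in one place instead of relying on what pandas infers.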

But again, I do not understand completely why background_df['RINPERSOON'] = background_df['RINPERSOON'].astype(str).str.zfill(9) was necessary.

Did the sav file have leading zeros?

Also it's not correct that line 439 of baseline_evaluation_v1.py joins on different data types. While loading the background data, the RINPERSOON_ID is specified to be "object" in the code. Otherwise the code would not have run.

tanzir5 commented 6 days ago

@f-hafner Please verify whether line 71 and line 58 for loading data are okay.

https://github.com/odissei-lifecourse/life-sequencing-dutch/commit/c09d0fbd211181e16e3308672510616de1953d77

f-hafner commented 6 days ago

check_rinpersoon_int.py passes

f-hafner commented 6 days ago

The income data in SAV have leading 0s; that's why the zfill(9) was necessary.

f-hafner commented 6 days ago

with latest fixes: inner join embeddings and merged_df gives around 4k rows; I think this is because 50% of people in the embeddings_df don't have a record in merged_df
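A hedged sketch of how the share of unmatched people could be checked, using a left join with pandas' indicator option. The frames below are toy data with made-up values; only the names embeddings_df, merged_df and RINPERSOON come from this thread.

```python
import pandas as pd

# Toy stand-ins for the embeddings and the merged income/background data.
embeddings_df = pd.DataFrame({"RINPERSOON": [1, 2, 3, 4], "emb_0": [0.1, 0.2, 0.3, 0.4]})
merged_df = pd.DataFrame({"RINPERSOON": [2, 4, 5], "income": [10, 20, 30]})

# Left join on the unique IDs of merged_df; _merge records where each row was found.
check = embeddings_df.merge(
    merged_df[["RINPERSOON"]].drop_duplicates(),
    on="RINPERSOON",
    how="left",
    indicator=True,
)

share_matched = (check["_merge"] == "both").mean()
unmatched_ids = check.loc[check["_merge"] == "left_only", "RINPERSOON"].tolist()

print(share_matched, unmatched_ids)
```

Inspecting a few of the unmatched IDs by hand can help tell apart people who genuinely have no record from IDs that fail to match because of formatting.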

tanzir5 commented 1 day ago

Check

  1. https://github.com/odissei-lifecourse/life-sequencing-dutch/issues/80
  2. https://github.com/odissei-lifecourse/data/blob/main/info/income_info.md

These two together should explain the issue here I believe.