mitre / data-owner-tools

Tools for the Childhood Obesity Data Initiative (CODI) data owners and partners to use in record linkage
Apache License 2.0

Fix data characterization script #45

Closed by dehall 1 year ago

dehall commented 2 years ago

Some recent changes broke the data characterization script. Users were getting the following error when running it against a DB or a CSV:

$ py data_analysis.py -s v2 --db postgresql://codi:codi@localhost/final_site_d
Traceback (most recent call last):
  File "/Users/dehall/data-owner-tools-review/data_analysis.py", line 207, in <module>
    results = analyze(db_data, args.schema)
  File "/Users/dehall/data-owner-tools-review/data_analysis.py", line 41, in analyze
    record_id_col = case_insensitive_lookup(data, "record_id", source)
  File "/Users/dehall/data-owner-tools-review/utils/data_reader.py", line 126, in case_insensitive_lookup
    return data if (data != "") else None
  File "/usr/local/lib/python3.9/site-packages/pandas/core/generic.py", line 1537, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

This PR tweaks the case_insensitive_lookup function so that it can be called from data_analysis.py. It should still work fine when called from extract.py. The key difference is that when called from data_analysis.py, row is a pandas DataFrame and row[key] is a pandas Series, whereas when called from extract.py, row is a plain dict and row[key] is a string. Comparing a Series to "" yields a boolean Series rather than a single bool, which is why the conditional on line 126 in the traceback raised the ValueError. The changes here should allow for concatenating either two strings or two Series of strings, since the + operator works on both.
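For reviewers who want to see the dict-versus-DataFrame difference in isolation, here is a minimal sketch. It is not the code in utils/data_reader.py: join_fields and is_blank are hypothetical helpers, the column names are made up, and treating a Series as blank only when every element is empty is an assumption chosen for illustration.

```python
import pandas as pd


def join_fields(row, first_key, second_key):
    # Hypothetical helper, not the project's case_insensitive_lookup.
    # "+" concatenates two str values, and concatenates two Series
    # of strings element-wise, so the same expression serves both callers.
    return row[first_key] + row[second_key]


def is_blank(value):
    # Hypothetical helper. Writing `if value != ""` fails for a Series
    # because the comparison returns a boolean Series whose truth value
    # is ambiguous. Reduce it explicitly instead (assumption: "blank"
    # means every element is the empty string).
    if isinstance(value, pd.Series):
        return bool((value == "").all())
    return value == ""


# extract.py-style input: a plain dict of strings (made-up column names)
dict_row = {"given_name": "Ann", "family_name": "Lee"}

# data_analysis.py-style input: a DataFrame, so row[key] is a Series
frame = pd.DataFrame(
    {"given_name": ["Ann", "Bob"], "family_name": ["Lee", "Ray"]}
)

joined_str = join_fields(dict_row, "given_name", "family_name")
joined_series = join_fields(frame, "given_name", "family_name")
```

With these inputs, joined_str is a single string and joined_series is a Series with one concatenated value per row. Calling bool() on frame["given_name"] != "" reproduces the ValueError from the traceback.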

The change to data_analysis.py is to prevent this warning:

data_analysis.py:193: DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
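The usual way to silence this warning is to pass dtype when constructing the empty Series. The sketch below assumes "object" is the intended dtype, since that is the future default named in the warning. The variable name is hypothetical, and the actual line 193 of data_analysis.py is not shown here.

```python
import pandas as pd

# Passing dtype explicitly avoids relying on the version-dependent
# default for an empty Series. "object" is an assumption; use whatever
# dtype the downstream code expects.
empty_values = pd.Series(dtype="object")
```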