Some recent changes broke the data characterization script. Users were getting the following error when running it against a DB or a CSV:
$ py data_analysis.py -s v2 --db postgresql://codi:codi@localhost/final_site_d
Traceback (most recent call last):
File "/Users/dehall/data-owner-tools-review/data_analysis.py", line 207, in <module>
results = analyze(db_data, args.schema)
File "/Users/dehall/data-owner-tools-review/data_analysis.py", line 41, in analyze
record_id_col = case_insensitive_lookup(data, "record_id", source)
File "/Users/dehall/data-owner-tools-review/utils/data_reader.py", line 126, in case_insensitive_lookup
return data if (data != "") else None
File "/usr/local/lib/python3.9/site-packages/pandas/core/generic.py", line 1537, in __nonzero__
raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
This PR tweaks the case_insensitive_lookup function to better support being called from data_analysis.py. It should still work fine when called from extract.py. The key difference is the type of row: when called from data_analysis.py, row is a pandas DataFrame and row[key] is a pandas Series, whereas when called from extract.py, row is a plain dict and row[key] is a string. The old check, data != "", yields a boolean Series in the first case, which cannot be used as a condition and raises the ValueError above. The changes here should allow for concatenating either two strings or two Series of strings, since the + operator works on both.
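To illustrate the difference between the two call sites, here is a minimal sketch. It is not the actual case_insensitive_lookup code: the helper name concat_fields and the column names are hypothetical and only demonstrate that + behaves the same on strings and on Series, while a truth test on a Series comparison fails.

```python
import pandas as pd


def concat_fields(row, first_key, second_key):
    """Join two fields with '+'.

    Hypothetical helper: works when row is a dict of strings and
    when row is a DataFrame, because '+' is defined for both str and Series.
    """
    return row[first_key] + row[second_key]


# Roughly how extract.py calls it: row is a plain dict, row[key] is a str.
dict_row = {"given_name": "Ada", "family_name": "Lovelace"}
from_dict = concat_fields(dict_row, "given_name", "family_name")

# Roughly how data_analysis.py calls it: row is a DataFrame, row[key] is a Series.
frame = pd.DataFrame(
    {"given_name": ["Ada", "Alan"], "family_name": ["Lovelace", "Turing"]}
)
from_frame = concat_fields(frame, "given_name", "family_name")

# Comparing a Series to "" gives a boolean Series; using that as a condition
# raises the "truth value of a Series is ambiguous" ValueError.
try:
    bool(frame["given_name"] != "")
    raised = False
except ValueError:
    raised = True
```

With a dict the result is a single string, and with a DataFrame it is a Series of element-wise concatenated strings, so the same expression serves both callers.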
The change to data_analysis.py is to prevent this warning:
data_analysis.py:193: DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
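The usual way to silence this warning is to pass a dtype when constructing the empty Series. The snippet below is a sketch only: the actual code at line 193 is not shown here, and the variable name is hypothetical.

```python
import pandas as pd

# Giving the empty Series an explicit dtype avoids relying on the
# default, which is what the DeprecationWarning complains about.
empty_column = pd.Series([], dtype="object")
```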