Exploratory Data Analysis Merge Issue

yankodavila commented 2 years ago

Hello, I have been encountering an issue while running the lab. The Jupyter notebook 03.f1_analysis_EDA.ipynb raises the following error in cell number 5:

ValueError                                Traceback (most recent call last)
in
----> 1 df1 = pd.merge(races,results,how='inner',on=['raceId'])
      2 df2 = pd.merge(df1,quali,how='inner',on=['raceId','driverId','constructorId'])
      3 df3 = pd.merge(df2,drivers,how='inner',on=['driverId'])
      4 df4 = pd.merge(df3,constructors,how='inner',on=['constructorId'])
      5 df5 = pd.merge(df4,circuit,how='inner',on=['circuitId'])

~/redbullenv/lib64/python3.6/site-packages/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     85         copy=copy,
     86         indicator=indicator,
---> 87         validate=validate,
     88     )
     89     return op.get_result()

~/redbullenv/lib64/python3.6/site-packages/pandas/core/reshape/merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    654         # validate the merge keys dtypes. We may need to coerce
    655         # to avoid incompatible dtypes
--> 656         self._maybe_coerce_merge_keys()
    657
    658         # If argument passed to validate,

~/redbullenv/lib64/python3.6/site-packages/pandas/core/reshape/merge.py in _maybe_coerce_merge_keys(self)
   1163                     inferred_right in string_types and inferred_left not in string_types
   1164                 ):
-> 1165                     raise ValueError(msg)
   1166
   1167             # datetimelikes must match exactly

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

I'm using the automatic deployment that Oracle provides as part of the lab environment. I don't have much experience with Python, but one possible solution is to read the numeric values from the CSV file as integers or floats, although I'm almost certain the real fix is a little more elaborate than that 😉. Anyway, thanks for your time. I'm really excited to test your solution and finish the lab. Thanks again.
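For anyone trying to narrow this down, here is a minimal sketch of how the mismatched merge key could be located. It assumes the error comes from a key column such as raceId being parsed as strings (object) in one frame and as integers (int64) in the other; the thread does not confirm which frame is affected. The helper name key_dtype_mismatches and the toy frames are hypothetical and only stand in for the lab's data.

```python
import pandas as pd


def key_dtype_mismatches(left, right, on):
    """Return {key: (left dtype, right dtype)} for merge keys whose dtypes differ."""
    return {
        key: (str(left[key].dtype), str(right[key].dtype))
        for key in on
        if left[key].dtype != right[key].dtype
    }


# Toy stand-ins for the lab's frames (made-up values, not the real CSV contents).
races = pd.DataFrame({
    "raceId": pd.Series(["1", "2"], dtype=object),   # key parsed as strings
    "year": [2009, 2009],
})
results = pd.DataFrame({
    "raceId": pd.Series([1, 2], dtype="int64"),      # key parsed as integers
    "driverId": [10, 20],
})

# Any key listed here would trigger the "object and int64" ValueError in pd.merge.
mismatches = key_dtype_mismatches(races, results, ["raceId"])
print(mismatches)
```

Running the same check on the real races and results frames before cell 5 should show which side carries the object column.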

mmeija commented 2 years ago

I ran into this as well. I haven't been able to find the cause, but a workaround that I've found is to change

races = pd.read_csv(r'./data_f1/races.csv')

to

races = pd.read_csv(r'./data_f1/races.csv', usecols=['raceId','year','round','circuitId','name','date','time','url','fp1_date','fp1_time','fp2_date','fp2_time','fp3_date','fp3_time','quali_date','quali_time','sprint_date','sprint_time'])

or perhaps this, depending on your races.csv columns:

races = pd.read_csv(r'./data_f1/races.csv', usecols=['raceId','year','round','circuitId','name','date','time','url'])

It seems to work for me after that. I guess races.csv might be downloaded during provisioning on OCI, and I think it has possibly changed. As you suggest, it does seem that read_csv doesn't end up with the right column data types, which makes the merge fail. Explicitly listing the columns seems to correct that for some reason.
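An alternative to listing the columns would be to convert the merge key explicitly before merging. The following is only a sketch under the assumption that the key values are numeric strings; coerce_keys_to_int is a hypothetical helper, not part of the lab notebook, and the frames are toy data.

```python
import pandas as pd


def coerce_keys_to_int(df, keys):
    """Return a copy of df with the given key columns converted to int64."""
    out = df.copy()
    for key in keys:
        # pd.to_numeric raises ValueError if a value cannot be parsed as a number,
        # and astype("int64") fails on missing values, so bad data surfaces here
        # instead of inside pd.merge.
        out[key] = pd.to_numeric(out[key]).astype("int64")
    return out


# Toy stand-ins for the lab's frames (made-up values, not the real CSV contents).
races = pd.DataFrame({
    "raceId": pd.Series(["1", "2"], dtype=object),   # key parsed as strings
    "circuitId": [1, 2],
})
results = pd.DataFrame({
    "raceId": pd.Series([1, 2], dtype="int64"),      # key parsed as integers
    "driverId": [10, 20],
})

races_fixed = coerce_keys_to_int(races, ["raceId"])
df1 = pd.merge(races_fixed, results, how="inner", on=["raceId"])
print(df1)
```

If the conversion succeeds, passing dtype={'raceId': 'int64'} to pd.read_csv might achieve the same thing at load time, though I have not checked that against the current races.csv.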

yankodavila commented 2 years ago

Hi @mmeija, thanks for your suggestion. I tried it, and it works for me. I guess all that is left to do is create a pull request. Thanks for your time.

jasperan commented 1 year ago

@yankodavila @mmeija I'm doing a periodic quality assurance of the workshop and will take your comments and fix into consideration!

oracle-devrel / redbull-analytics-hol #44