jldbc / pybaseball

Pull current and historical baseball statistics using Python (Statcast, Baseball Reference, FanGraphs)
MIT License

Remove duplicates in playerid_lookup fuzzy search #373

Open mhmills opened 11 months ago

mhmills commented 11 months ago

As mentioned in #358, searching playerid_lookup("tatis", "fernando", fuzzy=True) currently returns duplicate rows for Fernando Tatís Jr and Sr. The query has no exact match because the correct spelling is Tatís, with the accented í, not Tatis, so the fuzzy search runs. Since the Chadwick names for Tatís Jr and Sr are identical, 'Fernando Tatís' accounts for 2 of the 5 names in fuzzy_matches when it is merged with the player table in get_closest_names(). Each copy of the name matches the table rows for both Tatís Jr and Sr, so each player appears twice.
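The duplication can be reproduced with plain pandas, independent of pybaseball. This is only a toy sketch: the column name `chadwick_name`, the `player_id` values, and the list of matched names are assumptions made for illustration, not the library's actual data.

```python
import pandas as pd

# Toy stand-in for the player table. The IDs are made up; the point is
# that two different players share the same name string.
players = pd.DataFrame({
    "chadwick_name": [
        "fernando tatís",   # Jr
        "fernando tatís",   # Sr
        "fernando abad",
        "fernando rodney",
        "fernando cabrera",
    ],
    "player_id": [1, 2, 3, 4, 5],
})

# Hypothetical top-5 fuzzy matches: the shared name appears twice.
fuzzy_matches = pd.DataFrame({
    "chadwick_name": [
        "fernando tatís",
        "fernando tatís",
        "fernando abad",
        "fernando rodney",
        "fernando cabrera",
    ],
})

# Each copy of the repeated name matches both player rows, so the two
# players with that name each show up twice in the merged result.
merged = fuzzy_matches.merge(players, on="chadwick_name")
n_rows = len(merged)
n_repeated = int(merged["player_id"].duplicated().sum())
print(n_rows, n_repeated)
```

Five names go into the merge, but more than five rows come out, because a merge pairs every matching left row with every matching right row.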

The change I made drops the duplicate name before the merge, so fuzzy_matches has 4 names instead of 5. The single remaining copy of the name matches the table rows for both Jr and Sr, so the merge still returns 5 players, as expected. The same behavior shows up in a fuzzy search for Vladimir Guerrero Jr and Sr, such as playerid_lookup("guerrero", "vladimi", fuzzy=True).
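A minimal sketch of the idea, using pandas `drop_duplicates` on toy data. The column name `chadwick_name`, the `player_id` values, and the matched names are assumptions for illustration; the actual change in get_closest_names() may be written differently.

```python
import pandas as pd

# Toy player table with made-up IDs; two players share one name string.
player_table = pd.DataFrame({
    "chadwick_name": [
        "fernando tatís",   # Jr
        "fernando tatís",   # Sr
        "fernando abad",
        "fernando rodney",
        "fernando cabrera",
    ],
    "player_id": [1, 2, 3, 4, 5],
})

# Hypothetical fuzzy matches containing the shared name twice.
matches = pd.DataFrame({
    "chadwick_name": [
        "fernando tatís",
        "fernando tatís",
        "fernando abad",
        "fernando rodney",
        "fernando cabrera",
    ],
})

# Drop the repeated name before merging: 5 names become 4.
unique_matches = matches.drop_duplicates(subset="chadwick_name")

# The one remaining copy of the shared name still matches both players,
# so the merge yields one row per player.
result = unique_matches.merge(player_table, on="chadwick_name")
print(len(unique_matches), len(result))
```

Deduplicating the names rather than the merged rows keeps both players who share a name, since they remain distinct rows in the player table.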