toddrob99 / MLB-StatsAPI

Python wrapper for MLB Stats API
GNU General Public License v3.0
513 stars 98 forks source link

Normalize diacritics in player lookup #138

Open ap-ack opened 2 months ago

ap-ack commented 2 months ago

statsapi.lookup_player('Acuna') returns an empty list, not including Ronald Acuña Jr. because it is not a direct text match due to the diacritic on the n. This effectively makes the player lookup feature unusable for search queries coming from user input, as it may not be obvious to an end user that they need to use characters like this and they may not have easy access to these characters on their keyboard. Additionally, some MLB players like Emmanuel Ramirez don't use diacritics in their names, while others like José Ramírez do, and users won't know when they have to use the diacritics to get their expected search results.

toddrob99 commented 2 months ago

I am considering normalizing the accented and diacritical characters using a method I found here. This seems to slow down the lookup process considerably, because it has to check and normalize every character in every field of every player record. I am going to think about this and do some more testing.

ap-ack commented 2 months ago

If that is not performant enough how about adding an attribute to the player object like 'normalizedFullName' so the result would be pre-computed? (edit: or does that object come from the underlying API?)

toddrob99 commented 2 months ago

I realized last night that the API call is only requesting specific fields, so I checked if there's another field that includes the ascii version of the name, and there is one: 'nameSlug': 'ronald-acuna-jr-660670'. I'll add that field to the API call in the next version, and the issue will be resolved (because the current code checks all included fields for the keyword).

Full record, for reference:

{'id': 660670, 'fullName': 'Ronald Acuña Jr.', 'link': '/api/v1/people/660670', 'firstName': 'Ronald', 'lastName': 'Acuña', 'primaryNumber': '13', 'birthDate': '1997-12-18', 'currentAge': 26, 'birthCity': 'La Guaira', 'birthCountry': 'Venezuela', 'height': '6\' 0"', 'weight': 205, 'active': True, 'currentTeam': {'id': 144, 'link': '/api/v1/teams/144'}, 'primaryPosition': {'code': '9', 'name': 'Outfielder', 'type': 'Outfielder', 'abbreviation': 'RF'}, 'useName': 'Ronald', 'useLastName': 'Acuña Jr.', 'middleName': 'Jose', 'boxscoreName': 'Acuña Jr.', 'nickName': 'El De La Sabana', 'gender': 'M', 'nameMatrilineal': 'Blanco', 'isPlayer': True, 'isVerified': True, 'pronunciation': 'ah-cuhn-YA', 'mlbDebutDate': '2018-04-25', 'batSide': {'code': 'R', 'description': 'Right'}, 'pitchHand': {'code': 'R', 'description': 'Right'}, 'nameFirstLast': 'Ronald Acuña Jr.', 'nameTitle': 'Jr.', 'nameSuffix': 'Jr.', 'nameSlug': 'ronald-acuna-jr-660670', 'firstLastName': 'Ronald Acuña Jr.', 'lastFirstName': 'Acuña Jr., Ronald', 'lastInitName': 'Acuña Jr., R', 'initLastName': 'R Acuña Jr.', 'fullFMLName': 'Ronald Jose Acuña Jr.', 'fullLFMName': 'Acuña Jr., Ronald Jose', 'strikeZoneTop': 3.49, 'strikeZoneBottom': 1.58}

toddrob99 commented 2 months ago

I forgot to mention this issue in the commit/PR but this is fixed in v1.7.2

ap-ack commented 2 months ago

Great, thanks. 'Acuna' now works but 'Ronald Acuna' still doesn't. Would it make sense to split the input and search for the individual words?

ap-ack commented 2 months ago

If you agree on splitting the search query, want me to put in a PR? Would you also want a parameter for the function allowing the caller to choose what should qualify as a match? Options could be all words, any words, or full string.

The "match all words" case would be just about as performant as current, since in most cases the first word will not match and we will move on, but the "match any words" case would require more loops linearly with the number of words (again assuming most players are not matches).