jldbc / pybaseball

Pull current and historical baseball statistics using Python (Statcast, Baseball Reference, FanGraphs)
MIT License

batting_stats_range breaks when parsing through the date '2021-06-25' #218

Closed dk-sa1 closed 3 years ago

dk-sa1 commented 3 years ago

When using the batting_stats_range function, there is an issue when parsing through 6/25/2021.

When it breaks, you receive corrupt data. For example, the player José Abreu appears as "JosÃ© Abreu", and only a couple of rows of data come back (as opposed to several hundred for a typical day of data).

Below are some code blocks that work and do not work.

Works:

data = batting_stats_range("2021-06-25", "2021-06-27")
data = batting_stats_range("2021-06-25", "2021-06-25")

Does NOT Work:

data = batting_stats_range("2021-06-24", "2021-06-27")
data = batting_stats_range("2021-06-24", "2021-06-25")

For some reason, you can start on 6/25 with no issues, but you cannot span 6/25 or end on it without receiving corrupt data.
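The garbled player name described above is what you would see if UTF-8 bytes were decoded as Windows-1252. A minimal sketch using only the standard library (it does not call pybaseball, and the variable names are illustrative):

```python
# UTF-8 encodes "é" as two bytes, 0xC3 0xA9.
name = "José Abreu"
raw = name.encode("utf-8")

# Decoding those bytes as Windows-1252 maps each byte to its own character,
# so the single "é" becomes the two characters "Ã" and "©".
garbled = raw.decode("cp1252")
print(garbled)  # JosÃ© Abreu
```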

dk-sa1 commented 3 years ago

The issue arises for the same date in 2019.

bdilday commented 3 years ago

it looks like in the cases where the data gets truncated, Beautiful Soup can't decode the page as utf-8, so it falls back on a different encoding, e.g.,

>>> from pybaseball.league_batting_stats import batting_stats_range, get_soup
>>> start_dt = end_dt = "2021-05-01"
>>> data = batting_stats_range(start_dt, end_dt)
>>> len(data)
334
>>> soup = get_soup(start_dt, end_dt)
>>> soup.original_encoding
'utf-8'
>>> 
>>> start_dt = end_dt = "2021-05-02"
>>> data = batting_stats_range(start_dt, end_dt)
>>> len(data)
10
>>> soup = get_soup(start_dt, end_dt)
>>> soup.original_encoding
'Windows-1252'
>>> 

this seems to happen when the page header or footer includes a link to https://fbref.com/es or https://fbref.de, because those links include the characters ú and ß (in Fútbol and Fußball). so long story short, this looks like inconsistent encoding between the header / footer and the main part of the page
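A rough illustration of that hypothesis, using only the standard library. The page bytes here are made up, and it is an assumption that the header text arrives as Latin-1 while the body is UTF-8; this is not necessarily how the linked fix handles it. It shows why strict UTF-8 decoding fails, why a Windows-1252 fallback garbles the body, and one possible workaround (forcing UTF-8 and replacing the stray bytes before handing the text to a parser):

```python
# Hypothetical page: header link encoded as Latin-1, body encoded as UTF-8.
header = "<a href='https://fbref.com/es'>Fútbol</a>".encode("latin-1")
body = "<td>José Abreu</td>".encode("utf-8")
page = header + body

# Strict UTF-8 decoding fails on the lone 0xFA byte ("ú" in Latin-1).
try:
    page.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A Windows-1252 fallback decodes without error but garbles the UTF-8 body.
fallback = page.decode("cp1252")

# Forcing UTF-8 and replacing undecodable bytes keeps the body intact;
# only the mis-encoded header character is lost.
forced = page.decode("utf-8", errors="replace")
```

With this approach the header link text is damaged (the ú becomes a replacement character), which is usually acceptable since only the stats table is needed.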

because it doesn't depend on the data, the date on which it happens isn't reproducible either
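Since the failing dates can't be predicted, a caller could check the result itself rather than the date. The helpers below are hypothetical (they are not part of pybaseball), and the 50-row threshold is an arbitrary assumption based on the 334 vs. 10 row counts shown above:

```python
def repair_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-Windows-1252 mix-up; return text unchanged otherwise."""
    try:
        # Re-encode to the bytes that were misread, then decode them as UTF-8.
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Correctly decoded text (or anything else) does not survive the
        # round trip, so leave it alone.
        return text


def looks_truncated(row_count: int, min_rows: int = 50) -> bool:
    """Flag results with far fewer rows than a typical day (threshold is a guess)."""
    return row_count < min_rows
```

For example, `repair_mojibake("JosÃ© Abreu")` gives back "José Abreu", while an already-correct name passes through unchanged.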

Closed by https://github.com/jldbc/pybaseball/pull/223