sfbrigade / data-covid19-sfbayarea

Manual and automated processes of sourcing data for the stop-covid19-sfbayarea project
MIT License

Santa Clara data is failing with dates from 1920/1921 #194

Closed Mr0grog closed 3 years ago

Mr0grog commented 3 years ago

The Santa Clara County data scraper is currently failing because the county’s dataset has a date from 1921:

Santa Clara county failed: Unknown date format: "1921-02-10T00:00:00.000"
Traceback (most recent call last):
  File "/home/runner/work/stop-covid19-sfbayarea/stop-covid19-sfbayarea/scraper/scraper_data.py", line 30, in main
    out[county] = data_scrapers.scrapers[county].get_county()
  File "/home/runner/work/stop-covid19-sfbayarea/stop-covid19-sfbayarea/scraper/covid19_sfbayarea/data/santa_clara.py", line 64, in get_county
    'deaths': get_timeseries_deaths(api),
  File "/home/runner/work/stop-covid19-sfbayarea/stop-covid19-sfbayarea/scraper/covid19_sfbayarea/data/santa_clara.py", line 95, in get_timeseries_deaths
    return [
  File "/home/runner/work/stop-covid19-sfbayarea/stop-covid19-sfbayarea/scraper/covid19_sfbayarea/data/santa_clara.py", line 97, in <listcomp>
    'date': parse_datetime(entry['date']).date().isoformat(),
  File "/home/runner/work/stop-covid19-sfbayarea/stop-covid19-sfbayarea/scraper/covid19_sfbayarea/utils.py", line 48, in parse_datetime
    raise ValueError(f'Unknown date format: "{date_string}"')
ValueError: Unknown date format: "1921-02-10T00:00:00.000"

We’ve seen similar issues before with dates from 1920 as well. They usually get resolved by the county, but it can take several days or more for that to happen, and it’s stupid for the scraper to be broken for that whole time.

I think it’s relatively safe to assume someone accidentally entered a 2-digit year in a form where a 4-digit one was required, so we can treat 1920 as 2020 and 1921 as 2021.

Mr0grog commented 3 years ago

Aaaaaaaand now that I finally filed this issue after it had been failing for a week, they seem to have fixed it. 🙄

This is still worth addressing, though: we’ve seen it before and will probably see it again. The best place to handle it is probably utils.parse_datetime(): https://github.com/sfbrigade/data-covid19-sfbayarea/blob/24269d83237bdbae9c632e71365b79426f6a1e02/covid19_sfbayarea/utils.py#L31-L50
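A rough sketch of the kind of correction this could be, not the actual contents of utils.parse_datetime(). The names fix_century, parse_datetime_lenient, and EARLIEST_VALID_YEAR are hypothetical, and the sketch assumes the input is an ISO 8601 string like the one in the traceback. It only shifts a date forward by 100 years when the result lands between 2020 and the current year, so real historical dates are left alone.

```python
from datetime import datetime
from typing import Optional

# Assumption: no valid record in this dataset predates 2020, so a year such
# as 1921 is treated as a mistyped 2-digit year ("21" expanded to 1921).
EARLIEST_VALID_YEAR = 2020


def fix_century(value: datetime, latest_year: Optional[int] = None) -> datetime:
    """Shift a date forward 100 years if that puts it in the valid range.

    Hypothetical helper, not part of the existing codebase.
    """
    if latest_year is None:
        latest_year = datetime.now().year
    corrected_year = value.year + 100
    if EARLIEST_VALID_YEAR <= corrected_year <= latest_year:
        return value.replace(year=corrected_year)
    # Anything else is returned unchanged so the caller can still reject it.
    return value


def parse_datetime_lenient(date_string: str) -> datetime:
    """Parse an ISO 8601 string, then apply the century correction.

    Hypothetical stand-in for the parsing done in utils.parse_datetime().
    """
    return fix_century(datetime.fromisoformat(date_string))
```

With this, parse_datetime_lenient("1921-02-10T00:00:00.000").date().isoformat() gives "2021-02-10", while a date that is already in 2020 or later passes through untouched.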