sfbrigade / data-covid19-sfbayarea

Manual and automated processes of sourcing data for the stop-covid19-sfbayarea project
MIT License

Marin County only shows data up to Feb 11 #203

Closed frhino closed 3 years ago

frhino commented 3 years ago

Describe the bug

This appears to be a pagination issue: the data exists upstream, but we are not reading all of it into our data file.

To Reproduce

Steps to reproduce the behavior:

  1. Go to https://panda.baybrigades.org/main
  2. Click on statistics
  3. Scroll down to county selector
  4. Choose Marin
  5. See data goes to Feb 11 and stops
  6. Go to https://coronavirus.marinhhs.org/surveillance
  7. Verify data goes to April 25 (as of April 28)

Mr0grog commented 3 years ago

Per discussion earlier this evening and in Slack, it looks like the problem is that the Marin module is not properly paging through the full set of data. It uses data.utils.SocrataApi.resource(), which is somewhat simplistic and just makes a single HTTP request, without trying to determine if it needs to page through more results:

https://github.com/sfbrigade/data-covid19-sfbayarea/blob/40a9779552d21803faef5a31be2ad129d9a13c9f/covid19_sfbayarea/data/utils.py#L51-L52

Instead, it should act like our other, slightly more complex API clients and automatically page through the full result set.

For examples of how we do this elsewhere, see:

ArcGIS: https://github.com/sfbrigade/data-covid19-sfbayarea/blob/40a9779552d21803faef5a31be2ad129d9a13c9f/covid19_sfbayarea/data/arcgis.py#L88-L105

CKAN: https://github.com/sfbrigade/data-covid19-sfbayarea/blob/40a9779552d21803faef5a31be2ad129d9a13c9f/covid19_sfbayarea/data/ckan.py#L47-L70

Mr0grog commented 3 years ago

Socrata pagination docs: https://dev.socrata.com/docs/paging.html
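
To illustrate the approach from those docs, here is a minimal sketch of paging with the `$limit` and `$offset` query parameters, plus `$order` so rows come back in a stable order between requests. The names `paged_resource` and `fetch_page` are hypothetical and are not part of our `SocrataApi` class; `fetch_page` stands in for whatever makes the single HTTP request today. The sketch assumes a page shorter than `$limit` means the end of the data set.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

Row = Dict[str, Any]


def paged_resource(
    fetch_page: Callable[[Dict[str, Any]], List[Row]],
    page_size: int = 1000,
    params: Optional[Dict[str, Any]] = None,
) -> Iterator[Row]:
    """Yield every row of a Socrata resource, one page at a time.

    `fetch_page` (hypothetical) takes a dict of query parameters and returns
    the decoded JSON rows for one request, e.g. a thin wrapper around
    `requests.get(url, params=query).json()`.
    """
    base = dict(params or {})
    # A stable sort keeps rows from shifting between pages.
    base.setdefault("$order", ":id")
    offset = 0
    while True:
        query = {**base, "$limit": page_size, "$offset": offset}
        rows = fetch_page(query)
        yield from rows
        # Assumption: a short (or empty) page means there is nothing left.
        if len(rows) < page_size:
            break
        offset += page_size


if __name__ == "__main__":
    # Stand-in for the real API: 2,500 fake rows served in slices.
    fake_rows = [{"id": i} for i in range(2500)]

    def fake_fetch(query: Dict[str, Any]) -> List[Row]:
        start = query["$offset"]
        return fake_rows[start:start + query["$limit"]]

    print(len(list(paged_resource(fake_fetch))))
```

Passing the request function in as an argument keeps the paging loop testable without network access; in the real client the same loop could live inside `resource()` itself.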

benghancock commented 3 years ago

It looks like there's a Python package available for interacting with the SODA API, called sodapy, which has a get_all() method to handle the pagination for a given data set. Thoughts on the pros and cons of using that, as opposed to our home-grown wrapper?

Mr0grog commented 3 years ago

As a short-term fix, I think it’s good to update our minimal client like you’re doing in #207, since that means less of the codebase has to change to fit sodapy’s API. Switching to a more robust and maintained package would probably be a good follow-on, though!