sfbrigade / data-covid19-sfbayarea

Manual and automated processes of sourcing data for the stop-covid19-sfbayarea project
MIT License

HOTFIX: Allow special spaces in more Contra Costa strings #138

Closed. Mr0grog closed this 4 years ago

Mr0grog commented 4 years ago

In this recent news update, you’ll notice that all the Contra Costa news from this month just disappeared: https://github.com/sfbrigade/stop-covid19-sfbayarea/pull/442/files

We already had some relatively complex space matching in parts of the Contra Costa regular expressions, but the county has now dropped other complex space characters into new places, which is causing the scraper to miss entire months' worth of news. To handle it, I’ve basically just made our heading and title matchers accept the more complex space character set anywhere they were previously using \s. 😩
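For readers unfamiliar with the problem, here is a minimal sketch of the idea, not the actual code from this PR. It assumes the troublesome characters are zero-width ones that Python's \s does not cover; the character set, the SPACE constant, the heading format, and parse_heading are all hypothetical names chosen for illustration.

```python
import re

# Assumed character set: Python's \s plus a few zero-width characters that
# \s does not match. The set used by the real scraper may differ.
SPACE = r'[\s\u200b\u200c\u200d\u2060\ufeff]'

# Hypothetical heading format: a month name followed by a four-digit year.
STRICT_HEADING = re.compile(r'^(\w+)\s+(\d{4})$')
LENIENT_HEADING = re.compile(rf'^{SPACE}*(\w+){SPACE}+(\d{{4}}){SPACE}*$')


def parse_heading(text):
    """Return (month, year) if the text looks like a month heading, else None."""
    match = LENIENT_HEADING.match(text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


# A heading with a zero-width space hiding before the normal space.
tricky = 'September\u200b 2020'
print(STRICT_HEADING.match(tricky))  # None: \s does not match U+200B
print(parse_heading(tricky))         # ('September', 2020)
```

A shared character class like this keeps every matcher consistent, so a new odd character only needs to be added in one place.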

You can test this by running:

# If you use the shell script:
$ ./run_scraper_news.sh contra_costa

# Or if you manage virtualenvs manually:
$ python scraper_news.py contra_costa

Then make sure there are news items for September and that the titles of most news items look correct.

Mr0grog commented 4 years ago

> it doesn't appear as though we're grabbing anything earlier than Aug. 31, but I haven't done enough testing yet to figure out why that is. Is that by design?

Yep! You can adjust that with the --from option on the command line: https://github.com/sfbrigade/data-covid19-sfbayarea/blob/cc89fe4b9bb346db10134ed09a6fe9e7cc88e778/scraper_news.py#L66-L69

We added that after front-end folks got concerned about there being way too much no-longer-relevant news from some counties.
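As a rough illustration of how a cutoff option like this can work, here is a sketch using argparse. It is not the code at the link above: the real script may use a different CLI library, and the 14-day default window, build_parser, and filter_news are assumptions made up for this example.

```python
import argparse
from datetime import date, timedelta

# Assumed default window; the real default lives in scraper_news.py.
DEFAULT_WINDOW_DAYS = 14


def build_parser(today):
    """Build a hypothetical CLI with a county argument and a --from cutoff."""
    parser = argparse.ArgumentParser(description='Hypothetical news scraper CLI')
    parser.add_argument('county')
    parser.add_argument(
        '--from',
        dest='from_date',  # "from" is a Python keyword, so store it under another name
        type=date.fromisoformat,
        default=today - timedelta(days=DEFAULT_WINDOW_DAYS),
        help='Only keep news published on or after this date (YYYY-MM-DD).',
    )
    return parser


def filter_news(items, from_date):
    """Drop news items published before the cutoff date."""
    return [item for item in items if item['date'] >= from_date]


parser = build_parser(today=date(2020, 9, 14))
args = parser.parse_args(['contra_costa', '--from', '2020-08-01'])
news = [
    {'title': 'Older update', 'date': date(2020, 7, 20)},
    {'title': 'Newer update', 'date': date(2020, 9, 2)},
]
print([item['title'] for item in filter_news(news, args.from_date)])
```

With a default cutoff, older items are filtered out unless the caller explicitly asks for them, which matches the behavior described above.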

Mr0grog commented 4 years ago

Thanks for the speedy review! 🙇

benghancock commented 4 years ago

> Yep! You can adjust that with the --from option on the command line:

D'oh! Of course. I didn't spot the default value in there. Makes good sense.

benghancock commented 4 years ago

> Thanks for the speedy review! 🙇

For sure! 👍 Glad to be able to return the favor!