Closed Mr0grog closed 4 years ago
it doesn't appear as though we're grabbing anything earlier than Aug. 31, but I haven't done enough testing yet to figure out why that is. Is that by design?
Yep! You can adjust that using the `--from` option on the command line: https://github.com/sfbrigade/data-covid19-sfbayarea/blob/cc89fe4b9bb346db10134ed09a6fe9e7cc88e778/scraper_news.py#L66-L69
We added that after front-end folks got concerned about there being way too much no-longer-relevant news from some counties.
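For anyone skimming, here is a rough sketch of how a date cutoff option like this can work. It is not the code at the link above: the use of `argparse`, the `parse_args` and `filter_news` names, the item shape, and the default date are all assumptions made for illustration.

```python
import argparse
from datetime import date


def parse_args(argv=None):
    """Parse a hypothetical scraper command line with a --from cutoff."""
    parser = argparse.ArgumentParser(description="Hypothetical news scraper CLI")
    parser.add_argument(
        "--from",
        dest="from_date",  # `from` is a Python keyword, so store it under another name
        type=date.fromisoformat,
        default=date(2020, 8, 31),  # placeholder default, not the scraper's real value
        help="Only keep news published on or after this date (YYYY-MM-DD)",
    )
    return parser.parse_args(argv)


def filter_news(items, from_date):
    """Keep only the items dated on or after the cutoff."""
    return [item for item in items if item["date"] >= from_date]


if __name__ == "__main__":
    args = parse_args()
    sample = [
        {"title": "Older item", "date": date(2020, 7, 15)},
        {"title": "Newer item", "date": date(2020, 9, 2)},
    ]
    for item in filter_news(sample, args.from_date):
        print(item["date"].isoformat(), item["title"])
```

Passing an earlier date to `--from` widens the window, which is how you would pull in older news when testing.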
Thanks for the speedy review! 🙇
Yep! You can adjust using the `--from` option on the command line:
D'oh! Of course. I didn't spot the default value in there. Makes good sense.
Thanks for the speedy review! 🙇
For sure! :+1: Glad to be able to return the favor!
In this recent news update, you’ll notice that all the Contra Costa news from this month just disappeared: https://github.com/sfbrigade/stop-covid19-sfbayarea/pull/442/files
We already had some relatively complex space matching in some parts of the Contra Costa regular expressions, but now they have dropped other complex space characters into new places, which is causing the scraper to miss entire months' worth of news. To handle it, I’ve basically just made our heading and title matchers accept the more complex space character set anywhere they were previously using `\s`. 😩
You can test this by running:
And make sure there are news items for September, and that the titles of most news items generally seem correct.
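To illustrate the space-matching problem described above, here is a minimal sketch. It is not the scraper's actual code: the `SPACE` character class, the `heading_pattern` helper, and the sample heading are assumptions. The general point is that Python's `\s` matches Unicode whitespace such as the non-breaking space, but not zero-width characters such as U+200B, so a heading containing one of those can slip past a matcher built on `\s` alone.

```python
import re

# Hypothetical "complex space" class: everything \s matches, plus a few
# zero-width characters that \s does not cover.
SPACE = r"[\s\u200b\u200c\u200d\u2060\ufeff]"


def heading_pattern(space: str) -> re.Pattern:
    """Build a 'Month YYYY' heading matcher using the given space pattern."""
    return re.compile(rf"^(?P<month>[A-Z][a-z]+){space}+(?P<year>\d{{4}})$")


strict = heading_pattern(r"\s")
lenient = heading_pattern(SPACE)

# A heading with a zero-width space followed by a non-breaking space.
heading = "September\u200b\u00a02020"

print(strict.match(heading))                  # no match: \s rejects U+200B
print(lenient.match(heading).group("month"))  # the wider class accepts it
```

Swapping the wider class in wherever `\s` was used keeps ordinary headings matching while also tolerating the odd characters.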