Closed Mr0grog closed 4 years ago
Also, it looks like Spanish entries are getting the date prefix in a slightly different format:
{
  "id": "12656 at https://cmo.smcgov.org",
  "url": "https://cmo.smcgov.org/press-release/17-de-septiembre-de-2020-%C2%BFqu%C3%A9-es-necesario-para-pasar-de-morado-rojo",
  "title": "17 de septiembre de 2020: \u00bfQu\u00e9 es necesario para pasar de morado a rojo?",
  "date_published": "2020-09-19T16:14:15Z"
}
Update: this is also happening with the summaries (most San Mateo news items don’t have summaries, but some do). In this case, though, there’s no delimiter between the date and the text. See this PR created by the San Mateo news scraper for an example: https://github.com/sfbrigade/stop-covid19-sfbayarea/pull/424#discussion_r495386129
{
  "id": "12711 at https://cmo.smcgov.org",
  "url": "https://cmo.smcgov.org/press-release/sept-25-2020-health-officer-order-prohibits-removal-fire-debris-burn-sites-pending",
  "title": "Sept. 25, 2020: Health Officer Order Prohibits Removal of Fire Debris from Burn Sites Pending Development of State and Federal Process",
  "date_published": "2020-09-25T22:05:08Z",
  "summary": "Sept. 25, 2020San Mateo County Health Officer Dr. Scott Morrow has issued a health order prohibiting the unsafe removal, transport, and disposal of fire debris and other hazardous materials from structures burned during the CZU August Lightning Complex Fires without written permission from Environmental Health Services."
}
San Mateo’s news items have their titles prefixed with a date, which is confusing and redundant since we already display the date before the title for news items anyway.
We currently have some code that is meant to remove these prefixes:
https://github.com/sfbrigade/data-covid19-sfbayarea/blob/238ef6d05ce45f2c7f08a3c39e716f12071cda57/covid19_sfbayarea/news/san_mateo.py#L19-L25
and:
https://github.com/sfbrigade/data-covid19-sfbayarea/blob/238ef6d05ce45f2c7f08a3c39e716f12071cda57/covid19_sfbayarea/news/san_mateo.py#L96-L97
but the format must have changed slightly, so that regular expression no longer matches.
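For discussion, here is a minimal sketch of a more tolerant pattern. It is not the code currently in san_mateo.py: the names `DATE_PREFIX` and `strip_date_prefix` are hypothetical, and the month lists are an assumption based only on the three formats reported in this issue (English title with a colon, Spanish title with a colon, and a summary with no delimiter after the date).

```python
import re

# Assumption: English month names or abbreviations, optionally followed by
# a period ("Sept.", "Sep", "September", ...).
ENGLISH_MONTH = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sept?|Oct|Nov|Dec)[a-z]*\.?'

# Assumption: full Spanish month names, as in "17 de septiembre de 2020".
SPANISH_MONTH = (
    r'(?:enero|febrero|marzo|abril|mayo|junio|julio|agosto|'
    r'septiembre|setiembre|octubre|noviembre|diciembre)'
)

DATE_PREFIX = re.compile(
    r'^\s*(?:'
    # English form, e.g. "Sept. 25, 2020"
    rf'{ENGLISH_MONTH}\s+\d{{1,2}},?\s+\d{{4}}'
    r'|'
    # Spanish form, e.g. "17 de septiembre de 2020"
    rf'\d{{1,2}}\s+de\s+{SPANISH_MONTH}\s+de\s+\d{{4}}'
    # The delimiter is optional because summaries have none.
    r')\s*[:\-]?\s*',
    re.IGNORECASE,
)


def strip_date_prefix(text: str) -> str:
    """Remove a leading date (and optional ':' or '-') from a title or summary."""
    return DATE_PREFIX.sub('', text, count=1)
```

Making the delimiter optional is what lets the same pattern cover both titles and summaries. The trade-off is that any text that legitimately begins with a date would also be trimmed, so it may be worth applying this only to the San Mateo feed.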
If you run
python scraper_news.py san_mateo
you now get entries like: