sfbrigade / data-covid19-sfbayarea

Manual and automated processes of sourcing data for the stop-covid19-sfbayarea project
MIT License

San Mateo news titles are prefixed with a date #134

Closed Mr0grog closed 4 years ago

Mr0grog commented 4 years ago

San Mateo’s news items have their titles prefixed with a date, which is confusing and redundant, since we already display the date before the title for every news item:

[Screenshot: Screen Shot 2020-09-24 at 10 00 37 PM]

We currently have some code that used to remove these:

https://github.com/sfbrigade/data-covid19-sfbayarea/blob/238ef6d05ce45f2c7f08a3c39e716f12071cda57/covid19_sfbayarea/news/san_mateo.py#L19-L25

and:

https://github.com/sfbrigade/data-covid19-sfbayarea/blob/238ef6d05ce45f2c7f08a3c39e716f12071cda57/covid19_sfbayarea/news/san_mateo.py#L96-L97

but the format must have changed slightly, so that regular expression no longer matches.

If you run python scraper_news.py san_mateo, you now get entries like:

    {
      "id": "12701 at https://cmo.smcgov.org",
      "url": "https://cmo.smcgov.org/press-release/sept-24-2020-safely-join-vote-center-team-new-health-officer-statement-testing",
      "title": "Sept. 24, 2020: Safely Join the Vote Center Team; New Health Officer Statement; Testing Opportunities: Update on County Response to COVID-19",
      "date_published": "2020-09-24T20:44:03Z"
    }
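
As a rough illustration of the kind of pattern that would match the English prefix shown above, here is a minimal sketch. It is not the expression currently in san_mateo.py; the names ENGLISH_DATE_PREFIX and strip_english_date_prefix are hypothetical, and the pattern assumes the prefix always looks like "Sept. 24, 2020: " (month name or abbreviation, day, year, then a colon or hyphen).

```python
import re

# Hypothetical pattern, written for illustration only. It assumes the prefix
# is "<month> <day>, <year>" followed by a colon or hyphen delimiter.
ENGLISH_DATE_PREFIX = re.compile(
    r"""
    ^\s*
    (?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\.?  # month, e.g. "Sept."
    \s+\d{1,2},?      # day of month with optional comma
    \s+\d{4}          # four-digit year
    \s*[:\-]\s*       # delimiter between the date and the real title
    """,
    re.IGNORECASE | re.VERBOSE,
)


def strip_english_date_prefix(title: str) -> str:
    """Return the title without a leading English date, if one is present."""
    return ENGLISH_DATE_PREFIX.sub("", title, count=1)


if __name__ == "__main__":
    print(strip_english_date_prefix(
        "Sept. 24, 2020: Safely Join the Vote Center Team"
    ))
```

Titles that do not start with a date pass through unchanged, so the function should be safe to apply to every item.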

Mr0grog commented 4 years ago

Also, it looks like Spanish entries are getting the date prefix in a slightly different format:

    {
      "id": "12656 at https://cmo.smcgov.org",
      "url": "https://cmo.smcgov.org/press-release/17-de-septiembre-de-2020-%C2%BFqu%C3%A9-es-necesario-para-pasar-de-morado-rojo",
      "title": "17 de septiembre de 2020: \u00bfQu\u00e9 es necesario para pasar de morado a rojo?",
      "date_published": "2020-09-19T16:14:15Z"
    }
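
The Spanish prefix could be handled with a separate pattern. The sketch below assumes the format is always "<day> de <month> de <year>: " as in the entry above; SPANISH_MONTHS, SPANISH_DATE_PREFIX, and strip_spanish_date_prefix are hypothetical names, not existing code in the scraper.

```python
import re

# Spanish month names; "sep?tiembre" also accepts the variant "setiembre".
SPANISH_MONTHS = (
    "enero|febrero|marzo|abril|mayo|junio|julio|agosto|"
    "sep?tiembre|octubre|noviembre|diciembre"
)

# Hypothetical pattern: "<day> de <month> de <year>" followed by a colon.
SPANISH_DATE_PREFIX = re.compile(
    rf"^\s*\d{{1,2}}\s+de\s+(?:{SPANISH_MONTHS})\s+de\s+\d{{4}}\s*:\s*",
    re.IGNORECASE,
)


def strip_spanish_date_prefix(title: str) -> str:
    """Return the title without a leading Spanish date, if one is present."""
    return SPANISH_DATE_PREFIX.sub("", title, count=1)


if __name__ == "__main__":
    print(strip_spanish_date_prefix(
        "17 de septiembre de 2020: \u00bfQu\u00e9 es necesario para pasar de morado a rojo?"
    ))
```

Keeping the Spanish pattern separate from the English one makes each expression easier to read, at the cost of running two substitutions per title.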

Mr0grog commented 4 years ago

Update: this is also happening with the summaries (most San Mateo news items don’t have summaries, but some do). In this case, though, there’s no delimiter between the date and the text. See this PR created by the San Mateo news scraper for an example: https://github.com/sfbrigade/stop-covid19-sfbayarea/pull/424#discussion_r495386129

    {
      "id": "12711 at https://cmo.smcgov.org",
      "url": "https://cmo.smcgov.org/press-release/sept-25-2020-health-officer-order-prohibits-removal-fire-debris-burn-sites-pending",
      "title": "Sept. 25, 2020: Health Officer Order Prohibits Removal of Fire Debris from Burn Sites Pending Development of State and Federal Process",
      "date_published": "2020-09-25T22:05:08Z",
      "summary": "Sept. 25, 2020San Mateo County Health Officer Dr. Scott Morrow has issued a health order prohibiting the unsafe removal, transport, and disposal of fire debris and other hazardous materials from structures burned during the CZU August Lightning Complex Fires without written permission from Environmental Health Services."
    }
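
Since the summary has no delimiter after the date, one option is to make the delimiter optional and apply the same cleanup to both fields. This is only a sketch under that assumption; DATE_PREFIX_OPTIONAL_DELIMITER and clean_news_item are hypothetical names, and the item is treated as a plain dict with the keys shown above.

```python
import re

# Hypothetical pattern: an English date whose trailing delimiter is optional,
# so it matches both "Sept. 25, 2020: Title" and "Sept. 25, 2020Summary".
# The (?!\d) lookahead keeps the year from matching part of a longer number.
DATE_PREFIX_OPTIONAL_DELIMITER = re.compile(
    r"^\s*(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\.?"
    r"\s+\d{1,2},?\s+\d{4}(?!\d)\s*[:\-]?\s*",
    re.IGNORECASE,
)


def clean_news_item(item: dict) -> dict:
    """Return a copy of the item with date prefixes removed from text fields."""
    cleaned = dict(item)
    for field in ("title", "summary"):
        value = cleaned.get(field)
        if value:
            cleaned[field] = DATE_PREFIX_OPTIONAL_DELIMITER.sub("", value, count=1)
    return cleaned


if __name__ == "__main__":
    example = {
        "title": "Sept. 25, 2020: Health Officer Order",
        "summary": "Sept. 25, 2020San Mateo County Health Officer Dr. Scott Morrow has issued a health order.",
    }
    print(clean_news_item(example))
```

One caveat with the optional delimiter: a summary that legitimately begins with a date as part of its first sentence would also lose that date, so this may need tightening if such summaries turn up.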