pgh-public-meetings / city-scrapers-pitt

Pittsburgh City Scrapers: sourcing public meetings in Pittsburgh
https://pgh-public-meetings.github.io/events/
MIT License

Strip HTML from pgh_public_schools descriptions #195

Open ben-nathanson opened 3 years ago

ben-nathanson commented 3 years ago

Our pgh_public_schools spider is leaving some HTML in the event description. It would be nice to clean this up by stripping out the HTML. Here are some examples:

[
    {
        "title": "Education and Student Performance Committee Meeting",
        "description": "<p>Conference Room A</p>\n<p>&nbsp;</p>\n<p><a href=\"https://www.pghschools.org/Page/1305\" target=\"_blank\" rel=\"noopener noreferrer\">Meeting Agenda</a></p>",
    ...
    },
    {
        "title": "Legislative Session",
        "description": "<p>Board Room</p>",
    ...
    },
    {
        "title": "Public Hearing",
        "description": "<p>Public Hearing &ndash; Monday, December 14, 2020&nbsp; -&nbsp; <a href=\"https://livestream.com/accounts/7031315/events/9360881\">https://livestream.com/accounts/7031315/events/9360881</a></p>",
    ...
    },
]

In other spiders, we have had some success with this code snippet from Stack Overflow. We might want to make it a reusable function somewhere in the project:

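from io import StringIO
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Collect the text content of an HTML fragment, discarding the tags."""

    def __init__(self):
        # HTMLParser.__init__ already calls reset(), and the old `strict`
        # attribute has no effect on current Python 3, so both are omitted.
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, d):
        # Called for each run of text between tags.
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()


def strip_tags(html):
    """Return `html` with tags removed and character references decoded."""
    s = MLStripper()
    s.feed(html)
    s.close()  # flush text the parser may still be buffering, e.g. "AT&T"
    return s.get_data()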
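
One thing to watch for with the examples above: stripping tags alone still leaves the newlines between paragraphs and the non-breaking spaces that `&nbsp;` decodes to. Below is a rough sketch of what a reusable helper could look like if it also tidied whitespace. The names `_TextExtractor` and `clean_description` are made up for illustration and do not exist in the project, and where such a helper would live is left open.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards tags (hypothetical helper name)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def clean_description(html):
    """Return a description as plain text with collapsed whitespace.

    Hypothetical helper: not part of city-scrapers-pitt.
    """
    parser = _TextExtractor()
    parser.feed(html or "")
    parser.close()  # flush any text the parser is still buffering
    # Join with a space so adjacent paragraphs do not run together.
    text = " ".join(parser.parts)
    # For str patterns, \s also matches the non-breaking space from &nbsp;.
    return re.sub(r"\s+", " ", text).strip()


print(clean_description("<p>Board Room</p>"))  # prints: Board Room
```

Joining the pieces with a space is a trade-off: it keeps `<p>A</p><p>B</p>` from becoming `AB`, but it would also split a word that has inline markup in the middle of it. Note too that link targets are dropped, so the "Meeting Agenda" link in the first example would survive only as its link text.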