pgh-public-meetings / city-scrapers-pitt

Pittsburgh City Scrapers: sourcing public meetings in Pittsburgh
MIT License

Strip HTML from pgh_public_schools descriptions #195

Open ben-nathanson opened 3 years ago

ben-nathanson commented 3 years ago

Our pgh_public_schools spider is leaving some HTML in the event description. It would be nice to clean this up by stripping out the HTML. Here are some examples:

        "title": "Education and Student Performance Committee Meeting",
        "description": "<p>Conference Room A</p>\n<p>&nbsp;</p>\n<p><a href=\"\" target=\"_blank\" rel=\"noopener noreferrer\">Meeting Agenda</a></p>",
        "title": "Legislative Session",
        "description": "<p>Board Room</p>",
        "title": "Public Hearing",
        "description": "<p>Public Hearing &ndash; Monday, December 14, 2020&nbsp; -&nbsp; <a href=\"\"></a></p>",

In other spiders, we have had some success with this code snippet from Stack Overflow. We might want to make it a reusable function somewhere in the project:

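from io import StringIO
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Initialize the base parser before setting our own attributes.
        super().__init__()
        self.strict = False
        # Decode entities such as &nbsp; and &ndash; into characters.
        self.convert_charrefs = True
        self.text = StringIO()

    def handle_data(self, d):
        # Called for text between tags; the tags themselves are dropped.
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    # close() flushes any trailing text the parser is still buffering.
    s.close()
    return s.get_data()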
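
As a rough sketch of what a reusable helper could look like, the version below follows the same HTMLParser approach and also collapses the leftover whitespace (newlines between paragraphs and the non-breaking spaces that `&nbsp;` decodes to). The names `_TextExtractor` and `clean_description` are hypothetical, not existing project code, and where the helper should live in the project is still an open question.

```python
import re
from html.parser import HTMLParser
from io import StringIO


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects only the text content of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities (&nbsp;, &ndash;) into characters.
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, data):
        self.text.write(data)


def clean_description(raw_html):
    """Strip tags from raw_html and collapse runs of whitespace to one space."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()  # flush any text still buffered by the parser
    # \s matches Unicode whitespace, including the non-breaking space from &nbsp;
    return re.sub(r"\s+", " ", parser.text.getvalue()).strip()


if __name__ == "__main__":
    sample = "<p>Conference Room A</p>\n<p>&nbsp;</p>\n<p><a href=\"\">Meeting Agenda</a></p>"
    print(clean_description(sample))
```

Note that this only removes markup. Link text such as "Meeting Agenda" stays in the description while its URL is dropped, and stray punctuation (like the trailing " -" in the Public Hearing example) is left as-is, so the spider may still want to handle those cases separately.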