pgh-public-meetings / city-scrapers-pitt

Pittsburgh City Scrapers: sourcing public meetings in Pittsburgh
https://pgh-public-meetings.github.io/events/
MIT License

Common string cleaning utilities and an example of their use #204

Closed ben-nathanson closed 3 years ago

ben-nathanson commented 3 years ago

Most spiders in this project parse HTML strings and convert them into something more human-readable. The conversion step has been a persistent pain point: it's time-consuming, we consistently get it wrong, and there are many implementations of the same thing throughout our codebase 😖.

This new common code should allow us to refactor those duplicate implementations. Once a contributor drills down to a single element, e.g. <p>January 21, 2021\xa0</p>, this cleaning function should do the job 9 times out of 10.
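For readers who haven't opened the diff, here is a rough sketch of the kind of cleaning step described above. It is not the code from this PR: the name `clean_html_string` and the regex-based approach are assumptions made for illustration, using only the Python standard library.

```python
import html
import re

# Hypothetical helper for illustration; the actual utility in this PR may
# differ in name, signature, and behavior.
_TAG_RE = re.compile(r"<[^>]+>")
_WHITESPACE_RE = re.compile(r"\s+")


def clean_html_string(raw: str) -> str:
    """Turn a small HTML fragment into plain, human-readable text."""
    # Replace tags with a space so adjacent words don't get glued together.
    text = _TAG_RE.sub(" ", raw)
    # Decode entities such as &amp; and &nbsp;.
    text = html.unescape(text)
    # Collapse runs of whitespace into one space. For str patterns, \s also
    # matches the non-breaking space (\xa0).
    text = _WHITESPACE_RE.sub(" ", text)
    return text.strip()


print(clean_html_string("<p>January 21, 2021\xa0</p>"))
```

A regex is only reasonable here because the input is assumed to be a single, already-isolated element; for anything larger, the spider's own selectors are the safer tool.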

Finally, we implement an example of the first common string cleaning function by refactoring alle_improvements.