uchicago-capp-30320 / CivicLens

Putting the public back in public commenting
https://civic-lens.org
GNU Affero General Public License v3.0

clean html text #286

Open andrewjtdunn opened 5 months ago

andrewjtdunn commented 5 months ago

We sometimes have HTML tags and entities in our text fields. Rather than writing specific regular expressions, perhaps there is a package that does this for us? The issue appears in comments and in summaries from the Federal Register.

andrewjtdunn commented 5 months ago

initial googling:

https://stackoverflow.com/questions/9662346/python-code-to-remove-html-tags-from-a-string

https://lxml.de/elementsoup.html

jgibson517 commented 5 months ago

Django has a strip_tags function that removes tags like <br> - https://docs.djangoproject.com/en/5.0/ref/utils/#django.utils.html.strip_tags

And Python has html.unescape, which converts HTML entities (such as &amp;amp;) back into the characters they represent: https://docs.python.org/3/library/html.html
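A minimal sketch of how the two steps could be combined, using only the standard library so it runs without Django. The `_TagStripper` class and the `clean_html_text` helper are hypothetical names, not existing CivicLens code, and the whitespace collapsing at the end is an assumption about what we want for display:

```python
from html import unescape
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Hypothetical helper: keeps text and entities, drops the tags."""

    def __init__(self):
        # convert_charrefs=False leaves entities untouched so that
        # html.unescape can decode them in a separate, explicit step.
        super().__init__(convert_charrefs=False)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def handle_entityref(self, name):
        # Re-emit named entities such as &amp; unchanged.
        self._chunks.append(f"&{name};")

    def handle_charref(self, name):
        # Re-emit numeric entities such as &#39; unchanged.
        self._chunks.append(f"&#{name};")

    def get_text(self):
        return "".join(self._chunks)


def clean_html_text(raw):
    """Strip tags, decode entities, and collapse runs of whitespace."""
    stripper = _TagStripper()
    stripper.feed(raw)
    stripper.close()
    text = unescape(stripper.get_text())
    # Assumption: non-breaking spaces and newlines become single spaces.
    return " ".join(text.split())
```

If we are fine depending on Django here, `django.utils.html.strip_tags` could probably replace the stripper class, with `html.unescape` applied to its result. Note that neither approach inserts a space where a tag such as `<br>` separated two words, so that may need separate handling.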