progval / Limnoria

A robust, full-featured, and user/programmer-friendly Python IRC bot, with many existing plugins.
https://docs.limnoria.net/

RSS feature request: sanitize/filter out HTML from `$description` #1539

Open Mikaela opened 1 year ago

Mikaela commented 1 year ago

The `$description` of many RSS feeds (e.g. GitHub, GitLab, crt.sh, Tor blog) contains HTML tags, which makes the announcements messy to read.

2023-W21-4 00:25:03 +0300 <@R-66Y> https://blog.torproject.org/new-alpha-release-tor-browser-125a6/ torproject: New Alpha Release: Tor Browser 12.5a6 (Android, Windows, macOS, Linux) <article class="blog-post"> <source media="(min-width:415px)" type="image/webp" /> <source type="image/webp" /> <img class="lead" src="https://blog.torproject.org/new-alpha-release-tor-browser-125a6/lead.png" /> <div class="body"><p>Tor Browser 12.5a6

I think Limnoria cleaning them up and sending only the user-visible text would greatly improve the readability, and thus the usability, of the plugin.
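As a rough illustration of the kind of cleanup being requested, here is a minimal sketch using only Python's standard library. The names `TextExtractor` and `visible_text` are made up for this example and are not part of Limnoria or its RSS plugin.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes, drops tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def visible_text(html_fragment):
    """Return the text content of an HTML fragment, whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html_fragment)
    parser.close()
    return " ".join("".join(parser.chunks).split())


sample = (
    '<img class="lead" src="lead.png" /> '
    '<div class="body"><p>Tor Browser 12.5a6 is now available</p></div>'
)
print(visible_text(sample))
```

Text from adjacent block elements (for example consecutive `<li>` items) is joined without a separator here, so a real implementation would probably want to insert spaces at block boundaries.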

Although it targets a different protocol with different capabilities, the Matrix bot Hookshot has this ability: https://github.com/matrix-org/matrix-hookshot/pull/738

Possibly related:

progval commented 1 year ago

I assume the example you submitted to https://github.com/matrix-org/matrix-hookshot/issues/732 is from https://bodhi.fedoraproject.org/rss/updates/ and the one you have here is from https://blog.torproject.org/feed.xml

In both of these feeds, the description does not contain HTML tags, but escaped HTML tags. For example, respectively:

<item><title>libphidget22-1.15.20230526-1.fc39</title><link>https://bodhi.fedoraproject.org/updates/FEDORA-2023-ffb20eb9af</link><description>&lt;h1&gt;FEDORA-2023-ffb20eb9af&lt;/h1&gt;
&lt;h2&gt;Packages in this update:&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;libphidget22-1.15.20230526-1.fc39&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Update description:&lt;/h2&gt;
&lt;p&gt;Automatic update for libphidget22-1.15.20230526-1.fc39.&lt;/p&gt;
&lt;h5&gt;&lt;strong&gt;Changelog&lt;/strong&gt;&lt;/h5&gt;
&lt;pre&gt;&lt;code&gt;* Mon May 29 2023 Richard Shaw &amp;lt;&lt;a href="mailto:hobbes1069@gmail.com"&gt;hobbes1069@gmail.com&lt;/a&gt;&amp;gt; - 1.15.20230526-1
- Update to 1.15.20230526.

&lt;/code&gt;&lt;/pre&gt;</description><pubDate>Mon, 29 May 2023 12:06:21 +0000</pubDate></item>

and

<entry><title>New Alpha Release: Tor Browser 12.5a6 (Android, Windows, macOS, Linux)</title><link href="https://blog.torproject.org/new-alpha-release-tor-browser-125a6/" rel="alternate"></link><updated>2023-05-24T00:00:00Z</updated><author><name>richard</name></author><id>urn:uuid:3d4a5097-1fc1-35ce-960d-7c29c6d28676</id><content type="html">&lt;article class="blog-post"&gt;
    &lt;picture&gt;
      &lt;source media="(min-width:415px)" srcset="https://blog.torproject.org/new-alpha-release-tor-browser-125a6/lead.webp" type="image/webp"&gt;
&lt;source srcset="https://blog.torproject.org/new-alpha-release-tor-browser-125a6/lead_small.webp" type="image/webp"&gt;

      &lt;img class="lead" referrerpolicy="no-referrer" loading="lazy" src="https://blog.torproject.org/new-alpha-release-tor-browser-125a6/lead.png"&gt;
    &lt;/picture&gt;
    &lt;div class="body"&gt;&lt;p&gt;Tor Browser 12.5a6 is now available from the &lt;a href="https://www.torproject.org/download/alpha/"&gt;Tor Browser download page&lt;/a&gt; and also from our &lt;a href="https://www.torproject.org/dist/torbrowser/12.5a6/"&gt;distribution directory&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This release updates Firefox 102.11.0esr, including bug fixes, stability improvements and important &lt;a href="https://www.mozilla.org/en-US/security/advisories/mfsa2023-17/"&gt;security updates&lt;/a&gt;. There were no Android-specific security updates to backport from the Firefox 113 release.&lt;/p&gt;
&lt;h2&gt;Build-Signing Infrastructure Updates&lt;/h2&gt;
&lt;p&gt;We are in the process of updating our build signing infrastructure, and unfortunately are unable to ship code-signed 12.5a6 installers for Windows systems currently. Therefore we will not be providing full Window installers for this release. However, automatic build-to-build upgrades from 12.5a4 and 12.5a5 should continue to work as expected.&lt;/p&gt;
&lt;h2&gt;Full changelog&lt;/h2&gt;
&lt;p&gt;The full changelog since &lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser-build/-/raw/main/projects/browser/Bundle-Data/Docs/ChangeLog.txt"&gt;Tor Browser 12.5a5&lt;/a&gt; is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;All Platforms&lt;ul&gt;
&lt;li&gt;Updated Translations&lt;/li&gt;
&lt;li&gt;Updated Go to 11.9.9&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser-build/-/issues/40860"&gt;Bug tor-browser-build#40860&lt;/a&gt;: Improve the transition from the old fontconfig file to the new one&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41728"&gt;Bug tor-browser#41728&lt;/a&gt;: Pin bridges.torproject.org domains to Let's Encrypt's root cert public key&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41738"&gt;Bug tor-browser#41738&lt;/a&gt;: Replace the patch to disable live reload with its preference&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41757"&gt;Bug tor-browser#41757&lt;/a&gt;: Rebase Tor Browser Alpha to 102.11.0esr&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41763"&gt;Bug tor-browser#41763&lt;/a&gt;: TTP-02-003 WP1: Data URI allows JS execution despite safest security level (Low)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41764"&gt;Bug tor-browser#41764&lt;/a&gt;: TTP-02-004 OOS: No user-activation required to download files (Low)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41775"&gt;Bug tor-browser#41775&lt;/a&gt;: Avoid re-defining some macros in nsUpdateDriver.cpp&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Windows + macOS + Linux&lt;ul&gt;
&lt;li&gt;Updated Firefox to 102.11esr&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41607"&gt;Bug tor-browser#41607&lt;/a&gt;: Update "New Circuit" icon&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41736"&gt;Bug tor-browser#41736&lt;/a&gt;: Customize the default CustomizableUI toolbar using CustomizableUI.jsm&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41770"&gt;Bug tor-browser#41770&lt;/a&gt;: Keyboard navigation broken leaving the toolbar tor circuit button&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41777"&gt;Bug tor-browser#41777&lt;/a&gt;: Internally shippped manual does not adapt to RTL languages (it always align to the left)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Windows + Linux&lt;ul&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41654"&gt;Bug tor-browser#41654&lt;/a&gt;: UpdateInfo jumped into Data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Linux&lt;ul&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41732"&gt;Bug tor-browser#41732&lt;/a&gt;: implement linux font whitelist as defense-in-depth&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/41776"&gt;Bug tor-browser#41776&lt;/a&gt;: System fonts are temporarily leaked on Linux after the browser is updated from 12.5a4 or earlier&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Android&lt;ul&gt;
&lt;li&gt;Updated GeckoView to 102.11esr&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Build System&lt;ul&gt;
&lt;li&gt;All Platforms&lt;ul&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser-build/-/issues/33953"&gt;Bug tor-browser-build#33953&lt;/a&gt;: Provide a way for easily updating Go dependencies of projects&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser-build/-/issues/40673"&gt;Bug tor-browser-build#40673&lt;/a&gt;: Avoid building each go module separately&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser-build/-/issues/40818"&gt;Bug tor-browser-build#40818&lt;/a&gt;: Enable wasm target for rust compiler&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser-build/-/issues/40841"&gt;Bug tor-browser-build#40841&lt;/a&gt;: Adapt signing scripts to new signing machines&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser-build/-/issues/40849"&gt;Bug tor-browser-build#40849&lt;/a&gt;: Move Go dependencies to the projects dependent on them, not as a standalone projects&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser-build/-/issues/40856"&gt;Bug tor-browser-build#40856&lt;/a&gt;: Unblock nightly builds&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Windows&lt;ul&gt;
&lt;li&gt;&lt;a href="https://gitlab.torproject.org/tpo/applications/tor-browser-build/-/issues/40846"&gt;Bug tor-browser-build#40846&lt;/a&gt;: Temporarily disable Windows signing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

    &lt;/div&gt;
  &lt;div class="categories"&gt;
    &lt;ul&gt;&lt;li&gt;
        &lt;a href="https://blog.torproject.org/../category/applications"&gt;
          applications
        &lt;/a&gt;
      &lt;/li&gt;&lt;li&gt;
        &lt;a href="https://blog.torproject.org/../category/releases"&gt;
          releases
        &lt;/a&gt;
      &lt;/li&gt;&lt;/ul&gt;
  &lt;/div&gt;
  &lt;/article&gt;
</content></entry>

You can see there are lots of `&lt;` and `&gt;` in these feeds, which are the escapes for `<` and `>`. In other words, the descriptions do not actually contain HTML tags, and that's because these feeds are buggy.

Hookshot's PR decided to handle these buggy feeds as their authors intended, but as a result it is now broken on correct feeds. For example, if a feed contained this: `<description>I want the &lt;blink&gt; tag back</description>`, then RSS clients are expected to display it like this: `I want the <blink> tag back`. After that PR, however, Hookshot will display it like this: `I want the tag back`, which is incorrect.
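The `<blink>` example above can be reproduced with a short sketch. The `strip_tags` function is a hypothetical stand-in for any "remove everything tag-shaped" cleanup; it is not Hookshot's actual code.

```python
import re
import xml.etree.ElementTree as ET

item = "<description>I want the &lt;blink&gt; tag back</description>"

# XML parsing undoes exactly one level of escaping, yielding literal text.
expected_display = ET.fromstring(item).text


def strip_tags(text):
    """Hypothetical cleanup: delete anything that looks like a tag."""
    without_tags = re.sub(r"<[^>]+>", "", text)
    return " ".join(without_tags.split())


stripped_display = strip_tags(expected_display)

print(expected_display)   # what a correct RSS client shows
print(stripped_display)   # what tag stripping produces instead
```

The second line loses the word the author actually wrote, which is the breakage described above.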

Therefore, I won't change Limnoria's behavior to accommodate buggy feeds while breaking correct feeds. The correct solution is to make the feeds' authors fix their feeds.

progval commented 1 year ago

Hmm, actually it seems that feedparser (the library Limnoria uses to parse RSS and Atom feeds) has a heuristic to auto-fix such feeds: it detects whether a description contains `&gt;` and `&lt;` but not a single `<` or `>`.
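To make the check described above concrete, here is a minimal sketch of it as worded in this comment. This is an assumption-laden illustration, not feedparser's real implementation, and the function names `looks_like_escaped_html` and `maybe_unescape` are invented for the example.

```python
import html


def looks_like_escaped_html(description):
    """True if the text has escaped angle brackets and no literal ones."""
    has_escaped = "&lt;" in description and "&gt;" in description
    has_literal = "<" in description or ">" in description
    return has_escaped and not has_literal


def maybe_unescape(description):
    """Undo one extra level of escaping only when the heuristic matches."""
    if looks_like_escaped_html(description):
        return html.unescape(description)
    return description


print(maybe_unescape("&lt;p&gt;Automatic update&lt;/p&gt;"))
print(maybe_unescape("I want the <blink> tag back"))
```

Text that already contains a literal `<` or `>` is left alone, so a description such as the `<blink>` sentence passes through unchanged.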

What version of feedparser do you have installed?
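One possible way to answer that from the same Python environment the bot runs in, sketched with the standard library's `importlib.metadata` (Python 3.8+). The helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print(installed_version("feedparser"))
```

This reports the version Python sees, which may lack the packaging revision suffix a distribution's package manager shows.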

Mikaela commented 1 year ago

Thank you. So far, I have reported this issue to GitHub.

My python3-feedparser appears to be 5.2.1-3, and it could be upgraded to 6.0.8-1~bpo11+1 from Debian Backports.