novoid / lazyblorg

Blogging with Org-mode for very lazy people
GNU General Public License v3.0

Fix bug: only one of two ampersands in one URL gets sanitized in XML feed #64

novoid closed this issue 2 years ago

novoid commented 2 years ago

Issue was found by @btrummer at https://github.com/novoid/lazyblorg/issues/24#issuecomment-1002472202:


Current XML parse error in https://karl-voit.at/feeds/lazyblorg-all.atom_1.0.links-and-teaser.xml:

Line 310: <a href="https://duckduckgo.com/?t=ffab&q=impfskeptik+deutschsprachig+europa&amp;ia=web">und weitere</a>

The first & is not replaced with &amp;, which causes an XML parse error in KDE Kontact.


This is most probably not related to the usual XML feed errors from external content.
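
The parse error quoted above can be reproduced outside of any feed reader. This is only a minimal sketch using Python's standard XML parser, not the check that KDE Kontact or lazyblorg performs; the helper name `is_well_formed` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The feed line as quoted in the report: the first "&" is raw,
# the second one is already escaped as "&amp;".
broken = (
    '<a href="https://duckduckgo.com/?t=ffab&q=impfskeptik'
    '+deutschsprachig+europa&amp;ia=web">und weitere</a>'
)


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML, False otherwise.

    Hypothetical helper for this example only.
    """
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(broken))  # prints False: the raw "&" is rejected
```

A raw & inside an attribute value starts an entity reference in XML, so a parser expects a name followed by a semicolon and stops at `&q=`.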

novoid commented 2 years ago

@btrummer According to https://stackoverflow.com/questions/281682/reference-to-undeclared-entity-exception-while-working-with-xml, I should have replaced the ampersands with their numeric value, such as &#38;.

I experimented by fixing the feed manually and re-running the validator at https://validator.w3.org/feed/check.cgi?url=https%3A%2F%2Fkarl-voit.at%2Ffeeds%2Flazyblorg-all.atom_1.0.links-and-teaser.xml

The validator did not accept the numeric code.

The correct form for the feed according to the validator is:

<a href="https://duckduckgo.com/?t=ffab&amp;q=impfskeptik+deutschsprachig+europa&amp;ia=web">und weitere</a>

Funnily enough, the error is introduced by the function fix_ampersands_in_url(), which reverses the & to &amp; replacement in URLs. It fails when a URL contains more than one ampersand, a limitation that was already noted in the function description.

Unfortunately, I can't remember why fix_ampersands_in_url() was introduced in the first place, given that ampersands need to be replaced by their HTML entity counterpart.

I have now changed the behavior of fix_ampersands_in_url() so that each ampersand is replaced by its HTML entity exactly once.
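
For readers who want to see what "exactly once" can look like, here is a minimal sketch. It is not the actual code of fix_ampersands_in_url(); the function name `escape_ampersands_once` and the regular expression are assumptions for illustration. It also assumes that &amp; is the only entity that can already be present in the URL, so any other entity such as &lt; would get its ampersand escaped again.

```python
import re

# Matches an "&" unless it already starts the entity "&amp;".
# Assumption: no other entities occur in the URLs being processed.
_BARE_AMPERSAND = re.compile(r"&(?!amp;)")


def escape_ampersands_once(url: str) -> str:
    """Escape every raw "&" as "&amp;" without double-escaping.

    Hypothetical stand-in, not lazyblorg's real implementation.
    """
    return _BARE_AMPERSAND.sub("&amp;", url)


# The URL from the report, with one raw and one escaped ampersand.
mixed = "https://duckduckgo.com/?t=ffab&q=impfskeptik+deutschsprachig+europa&amp;ia=web"
print(escape_ampersands_once(mixed))
```

Because the negative lookahead skips ampersands that are already escaped, running the function a second time on its own output changes nothing, which is the property the fix needs.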