novoid / lazyblorg

Blogging with Org-mode for very lazy people
GNU General Public License v3.0

Robust Atom/RSS feed generator #24

Open novoid opened 5 years ago

novoid commented 5 years ago

With my simple Atom/RSS feed-generating code, I often had issues with some Atom/RSS aggregators in combination with special characters, encoding, and such. Images don't work within Atom/RSS feeds so far.

Maybe somebody wants to volunteer to add decent Atom/RSS support that generates rock-solid Atom/RSS feeds, so that people can follow the "full content" feed and not the "link only" feed which I am forced to recommend.

novoid commented 3 years ago

The main issues I could identify so far are related to HTML snippets from Twitter and YouTube.

novoid commented 3 years ago

The current feed is broken again. This can be verified via Thunderbird (stefan2904 reported the issue) or the W3C Validator:

This feed does not validate.

    line 216, column 151: XML parsing error: <unknown>:216:151: not well-formed (invalid token) [help]

        ... up/ghxrq87/?utm_source=reddit&utm_medium=web2x&amp;context=3">this reddi ...
                                                     ^

In addition, interoperability with the widest range of feed readers could be improved by implementing the following recommendations.

    line 5, column 99: Self reference doesn't match document location [help]

        ... g-all.atom_1.0.links-and-content.xml" />
                                                     ^

    line 7, column 26: Identifier "http://Karl-Voit.at/" is not in canonical form (the canonical form would be "http://karl-voit.at/") (2 occurrences) [help]

          <id>http://Karl-Voit.at/</id>
                                  ^
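
The first error in the validator output above is a plain well-formedness problem, so it could also be caught locally before a feed is published. A minimal sketch, assuming the generated feed is available as a string; the function name `first_xml_error` is hypothetical and not part of lazyblorg. It only checks XML well-formedness and does not cover the Atom-specific recommendations (self reference, canonical identifier).

```python
import xml.etree.ElementTree as ET


def first_xml_error(feed_text):
    """Return (line, column, message) for the first XML parse error,
    or None if the text is well-formed XML.

    Hypothetical helper for illustration; not part of lazyblorg.
    """
    try:
        ET.fromstring(feed_text)
    except ET.ParseError as err:
        line, column = err.position  # position of the offending token
        return line, column, str(err)
    return None


# A bare "&" inside an attribute value is rejected by the parser:
broken = '<feed><a href="https://example.org/?a=1&b=2"/></feed>'
fixed = '<feed><a href="https://example.org/?a=1&amp;b=2"/></feed>'
print(first_xml_error(broken))
print(first_xml_error(fixed))
```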

stefan2904 commented 3 years ago

@novoid I think merging PR #58 accidentally closed this issue ... I don't think the PR fixes the overall robustness problems. ;-)

novoid commented 3 years ago

Thanks @stefan2904,

I had to revert your commits because they resulted in unwanted replacements such as &lt; to &amp;lt;.

Note to myself: there is no distinction between the HTML content for the feeds and for the blog articles. Therefore, with the current approach, there are the following possibilities:

  1. Parse the generated feed data and do a smart search & replace of characters, such as the multiple "&" within URLs only.
    • Would require: a list of characters to replace (those which cause errors); a parser that makes only the replacements that are necessary.
  2. Make another attempt to switch to CDATA for the feeds.
  3. Evaluate and integrate an external feed library (a dependency I would prefer not to have if possible).
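
Option 1 could be sketched roughly as follows. This is only an illustration under assumptions, not the code used in lazyblorg: the name `escape_bare_ampersands` is hypothetical, and the regular expression treats anything shaped like `&name;`, `&#123;` or `&#x1F;` as an already-escaped entity. It therefore leaves `&lt;` and `&amp;` untouched, which avoids the `&lt;` to `&amp;lt;` replacement mentioned above, but it would misread a URL parameter that happens to look like `&name;`.

```python
import re

# "&" that is NOT the start of a named, decimal, or hexadecimal entity.
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(markup):
    """Replace only unescaped "&" with "&amp;"; existing entities are kept.

    Hypothetical helper for illustration; not part of lazyblorg.
    """
    return BARE_AMPERSAND.sub("&amp;", markup)


# Made-up URL with one bare and one already-escaped ampersand:
link = '<a href="https://example.org/?a=1&b=2&amp;c=3">link</a>'
print(escape_bare_ampersands(link))
# -> <a href="https://example.org/?a=1&amp;b=2&amp;c=3">link</a>
```

Because entities that are already escaped are skipped, running the function a second time on its own output changes nothing.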

novoid commented 2 years ago

Note: maybe https://sr.ht/~brettgilio/org-webring/ could be part of the solution?

novoid commented 2 years ago

Another workaround is in progress (although not particularly for this issue): replacing all external content (iframes from YouTube, Mastodon, Twitter) with images and links.

Update 2021-12-29: with the most recent implementation of images that are themselves a-href links, I'll start this approach: I'll replace all my embedded Twitter/Mastodon/YouTube snippets with screenshots + links. While this doesn't fix the original issue with broken feed files, it avoids them because almost every feed issue is related to external content.

btrummer commented 2 years ago

Current XML parse error in https://karl-voit.at/feeds/lazyblorg-all.atom_1.0.links-and-teaser.xml:

Line 310: <a href="https://duckduckgo.com/?t=ffab&q=impfskeptik+deutschsprachig+europa&amp;ia=web">und weitere</a>

The first '&' is not replaced with '&amp;', causing an XML parse error in KDE Kontact.

novoid commented 2 years ago

Thanks for reporting. This is not related to the issue here. It's a new bug which is handled in #64.