pipes-digital / pipes

Repository for Pipes
https://pipes.digital
GNU Affero General Public License v3.0

Cannot parse a malformed feed #91

Open anewuser opened 2 years ago

anewuser commented 2 years ago

This feed currently contains an encoding issue and Pipes cannot parse it: http://feedrinse.com/services/channel/?chanurl=882eadeaf9ef2636f65d656793114983 .

I've reported this to Feed Rinse, but could you take a look to see if you can find a workaround? SimplePie can handle it and just replaces the broken character with a normal question mark: https://www.simplepie.org/demo/?feed=http%3A%2F%2Ffeedrinse.com%2Fservices%2Fchannel%2F%3Fchanurl%3D882eadeaf9ef2636f65d656793114983

onli commented 2 years ago

Hey. Not sure how to solve this. This is the error we get:

2022-04-22 20:27:29 - ArgumentError - invalid byte sequence in UTF-8:
    .../pipes/vendor/bundle/ruby/3.1.0/gems/rss-0.2.9/lib/rss/parser.rb:132:in `maybe_xml?'

The problematic String seems to be this:

<title>UNDER FALL JUSTICE オンライン限定シングル「壊れたオモチャ」2022年3月31日� ...</title>

I assume the � marks a multi-byte character that was cut off in the middle?

There are some workarounds, but they involve converting or forcing the encoding of the feed string with something like https://ruby-doc.org/core-2.7.0/String.html#method-i-encode, or guessing the encoding with a gem. I'd be very worried about breaking a lot of things with that.
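For illustration, here is a minimal sketch of what such a workaround could look like. `scrub_feed` is a hypothetical helper and not part of Pipes, and it assumes the feed really is meant to be UTF-8 with only a few stray invalid bytes. A feed in another encoding would be garbled by it, which is exactly the risk described above.

```ruby
# Hypothetical helper, not part of Pipes. It assumes the feed is meant to be
# UTF-8 and only contains a few stray invalid bytes.
def scrub_feed(raw, replacement = "?")
  # dup leaves the caller's string untouched, force_encoding relabels the
  # bytes without converting them, and scrub swaps every invalid byte
  # sequence for the replacement string.
  raw.dup.force_encoding(Encoding::UTF_8).scrub(replacement)
end

# "発" is E7 99 BA in UTF-8. Dropping the last byte mimics a title that was
# cut in the middle of a character.
truncated = "2022年3月31日".b + "\xE7\x99".b

puts truncated.dup.force_encoding(Encoding::UTF_8).valid_encoding?  # false
cleaned = scrub_feed(truncated)
puts cleaned.valid_encoding?  # true
```

The default replacement of `"?"` mirrors what SimplePie is reported to do with this feed. Calling `scrub` without an argument would insert U+FFFD instead.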

In general I agree that Pipes should just handle this feed somehow, but this is such a wide field of potential issues that hoping the input feed gets fixed seems like a better option to me. Input on how to solve this safely is welcome, though; I may well be wrong here.

anewuser commented 2 years ago

Thank you for looking into it.

> The problematic String seems to be this:

Yes, the original title says 31日発売.

> hoping the input feed gets fixed seems like a better option to me

Alright. This issue is too specific to worry much about. Feed Rinse has some internal filters too, and I can use them to remove problematic posts in case they don't fix their parser soon.

anewuser commented 2 years ago

@onli What about this other case?

Something is causing one of my subscriptions to add invalid XML entities to item titles. I believe it's always a series of &#0;. I've contacted the site owner about it, but he never replied. The problem goes away for a while but then returns.

Removing the invalid character references from the feed source at the moment it is downloaded would be enough to fix it. I can't do it with a replace block, though. As soon as the source goes through a block, Pipes stops parsing it as XML.

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>
  <title>Example</title>
  <link>https://example.com/</link>
  <description>Example feed</description>
  <item>
    <title>Bad title &#0;&#0;&#0;&#0;&#0;</title>
    <link>https://example.com/8325262</link>
    <description>An item with a bad title</description>
  </item>
  <item>
    <title>Good title</title>
    <link>https://example.com/4325262</link>
    <description>An item with a good title</description>
  </item>
</channel>
</rss>
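
As a rough sketch of the cleanup described above, the downloaded source could be filtered before it reaches the parser. `valid_xml_char?` and `strip_invalid_char_refs` are hypothetical names, not existing Pipes functions, and the allowed ranges follow the `Char` production of XML 1.0. One known limitation: the regex also touches matching text inside CDATA sections, where it would technically be literal content.

```ruby
# Hypothetical helpers, not part of Pipes.

# True if the code point is allowed in an XML 1.0 document.
def valid_xml_char?(code)
  code == 0x9 || code == 0xA || code == 0xD ||
    (0x20..0xD7FF).cover?(code) ||
    (0xE000..0xFFFD).cover?(code) ||
    (0x10000..0x10FFFF).cover?(code)
end

# Removes numeric character references (&#0; or &#x0;) that point at code
# points XML 1.0 forbids, and leaves every other reference untouched.
def strip_invalid_char_refs(xml)
  xml.gsub(/&#(?:x([0-9a-fA-F]+)|([0-9]+));/) do |ref|
    match = Regexp.last_match
    code = match[1] ? match[1].to_i(16) : match[2].to_i
    valid_xml_char?(code) ? ref : ""
  end
end

puts strip_invalid_char_refs("<title>Bad title &#0;&#0;&#0;&#0;&#0;</title>")
# prints "<title>Bad title </title>"
```

Valid references such as `&#38;` pass through unchanged, so the result can be handed to the feed parser as usual.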