thp / urlwatch

Watch (parts of) webpages and get notified when something changes via e-mail, on your phone or via other means. Highly configurable.
https://thp.io/2008/urlwatch/

XML parsing with CDATA not working #817

Open oxivanisher opened 2 months ago

oxivanisher commented 2 months ago

I'm trying to monitor new releases of Factorio, but fields that contain CDATA sections always come back empty.

Factorio publishes new releases in their phpBB forum, which has an Atom feed. The job that should work, IMHO, is:

name: "Factorio Release"
url: 'https://forums.factorio.com/app.php/feed/forum/3'
filter:
  - xpath: '//entry[1]/title/text()'

One of the entries looks like this:

<entry>
    <author><name><![CDATA[FactorioBot]]></name></author>
    <updated>2024-04-11T15:29:30</updated>
    <published>2024-04-11T15:29:30</published>
    <id>https://forums.factorio.com/viewtopic.php?t=112937&amp;p=608190#p608190</id>
    <link href="https://forums.factorio.com/viewtopic.php?t=112937&amp;p=608190#p608190"/>
    <title type="html"><![CDATA[Releases • Version 1.1.107]]></title>
    <category term="Releases" scheme="https://forums.factorio.com/viewforum.php?f=3" label="Releases"/>
    <content type="html" xml:base="https://forums.factorio.com/viewtopic.php?t=112937&amp;p=608190#p608190"><![CDATA[
    <strong class="text-strong">Modding</strong>  <ul>    <li>Added an optional "mods" to simulation definitions.</li>  </ul><strong class="text-strong">Scripting</strong>  <ul>    <li>Disabled the majority of the lua "debug" library due to security issues.</li>  </ul><strong class="text-strong">Bugfixes</strong>  <ul>    <li>Fixed LuaEntity::set_request_slot would not accept count of 0. (<a href="https://forums.factorio.com/110676" class="postlink">110676</a>)</li>    <li>Fixed first tutorial level advancing to a wrong story step after drill is set in quickbar. (<a href="https://forums.factorio.com/109315" class="postlink">109315</a>)</li>    <li>Fixed mods sorting order by last highlighted and by last updated. (<a href="https://forums.factorio.com/106420" class="postlink">106420</a>)</li>  </ul>Use the automatic updater if you can (check experimental updates in other settings) or download full installation at <a href="https://www.factorio.com/download/experimental" class="postlink">https://www.factorio.com/download/experimental</a>.<p>Statistics: Posted by <a href="https://forums.factorio.com/memberlist.php?mode=viewprofile&amp;u=7177">FactorioBot</a> — Thu Apr 11, 2024 3:29 pm</p><hr />
    ]]></content>
</entry>

I can read every field that does not contain a CDATA section, but none that does. For example, '//entry[1]/id/text()' works without a problem.

Jamstah commented 1 month ago

This seems to work using the XML parser instead of the HTML one, but you do need to specify the namespace correctly:

name: "Factorio Release"
url: 'https://forums.factorio.com/app.php/feed/forum/3'
filter:
  - xpath:
      path: '//atom:entry[1]/atom:title/text()'
      method: xml
      namespaces:
        atom: 'http://www.w3.org/2005/Atom'
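
A minimal sketch of why the namespace mapping matters once the feed is parsed as XML. It uses Python's standard-library xml.etree.ElementTree as a stand-in (not lxml, and not urlwatch's own code), and FEED is a trimmed, hypothetical sample modelled on the entry quoted above, not the real feed:

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample modelled on the forum feed entry.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>https://forums.factorio.com/viewtopic.php?t=112937</id>
    <title type="html"><![CDATA[Releases • Version 1.1.107]]></title>
  </entry>
</feed>"""

# Prefix-to-URI mapping, same idea as the "namespaces" key in the job above.
NS = {"atom": "http://www.w3.org/2005/Atom"}

root = ET.fromstring(FEED)

# Without a prefix, the path asks for elements in no namespace,
# so it cannot match the Atom elements.
unqualified = root.find("entry[1]/title")

# With the prefix mapped to the Atom namespace, the same path matches,
# and the CDATA content is available as ordinary text.
qualified = root.find("atom:entry[1]/atom:title", NS)

print(unqualified)
print(qualified.text)
```

The prefix name itself ("atom") is arbitrary; only the URI it maps to has to match the xmlns declared on the feed's root element.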

oxivanisher commented 1 month ago

This works great! So my problem is solved, but I don't know whether the issue should stay open, since the plain xpath filter without the XML method should probably work too?

Jamstah commented 1 month ago

I'm not sure. You're trying to parse XML with an HTML parser. From what I could see, it should work, but it doesn't.

I expect a simple test case using lxml etree on its own would be a good start. Open an issue on the lxml bug tracker with sample code and see what happens.
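
Such a test case might look roughly like the sketch below. It assumes lxml is installed; FEED is a trimmed, hypothetical sample modelled on the entry quoted earlier, and the two helper functions are made-up names for illustration. It is not the code from any actual bug report, and it does not claim what the HTML parser returns, it only prints it for comparison:

```python
from lxml import etree

# Hypothetical, trimmed-down sample modelled on the forum feed entry.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>https://forums.factorio.com/viewtopic.php?t=112937</id>
    <title type="html"><![CDATA[Releases • Version 1.1.107]]></title>
  </entry>
</feed>"""

ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def title_via_html_parser(text):
    """Parse with lxml's HTML parser and query the title without namespaces."""
    root = etree.fromstring(text, parser=etree.HTMLParser())
    return root.xpath("//entry[1]/title/text()")


def title_via_xml_parser(text):
    """Parse with lxml's XML parser and query the title with the Atom namespace."""
    root = etree.fromstring(text, parser=etree.XMLParser())
    return root.xpath("//atom:entry[1]/atom:title/text()", namespaces=ATOM)


if __name__ == "__main__":
    # Print both results side by side so the difference (if any) is visible.
    print("HTML parser:", title_via_html_parser(FEED))
    print("XML parser: ", title_via_xml_parser(FEED))
```

Keeping the sample free of urlwatch entirely makes it easier to tell whether the behaviour comes from lxml (or the underlying libxml2) rather than from how urlwatch calls it.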

I don't see anything wrong with how urlwatch is using the library, but I'm not an expert.

oxivanisher commented 1 month ago

I don't know either. But according to Wikipedia, XPath stands for "XML Path Language", and I found lots of XML examples without even searching for them. Maybe the library in use isn't designed for XML? But that doesn't really make sense either. Let's keep this open for the moment and see what the dev(s) have to say about it.

Jamstah commented 1 month ago

By default urlwatch uses the HTMLParser class from lxml etree. My example switches it to the XML parser.

Jamstah commented 1 month ago

FYI: https://bugs.launchpad.net/lxml/+bug/2067707