Open oxivanisher opened 6 months ago
This seems to work using the XML parser instead of the HTML one, but you do need to specify the namespace correctly:
name: "Factorio Release"
url: 'https://forums.factorio.com/app.php/feed/forum/3'
filter:
  - xpath:
      path: '//atom:entry[1]/atom:title/text()'
      method: xml
      namespaces:
        atom: 'http://www.w3.org/2005/Atom'
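To illustrate why the namespace mapping matters once the feed is parsed as XML, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree` (not lxml, and not urlwatch itself). The sample feed, the `example.invalid` URL, and the version string are made up for illustration; only the Atom namespace URI comes from the config above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down Atom feed (not the real Factorio feed).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>https://example.invalid/entry/1</id>
    <title type="html"><![CDATA[Version 1.0.0]]></title>
  </entry>
</feed>"""

# Same prefix-to-URI mapping as the `namespaces:` key in the job above.
NAMESPACES = {"atom": "http://www.w3.org/2005/Atom"}

root = ET.fromstring(SAMPLE_FEED)

# Without a prefix the path looks for elements in *no* namespace,
# so it finds nothing: every element here lives in the Atom namespace.
unprefixed = root.find("entry[1]/title")

# With the prefix mapped to the Atom namespace URI the element is found,
# and the CDATA section is exposed as ordinary text.
prefixed = root.find("atom:entry[1]/atom:title", NAMESPACES)
title_text = prefixed.text

print(unprefixed)   # None
print(title_text)   # Version 1.0.0
```

The prefix name (`atom` here) is arbitrary; what has to match is the namespace URI declared on the feed's root element.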
This works great! So my problem is solved, but I don't know if the issue should be left open, since it probably should work with xpath also?
I'm not sure. You're trying to parse XML with an HTML parser. From what I could see, it should work, but it doesn't.
I expect a simple test case using lxml etree on its own would be a good start: open an issue on the lxml bug tracker with sample code and see what happens.
I don't see anything wrong with how urlwatch is using the library, but I'm not an expert.
I don't know either. But according to Wikipedia, XPath stands for "XML Path Language", and I also found lots of XML examples without even searching for them. Maybe the library used here isn't meant for XML? But that doesn't really make sense either. Let's keep this open for the moment and see what the dev(s) have to say about it.
By default urlwatch uses the HTMLParser class from lxml etree. My example switches it to the XML parser.
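A standalone test case along the lines suggested above might look like the following sketch. It assumes lxml is installed, and the sample feed, its `example.invalid` URL, and the two helper function names are hypothetical. It only compares lxml's two parsers on the same input and does not reproduce urlwatch's own code path.

```python
from lxml import etree

# Hypothetical, trimmed-down Atom feed with a CDATA title (not the real feed).
SAMPLE_FEED = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>https://example.invalid/entry/1</id>
    <title type="html"><![CDATA[Version 1.0.0]]></title>
  </entry>
</feed>"""

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def titles_with_xml_parser(data):
    """Parse as XML and query with namespace-qualified names."""
    root = etree.fromstring(data, parser=etree.XMLParser())
    return root.xpath("//atom:entry[1]/atom:title/text()", namespaces=ATOM_NS)


def titles_with_html_parser(data):
    """Parse as HTML; the HTML parser does not apply XML namespaces,
    so the query uses plain element names."""
    root = etree.fromstring(data, parser=etree.HTMLParser())
    return root.xpath("//entry[1]/title/text()")


if __name__ == "__main__":
    print("XML parser: ", titles_with_xml_parser(SAMPLE_FEED))
    print("HTML parser:", titles_with_html_parser(SAMPLE_FEED))
```

If the HTML-parser line prints an empty list while the XML-parser line prints the title, that would match the behaviour reported in this issue and would be a compact reproducer to attach to an lxml bug report.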
I am trying to monitor new releases of Factorio, but it seems that fields containing CDATA are always returned empty. Factorio publishes new releases in their phpBB forum, which has an Atom feed. The entry that should work IMHO is:
One of the entries looks like this:
I am able to get all the fields not containing CDATA, but none containing it. So for example '//entry[1]/id/text()' works without a problem.