mborsetti / webchanges

webchanges anonymously checks web content (including images) and commands for changes, delivering instant notifications and AI-powered summaries to your favorite platform.
https://pypi.org/project/webchanges/

error when writing xml path for path with colon #31

Closed ghost closed 2 years ago

ghost commented 2 years ago

Not sure whether this is a mistake on my end or a bug in the software.

I have a job:

name: 21_SwimSwam News - Time Standard
url: "https://swimswam.com/feed/#1"
filter:
   - xpath:
      method: xml
      path: '//item/title/text()|//item/description/text()'
      exclude: 'a'
   - html2text:
      method: bs4
   - keep_lines_containing:
      re: '(?i)usa\sswimming|time\sstandard|ruta'
   - re.sub: '(?m)^[ \t]*'
additions_only: true

According to the docs I should be able to add this line to the path but I get an error.

//item/content:encoded/text()

path: '//item/title/text()|//item/description/text()|//item/content:encoded/text()'

https://webchanges.readthedocs.io/en/stable/filters.html?highlight=RSS#using-css-and-xpath-filters-with-xml-and-exclusions

Seems I am having an issue with the colon.

mborsetti commented 2 years ago

Hmmm... It's either YAML or some legacy code; I'll look into it.

Can you please run the job with '--verbose' and paste at least the top lines (with version and system info) and the full traceback and error message?

mborsetti commented 2 years ago

Didn't realize from mobile that you posted the complete job. The error I got is

    selected_elems = root.xpath(self.expression, namespaces=self.namespaces)
  File "src\lxml\etree.pyx", line 1597, in lxml.etree._Element.xpath
  File "src\lxml\xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
  File "src\lxml\xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Undefined namespace prefix

I don't know XML, but a Google search led me to https://stackoverflow.com/questions/44188237/python-parse-xml-feed-error-xpathevalerror-undefined-namespace-prefix which indicates that you probably need to define namespaces; the sub-directive namespaces is documented here https://webchanges.readthedocs.io/en/stable/filters.html#css-and-xpath.

Not knowing XML, I really can't help you further, but I hope to have pointed you in the right direction!
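
For anyone landing here later, a minimal sketch of why the prefix fails and how a prefix map fixes it. Assumptions: this uses Python's standard-library ElementTree as a stand-in rather than lxml (which is what the traceback above comes from), so the exception type differs and `text()` and `|` are not available, and the feed below is a made-up, trimmed-down sample rather than the real swimswam.com feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down RSS feed (not the real swimswam.com feed).
FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Time standards released</title>
      <description>Short summary</description>
      <content:encoded>Full article body</content:encoded>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(FEED)

# Without a prefix map, the 'content:' prefix in the path cannot be resolved.
try:
    root.findall(".//item/content:encoded")
    error = None
except SyntaxError as exc:  # ElementTree's rough counterpart to lxml's XPathEvalError
    error = str(exc)

# Mapping the prefix to the namespace URI declared in the feed resolves it.
namespaces = {"content": "http://purl.org/rss/1.0/modules/content/"}
bodies = [el.text for el in root.findall(".//item/content:encoded", namespaces)]

print(error)
print(bodies)
```

The prefix used in the path only has to match a key in the map; what has to match the feed is the URI, which is the same idea as the `namespaces` sub-directive of the xpath filter.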

ghost commented 2 years ago

That was it... thanks for pointing me in the right direction.

working job below:

name: 21_SwimSwam News - Time Standard
url: "https://swimswam.com/feed/#1"
filter:
   - xpath:
      method: xml
      path: '//item/title/text()|//item/description/text()|//item/content:encoded/text()'
      namespaces:
        content: http://purl.org/rss/1.0/modules/content/
      exclude: 'a'
   - keep_lines_containing:
      re: '(?i)usa\sswimming|time\sstandard'
   - html2text: re
additions_only: true

mborsetti commented 2 years ago

Yeah, Google's pretty powerful! Glad it worked out.

ghost commented 2 years ago

Not until you have the magic word and an example. Thanks again.

On Apr 7, 2022, at 7:05 PM, Mike Borsetti @.***> wrote:

 Yeah, Google's pretty powerful! Glad it worked out.