thp / urlwatch

Watch (parts of) webpages and get notified when something changes via e-mail, on your phone or via other means. Highly configurable.
https://thp.io/2008/urlwatch/

Html2Text how to add a new line after each element #513

Open ASL07 opened 4 years ago

ASL07 commented 4 years ago

Hi,

Hope you can help me with this. Please forgive me, I am not an expert on html2text. I think this should be easy to do, but I cannot figure out how.

I have this url: http://adriansantos.me/test.html and this job:

name: "AS"
url: "http://adriansantos.me/test.html"
max_tries: 2
ssl_no_verify: true
filter:
  - xpath: //*[@class= 'single-opportunity' and span[contains(text(), 'United Kingdom') or contains(text(), 'UNITED KINGDOM')]]
  - html2text:
      method: pyhtml2text
      unicode_snob: true
      body_width: 0
      inline_links: true
      ignore_links: false
      ignore_images: true
      single_line_break: true
  - sort:
---

Which produces the following output:

[ IT Technical Product Owner United Kingdom ](https://career.camlingroup.com/careers/opportunities/tpo-111901-it-technical-product-owner) [ IT Technical 2 United Kingdom ](https://career.camlingroup.com/careers/opportunities/tpo-111901-it-technical-product-owner) [ IT Technical 3 UNITED KINGDOM ](https://career.camlingroup.com/careers/opportunities/tpo-111901-it-technical-product-owner)

How can I make html2text add a new line after each "element"? In other words, how can I get this output?

[ IT Technical Product Owner United Kingdom ](https://career.camlingroup.com/careers/opportunities/tpo-111901-it-technical-product-owner)
[ IT Technical 2 United Kingdom ](https://career.camlingroup.com/careers/opportunities/tpo-111901-it-technical-product-owner)
[ IT Technical 3 UNITED KINGDOM ](https://career.camlingroup.com/careers/opportunities/tpo-111901-it-technical-product-owner)

Thanks for your help

thp commented 4 years ago

Did you try removing single_line_break?

ASL07 commented 4 years ago

Yes, that doesn't work either

mborsetti commented 4 years ago

Nothing is wrong with html2text: your XPath passes it a series of <a> elements with nothing separating them, and since <a> is an inline element they all end up on the same line:

<a class="single-opportunity" href="https://career.camlingroup.com/careers/opportunities/tpo-111901-it-technical-product-owner">
                    IT Technical Product Owner <span class="">United Kingdom</span>
                </a>
<a class="single-opportunity" href="https://career.camlingroup.com/careers/opportunities/tpo-111901-it-technical-product-owner">
                    IT Technical 2 <span class="">United Kingdom</span>
                </a>
<a class="single-opportunity" href="https://career.camlingroup.com/careers/opportunities/tpo-111901-it-technical-product-owner">
                    IT Technical 3 <span class="">UNITED KINGDOM</span>
                </a>

If you want line breaks for this specific HTML, your XPath needs to capture the outer container as well, in this case a <li>:

filter:
  - xpath: //*[*[@class= 'single-opportunity' and span[contains(text(), 'United Kingdom') or contains(text(), 'UNITED KINGDOM')]]]

This has the desired effect (which, unlike your example above, is sorted correctly):

* [ IT Technical 2 United Kingdom ](https://career.camlingroup.com/careers/opportunities/tpo-111901-it-technical-product-owner)
* [ IT Technical 3 UNITED KINGDOM ](https://career.camlingroup.com/careers/opportunities/tpo-111901-it-technical-product-owner)
* [ IT Technical Product Owner United Kingdom ](https://career.camlingroup.com/careers/opportunities/tpo-111901-it-technical-product-owner)
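
To see why wrapping the predicate in an extra *[...] changes what gets selected, here is a minimal Python sketch using only the standard library. It is an illustration, not what urlwatch runs: ElementTree's XPath subset has no contains() or "or", so the text test lives in a hypothetical helper called matches, and the sample markup (the <ul>, the href values and the Germany entry) is made up for the demonstration.

```python
import xml.etree.ElementTree as ET

# Made-up sample: each link sits inside its own <li>, as in the page discussed here.
doc = ET.fromstring(
    "<ul>"
    "<li><a class='single-opportunity' href='#1'>IT Technical 2 <span>United Kingdom</span></a></li>"
    "<li><a class='single-opportunity' href='#2'>Other role <span>Germany</span></a></li>"
    "<li><a class='single-opportunity' href='#3'>IT Technical 3 <span>UNITED KINGDOM</span></a></li>"
    "</ul>"
)

WANTED = ("United Kingdom", "UNITED KINGDOM")


def matches(el):
    """Approximate the predicate: class is 'single-opportunity' and a child
    <span> contains one of the wanted strings."""
    if el.get("class") != "single-opportunity":
        return False
    return any(
        any(w in (span.text or "") for w in WANTED) for span in el.findall("span")
    )


# Like //*[...]: the matching <a> elements themselves (inline, no separators).
links = [el for el in doc.iter() if matches(el)]

# Like //*[*[...]]: any element that has a matching child, i.e. the <li> wrappers.
containers = [el for el in doc.iter() if any(matches(child) for child in el)]

print([el.tag for el in links])
print([el.tag for el in containers])
```

Selecting the <li> containers hands html2text list items rather than bare links, which is why each entry then lands on its own line.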

Alternatively, you can insert re.sub filters that modify the HTML to add a <br> after each closing </a> tag (or a <br /> after a self-closing <a /> in XHTML):

filter:
  - xpath: //*[@class= 'single-opportunity' and span[contains(text(), 'United Kingdom') or contains(text(), 'UNITED KINGDOM')]]
  - re.sub:
      pattern: </a>
      repl: </a><br>
  - re.sub:
      pattern: <a />
      repl: <a /><br />
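
A quick way to check what these two substitutions do before putting them in the job is to run them in Python. This is only a sketch: it assumes the re.sub filter applies its pattern the same way Python's re.sub does, the sample string is shortened from the HTML above, and add_breaks is a hypothetical helper name.

```python
import re

URL = "https://career.camlingroup.com/careers/opportunities/tpo-111901-it-technical-product-owner"

# Shortened version of what the xpath filter passes on: consecutive <a> elements.
html = (
    f'<a class="single-opportunity" href="{URL}">'
    'IT Technical 2 <span class="">United Kingdom</span></a>\n'
    f'<a class="single-opportunity" href="{URL}">'
    'IT Technical 3 <span class="">UNITED KINGDOM</span></a>'
)


def add_breaks(data: str) -> str:
    """Apply the same two substitutions as the re.sub filters, in order."""
    data = re.sub(r"</a>", "</a><br>", data)      # HTML: break after each closing tag
    data = re.sub(r"<a />", "<a /><br />", data)  # XHTML: break after a self-closing <a />
    return data


result = add_breaks(html)
print(result)
```

Each closing </a> is now followed by a <br>, so html2text has an explicit line break to render between the links.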