mborsetti / webchanges

webchanges anonymously checks web content (including images) and commands for changes, delivering instant notifications and AI-powered summaries to your favorite platform.
https://pypi.org/project/webchanges/

[FEATURE] Rewrite relative links in the output #62

Closed pawelpbm closed 10 months ago

pawelpbm commented 11 months ago

My filter monitors a website for newly added links and returns the content of the <a> tag. That works fine if the page contains absolute links, but many pages contain only relative links.

It would be cool if webchanges could replace relative links with absolute links. I tried to find a nice command-line tool that would do that, but there are only some hacky ways listed here.

Also, as noted in the above link, it should be easy to do that with BeautifulSoup, but I would probably see it as a separate filter rather than part of beautify.

pawelpbm commented 11 months ago

So far I'm using the following script as a filter:

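#!/usr/local/bin/python3
# Filter script: rewrite relative <a href> links as absolute URLs.

import os
import sys
from urllib.parse import urljoin

from bs4 import BeautifulSoup

content = sys.stdin.read()

# Name the parser explicitly; bs4 warns when it has to guess one.
soup = BeautifulSoup(content, 'html.parser')
# Base URL of the job, read from the environment.
job_location = os.getenv('WEBCHANGES_JOB_LOCATION')

# Resolve each href against the job URL; absolute hrefs pass through unchanged.
for anchor in soup.find_all('a', href=True):
    anchor['href'] = urljoin(job_location, anchor['href'])

print(str(soup))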

I think it would be useful if webchanges could do it natively.
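
For reference, here is a small sketch of how urljoin resolves the different kinds of href against a page URL, which is the step the script above relies on. The base URL and link values are made up for illustration and do not come from any real job.

```python
from urllib.parse import urljoin

# Hypothetical page URL standing in for WEBCHANGES_JOB_LOCATION.
base = "https://example.com/docs/page.html"

# Hypothetical href values of the kinds commonly found in <a> tags.
hrefs = [
    "other.html",            # relative to the current directory
    "/root.html",            # relative to the site root
    "../up.html",            # relative to the parent directory
    "#section",              # fragment on the same page
    "https://other.org/x",   # already absolute, left unchanged
]

resolved = {href: urljoin(base, href) for href in hrefs}

for href, absolute in resolved.items():
    print(f"{href!r:24} -> {absolute}")
```

Links that are already absolute come back untouched, so running the filter on a page that mixes both kinds is harmless.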

mborsetti commented 11 months ago

Great idea, thanks!

Let me give it some thought; it's probably more lightweight to do it with urllib.parse.urljoin, using lxml to extract the tags, which would not require additional dependencies (i.e. BeautifulSoup).
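
A rough sketch of what that could look like, assuming lxml is installed. The function name absolutize_links and the sample markup are hypothetical, and this is not necessarily how webchanges implements its filter.

```python
from urllib.parse import urljoin

from lxml import html


def absolutize_links(markup: str, base_url: str) -> str:
    """Return markup with every <a href> resolved against base_url.

    Hypothetical helper for illustration only.
    """
    tree = html.fromstring(markup)
    # iter() also visits the root element, so a bare <a> fragment is covered.
    for anchor in tree.iter("a"):
        href = anchor.get("href")
        if href is not None:
            anchor.set("href", urljoin(base_url, href))
    return html.tostring(tree, encoding="unicode")


if __name__ == "__main__":
    sample = '<p><a href="a.html">A</a> <a href="https://other.org/b">B</a></p>'
    print(absolutize_links(sample, "https://example.com/docs/"))
```

Only the href attributes are rewritten; the markup is otherwise serialized back as lxml parsed it, so the diff stays close to the original HTML.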

mborsetti commented 11 months ago

It just dawned on me that relative links should be automatically rendered as absolute links in the html report.

Can you please confirm whether you have a use case with text or markup report types or it's a problem with the html report?

pawelpbm commented 11 months ago

I'm using report type email with html: true, but TBH I'm not sure if it's actually based on the html report or on text. I think when I wanted to have separate emails per URL I had to set separate: true in the text section.

How can I actually confirm that?

pawelpbm commented 11 months ago

I do have coloring in the emails, as in the screenshot.

[Screenshot 2023-11-05 at 23 28 24]

mborsetti commented 11 months ago

OK, I looked at the code and it's html2text that modifies the relative links to make them absolute. Does using that filter work in your use case?

pawelpbm commented 11 months ago

While html2text indeed converts the links to absolute, it also "renders" the HTML. This means that instead of a simple and very readable diff I'm getting a lot of mess...

mborsetti commented 11 months ago

> While html2text indeed converts the links to absolute, it also "renders" the HTML. This means that instead of a simple and very readable diff I'm getting a lot of mess...

I don't quite understand the setup or data that you have, since that setup (html2text plus an html reporter) is typically the most readable one for tracking HTML sources. It's one that I use all the time, and the diffs are very readable and clickable.

I will add the recommended filter at the next release for your use.

Thanks for the contribution, very much appreciated!

mborsetti commented 11 months ago

P.S. if you have any suggested names for the filter, I am all ears!

pawelpbm commented 11 months ago

Maybe I'm doing something wrong, I'm pretty new to using webchanges.

This is what I'm getting without html2text, and I actually quite like that format: https://drive.google.com/file/d/1RYS-7mmZDBVDoczVdalYCAF4-EH9g3SR/view?usp=sharing

That's what I'm getting with html2text: https://drive.google.com/file/d/1Xj5733ffmEzFDM-zv86GNJ_B7YkBGeWL/view?usp=sharing

Both screenshots are from the web interface in Gmail.

mborsetti commented 10 months ago

Implemented in v3.16: https://webchanges.readthedocs.io/en/stable/filters.html#absolute-links

Please let me know if there are problems with it. (I also just noticed an error in the documentation, which incorrectly shows it as new in version 3.17.)