thp / urlwatch

Watch (parts of) webpages and get notified when something changes via e-mail, on your phone or via other means. Highly configurable.
https://thp.io/2008/urlwatch/

diff(erent) data types #183

Closed: julianuu closed this issue 4 years ago

julianuu commented 6 years ago

Thanks a lot for your work! I find it really useful.

A wish: I would like to monitor a website that contains a list of links to PDF files, and I would like to know when the PDFs change. I imagine one way to do this currently is to create a filter that finds the links, downloads the PDFs and turns them into text. But that's not really cool, and maybe someone would also like to monitor other kinds of files that cannot easily be turned into text. Do you know a way to make it work? EDIT: I just saw there is a sha1sum filter. OK, that would work for detecting that something has changed.
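The hashing idea from the EDIT can be sketched in a few lines. This is only an illustration of change detection by digest, not urlwatch's own code; the function names `fingerprint` and `has_changed` are made up for the example.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-1 hex digest of the raw file contents."""
    return hashlib.sha1(data).hexdigest()


def has_changed(old_digest: str, new_data: bytes) -> bool:
    """Compare a stored digest against freshly downloaded bytes."""
    return fingerprint(new_data) != old_digest


# Store only the digest of the first download, then compare later downloads.
stored = fingerprint(b"%PDF-1.4 first version")
print(has_changed(stored, b"%PDF-1.4 first version"))   # same bytes
print(has_changed(stored, b"%PDF-1.4 second version"))  # different bytes
```

A digest only says that the file changed, not what changed inside it, which is the limitation the rest of this request is about.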

I have not yet understood how the whole program is structured, but my impression is that it follows this workflow: get data -> filter -> store data, diff & report. Is this correct? If yes, one way of making my wish work would be to change this into: get data -> filter -> store data, diff -> report, and to let filter and diff also pass on the type of the data being handled. Then a differ that can diff PDFs could be added.
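To make the proposal concrete, here is a rough sketch of a diff step that picks a differ based on the data type passed along by the filter. Everything here is hypothetical (the `DIFFERS` registry, `diff_typed`, and the type strings); it does not reflect how urlwatch is actually structured.

```python
import difflib
import hashlib
from typing import Callable, Dict

# A differ takes the old and new payload and returns a report string,
# or an empty string when nothing changed.
Differ = Callable[[bytes, bytes], str]


def diff_text(old: bytes, new: bytes) -> str:
    """Line-based unified diff for textual data."""
    old_lines = old.decode("utf-8", errors="replace").splitlines()
    new_lines = new.decode("utf-8", errors="replace").splitlines()
    return "\n".join(
        difflib.unified_diff(old_lines, new_lines, "old", "new", lineterm="")
    )


def diff_binary(old: bytes, new: bytes) -> str:
    """Fallback for data that cannot be diffed line by line."""
    if old == new:
        return ""
    return "binary data changed: sha1 {} -> {}".format(
        hashlib.sha1(old).hexdigest(), hashlib.sha1(new).hexdigest()
    )


# Hypothetical registry; a PDF-aware differ could be added under
# "application/pdf" without touching the report step.
DIFFERS: Dict[str, Differ] = {"text/plain": diff_text}


def diff_typed(data_type: str, old: bytes, new: bytes) -> str:
    """Dispatch to the differ registered for data_type, else the fallback."""
    return DIFFERS.get(data_type, diff_binary)(old, new)
```

With this split, the report step only sees the string returned by `diff_typed` and does not need to know which data type produced it.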

What do you think? I know maybe it requires too much restructuring to implement this feature, but at the moment I use a script I wrote myself and yours has way more features and is way more stable and easy to use, so for me it feels like a waste of time to put more work into my crappy version when something like this here already exists ;)

ad-m commented 6 years ago

Write a script that converts the PDF to text on stdout, and use a shell job instead of a url job.
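Such a script might look like the sketch below. It assumes the `pdftotext` command-line tool from poppler-utils is installed and that the PDF has already been downloaded to a local path; the helper names `build_command` and `pdf_to_text` are made up for the example.

```python
import subprocess
import sys
from typing import List


def build_command(pdf_path: str) -> List[str]:
    """Build the pdftotext call; "-" as output file means stdout."""
    return ["pdftotext", "-layout", pdf_path, "-"]


def pdf_to_text(pdf_path: str) -> str:
    """Run pdftotext and return the extracted text.

    Raises subprocess.CalledProcessError if the conversion fails.
    """
    result = subprocess.run(
        build_command(pdf_path), check=True, capture_output=True
    )
    return result.stdout.decode("utf-8", errors="replace")


# Only runs when called as a script with exactly one argument (the PDF path).
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.write(pdf_to_text(sys.argv[1]))
```

The shell job would then run this script and diff whatever it prints, so fetching the PDF itself still has to happen before or inside the script.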

thp commented 4 years ago

There's now a pdf2text filter implemented in the master branch.