thp / urlwatch

Watch (parts of) webpages and get notified when something changes via e-mail, on your phone or via other means. Highly configurable.
https://thp.io/2008/urlwatch/

Adds option to keep more than 1 historical snapshot when running with --gc-cache #732

Closed trevorshannon closed 1 year ago

trevorshannon commented 1 year ago

This feature would be convenient for me, so I figured I'd make a PR in case you want to merge it. It adds flexibility to the db garbage collection while preserving the existing behavior.

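In summary: --gc-cache now takes an optional retain limit (the number of most recent snapshots to keep per job). With no argument it keeps only the latest snapshot, as before, and values below 1 are rejected.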

I didn't see any existing tests for storage.py so I just did some tests on the command line. I was not able to test with a redis db, but if required, I'm sure I could figure out how to get it set up.

Here are the test results.

First, the jobs:

➜  urlwatch git:(feat/gc-limit) cat ~/urls.yaml                                                        
command: date
kind: shell
name: watchdog
---
command: date -R
kind: shell
name: watchdog2

The first job has some history whereas the second job is brand new:

➜  urlwatch git:(feat/gc-limit) ./urlwatch --urls ~/urls.yaml --config ~/urlwatch.yaml --dump-history 1
==============================
2022-12-09 13:52
------------------------------
Fri Dec  9 13:52:24 PST 2022

============================== 

==============================
2022-12-09 13:53
------------------------------
Fri Dec  9 13:53:13 PST 2022

============================== 

==============================
2022-12-09 13:53
------------------------------
Fri Dec  9 13:53:15 PST 2022

============================== 

3 historic snapshot(s) available
➜  urlwatch git:(feat/gc-limit) ./urlwatch --urls ~/urls.yaml --config ~/urlwatch.yaml --dump-history 2
0 historic snapshot(s) available

Running with --gc-cache 2 retains the 2 most recent snapshots for job 1 and does nothing for job 2:

➜  urlwatch git:(feat/gc-limit) ./urlwatch --urls ~/urls.yaml --config ~/urlwatch.yaml --gc-cache 2    
Removed 1 old versions of e927d0677c77241b707442314346326278051dd6
➜  urlwatch git:(feat/gc-limit) ./urlwatch --urls ~/urls.yaml --config ~/urlwatch.yaml --dump-history 1
==============================
2022-12-09 13:53
------------------------------
Fri Dec  9 13:53:13 PST 2022

============================== 

==============================
2022-12-09 13:53
------------------------------
Fri Dec  9 13:53:15 PST 2022

============================== 

2 historic snapshot(s) available
➜  urlwatch git:(feat/gc-limit) ./urlwatch --urls ~/urls.yaml --config ~/urlwatch.yaml --dump-history 2
0 historic snapshot(s) available
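
The retention rule exercised above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in storage.py: the `gc_snapshots` function and the in-memory `history` dict (job GUID mapped to a list of `(timestamp, data)` tuples) are hypothetical stand-ins for the real cache backend.

```python
from typing import Dict, List, Tuple

Snapshot = Tuple[float, str]  # (timestamp, data); hypothetical in-memory shape


def gc_snapshots(history: Dict[str, List[Snapshot]], retain_limit: int = 1) -> Dict[str, int]:
    """Keep only the `retain_limit` newest snapshots per job GUID.

    Assumes retain_limit >= 1 (the caller is expected to validate it).
    Returns the number of snapshots removed for each GUID that lost any.
    """
    removed = {}
    for guid, snapshots in history.items():
        # Sort newest first so the slice below keeps the most recent entries.
        snapshots.sort(key=lambda snap: snap[0], reverse=True)
        excess = len(snapshots) - retain_limit
        if excess > 0:
            del snapshots[retain_limit:]
            removed[guid] = excess
    return removed


history = {
    "job1": [(1.0, "13:52:24"), (2.0, "13:53:13"), (3.0, "13:53:15")],
    "job2": [],  # brand-new job with no history yet
}
removed = gc_snapshots(history, retain_limit=2)
print(removed)  # only job1 had more than 2 snapshots
```

A job with fewer snapshots than the limit is left untouched, which matches the "does nothing for job 2" result in the transcript.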

Populate the snapshot history a bit for both jobs, resulting in 4 snapshots for job 1 and 2 snapshots for job 2...

➜  urlwatch git:(feat/gc-limit) ./urlwatch --urls ~/urls.yaml --config ~/urlwatch.yaml                 
===========================================================================
01. CHANGED: watchdog
02. NEW: watchdog2
===========================================================================

➜  urlwatch git:(feat/gc-limit) ./urlwatch --urls ~/urls.yaml --config ~/urlwatch.yaml
===========================================================================
01. CHANGED: watchdog
02. CHANGED: watchdog2
===========================================================================

➜  urlwatch git:(feat/gc-limit) ./urlwatch --urls ~/urls.yaml --config ~/urlwatch.yaml --dump-history 1
==============================
2022-12-09 13:53
------------------------------
Fri Dec  9 13:53:13 PST 2022

============================== 

==============================
2022-12-09 13:53
------------------------------
Fri Dec  9 13:53:15 PST 2022

============================== 

==============================
2022-12-09 13:54
------------------------------
Fri Dec  9 13:54:28 PST 2022

============================== 

==============================
2022-12-09 13:54
------------------------------
Fri Dec  9 13:54:31 PST 2022

============================== 

4 historic snapshot(s) available
➜  urlwatch git:(feat/gc-limit) ./urlwatch --urls ~/urls.yaml --config ~/urlwatch.yaml --dump-history 2
==============================
2022-12-09 13:54
------------------------------
Fri, 09 Dec 2022 13:54:28 -0800

============================== 

==============================
2022-12-09 13:54
------------------------------
Fri, 09 Dec 2022 13:54:31 -0800

============================== 

2 historic snapshot(s) available

... and run with no retain limit argument (i.e. a bare --gc-cache), which preserves the old behavior of keeping only the latest snapshot:

➜  urlwatch git:(feat/gc-limit) ./urlwatch --urls ~/urls.yaml --config ~/urlwatch.yaml --gc-cache      
Removed 3 old versions of e927d0677c77241b707442314346326278051dd6
Removed 1 old versions of cdb05c1ba536b611913231a4561ab66b75774277
➜  urlwatch git:(feat/gc-limit) ./urlwatch --urls ~/urls.yaml --config ~/urlwatch.yaml --dump-history 1
==============================
2022-12-09 13:54
------------------------------
Fri Dec  9 13:54:31 PST 2022

============================== 

1 historic snapshot(s) available
➜  urlwatch git:(feat/gc-limit) ./urlwatch --urls ~/urls.yaml --config ~/urlwatch.yaml --dump-history 2
==============================
2022-12-09 13:54
------------------------------
Fri, 09 Dec 2022 13:54:31 -0800

============================== 

1 historic snapshot(s) available
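
One way to get an option that works both bare and with a value is argparse's `nargs='?'` together with `const`. The sketch below is an assumption about how such a flag could be declared, not a copy of urlwatch's actual argument definitions; the `prog`, `metavar` and help text are made up for illustration.

```python
import argparse

parser = argparse.ArgumentParser(prog="urlwatch-sketch")  # hypothetical program name
parser.add_argument(
    "--gc-cache",
    metavar="RETAIN_LIMIT",
    type=int,
    nargs="?",     # the value after the flag is optional
    const=1,       # bare --gc-cache: keep 1 snapshot (the old behavior)
    default=None,  # flag absent: no garbage collection requested
    help="remove old cache entries, keeping the newest RETAIN_LIMIT per job",
)

bare = parser.parse_args(["--gc-cache"])
explicit = parser.parse_args(["--gc-cache", "2"])
absent = parser.parse_args([])
print(bare.gc_cache, explicit.gc_cache, absent.gc_cache)  # prints: 1 2 None
```

With this shape, existing invocations that pass --gc-cache without a value keep behaving as before.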

Invalid retain limit args are rejected:

➜  urlwatch git:(feat/gc-limit) ./urlwatch --urls ~/urls.yaml --config ~/urlwatch.yaml --gc-cache 0    
Traceback (most recent call last):
  File "./urlwatch", line 9, in <module>
    main()
  File "/Users/trevorshannon/projects/urlwatch/lib/urlwatch/cli.py", line 112, in main
    urlwatch_command.run()
  File "/Users/trevorshannon/projects/urlwatch/lib/urlwatch/command.py", line 431, in run
    self.handle_actions()
  File "/Users/trevorshannon/projects/urlwatch/lib/urlwatch/command.py", line 224, in handle_actions
    self.urlwatcher.cache_storage.gc([job.get_guid() for job in self.urlwatcher.jobs], self.urlwatch_config.gc_cache)
  File "/Users/trevorshannon/projects/urlwatch/lib/urlwatch/storage.py", line 472, in gc
    raise ValueError(f'Cache garbage collection must retain at least 1 historical snapshot per job (requested: {retain_limit})')
ValueError: Cache garbage collection must retain at least 1 historical snapshot per job (requested: 0)
➜  urlwatch git:(feat/gc-limit) ./urlwatch --urls ~/urls.yaml --config ~/urlwatch.yaml --gc-cache -2
Traceback (most recent call last):
  File "./urlwatch", line 9, in <module>
    main()
  File "/Users/trevorshannon/projects/urlwatch/lib/urlwatch/cli.py", line 112, in main
    urlwatch_command.run()
  File "/Users/trevorshannon/projects/urlwatch/lib/urlwatch/command.py", line 431, in run
    self.handle_actions()
  File "/Users/trevorshannon/projects/urlwatch/lib/urlwatch/command.py", line 224, in handle_actions
    self.urlwatcher.cache_storage.gc([job.get_guid() for job in self.urlwatcher.jobs], self.urlwatch_config.gc_cache)
  File "/Users/trevorshannon/projects/urlwatch/lib/urlwatch/storage.py", line 472, in gc
    raise ValueError(f'Cache garbage collection must retain at least 1 historical snapshot per job (requested: {retain_limit})')
ValueError: Cache garbage collection must retain at least 1 historical snapshot per job (requested: -2)
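
The check behind these tracebacks can be sketched as follows. The error message is copied from the output above; the `check_retain_limit` helper name is hypothetical, and the real check lives inside the `gc` method in storage.py, which is not reproduced here.

```python
def check_retain_limit(retain_limit: int) -> int:
    """Return retain_limit unchanged, or raise if it would delete every snapshot."""
    if retain_limit < 1:
        raise ValueError(
            'Cache garbage collection must retain at least 1 historical '
            f'snapshot per job (requested: {retain_limit})'
        )
    return retain_limit


for value in (2, 0, -2):
    try:
        check_retain_limit(value)
        print(f"{value}: accepted")
    except ValueError as exc:
        print(f"{value}: rejected ({exc})")
```

Rejecting values below 1 before anything is deleted means a mistyped limit cannot wipe a job's entire history.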

trevorshannon commented 1 year ago

I could not determine if manpage.rst is auto-generated, so I did not edit that file in case it is.

thp commented 1 year ago

> I could not determine if manpage.rst is auto-generated, so I did not edit that file in case it is.

Please do edit it and then run update-manpages.sh and check in the results (the manpages are generated from the ReStructuredText files).

thp commented 1 year ago

Looks good so far. Need to fix CI (unrelated to this PR).

thp commented 1 year ago

Please rebase against master so that the CI tests can run (see #733).

trevorshannon commented 1 year ago

Thanks for taking a look. I made the edit to manpage.rst and ran bash update-manpages.sh. There are more edits than I expected to the man pages, so perhaps the source and output were out of sync? I've never used sphinx; maybe there is also some sphinx user setting that's different between our two setups. I'm happy to change anything if needed.

thp commented 1 year ago

> Thanks for taking a look. I made the edit to manpage.rst and ran bash update-manpages.sh. There are more edits than I expected to the man pages, so perhaps the source and output were out of sync? I've never used sphinx; maybe there is also some sphinx user setting that's different between our two setups. I'm happy to change anything if needed.

Yeah, you're right. If you don't mind, maybe just update the sources in docs/ and don't re-generate the manpages yet; I can do that as part of the next release. Basically, remove the changes in share/man/ from 436b4dec7b7986c796c49ef8c075e06f817ffb8e. Would that work for you (git rebase, edit the commit and remove the share/man changes)? I think after that change, if I haven't missed something, this is ready to be merged.

trevorshannon commented 1 year ago

No problem, please take a look!