overcast07 / wayback-machine-spn-scripts

Bash scripts which interact with Internet Archive Wayback Machine's Save Page Now
MIT License
104 stars 9 forks

Ability to modify outlink regexes on the fly #11

Closed AlexanderRitter02 closed 2 years ago

AlexanderRitter02 commented 2 years ago

The problem

During a run with outlinks enabled, you sometimes discover links you don't want to save (for example, auth pages, or an endless number of distinct URLs that all point to the same content) that you couldn't account for when writing the initial regexes. The same goes for unintended matches of the outlink regex that the user didn't anticipate.

Stopping the save is not always an option, and it comes with several difficulties of its own (e.g. having to avoid saving URLs multiple times, or not discovering all links).

This calls for a way to modify the regex while the program is running.

Suggested solution (new feature)

Keep files containing the regexes in the data directory, and reload the regexes into the program whenever the files are modified.

I took the functionality of https://github.com/ArchiveTeam/grab-site as an example: it generates an ignores file at startup, where you can add a new regex on each line and change existing ones. When the file is saved, the regexes are reloaded into grab-site, and if any of them matches, the URL is ignored.

For your project, I suggest the following implementation:

  1. In the data folder, generate two additional files:
    • includes.txt: Contains the regex specified in the -o option on its first line
    • excludes.txt: Contains the regex specified in the -x option on its first line
  2. Monitor these files for changes (just as you already do with status_rate.txt and max_parallel_jobs.txt)
  3. If a file was changed, load each line of the file as a regex. A URL should count as a match as soon as any one of the regexes matches.
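
As a rough illustration of steps 1 to 3 (not the script's actual code), per-line matching could look something like the sketch below. The function names `matches_any_line` and `should_save` are hypothetical, the data folder is passed in as an argument, and Bash's `=~` operator (POSIX extended regex) stands in for whatever matching the real script uses. In this sketch a missing includes.txt means nothing is saved.

```bash
# Hypothetical helper: succeeds if the URL matches at least one non-empty
# line of the pattern file, treating each line as an extended regex.
matches_any_line() {
    local url="$1" pattern_file="$2" pattern
    [[ -f "$pattern_file" ]] || return 1
    # The file is opened again on every call, so edits made while the
    # program is running are picked up on the next URL.
    while IFS= read -r pattern || [[ -n "$pattern" ]]; do
        [[ -z "$pattern" ]] && continue   # skip blank lines
        if [[ "$url" =~ $pattern ]]; then
            return 0
        fi
    done < "$pattern_file"
    return 1
}

# Hypothetical filter: save a URL only if some include regex matches
# and no exclude regex matches.
should_save() {
    local url="$1" data_dir="$2"
    matches_any_line "$url" "$data_dir/includes.txt" &&
        ! matches_any_line "$url" "$data_dir/excludes.txt"
}
```

With this shape, adding a line such as `/login` to excludes.txt during a run would take effect the next time `should_save` is called.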

overcast07 commented 2 years ago

This is a good idea. I would need to define behavior for what happens when you restart a session with the -r flag. It would seem to be an improvement on the current behavior if implemented, since currently the outlinks settings aren't kept when restarting.

The way that the existing .txt files are used is that they are just read every time they are used. In this way, the script already checks the files for modifications (by not keeping the data in memory in the first place), though it's perhaps not particularly efficient.

I think loading each line individually shouldn't be necessary, since you could just replace line breaks with the | character (assuming that each line forms a valid regex).
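
A minimal sketch of that idea, assuming a pattern file with one valid extended regex per line: the file is re-read on every call (so edits are picked up without any extra change detection), and its non-empty lines are joined with `|`. The names `combined_pattern` and `url_matches_file` are made up for illustration and are not taken from the scripts.

```bash
# Hypothetical helper: print the non-empty lines of a pattern file
# joined into a single regex with "|".
combined_pattern() {
    local pattern_file="$1"
    [[ -f "$pattern_file" ]] || return 1
    grep -v '^$' "$pattern_file" | paste -sd '|' -
}

# Hypothetical helper: succeeds if the URL matches the combined regex.
# The file is read again on each call rather than cached in memory.
url_matches_file() {
    local url="$1" pattern
    pattern=$(combined_pattern "$2") || return 1
    [[ -n "$pattern" ]] || return 1       # an empty file matches nothing
    printf '%s\n' "$url" | grep -Eq -- "$pattern"
}
```

Because alternation has the lowest precedence in extended regexes, joining `^foo` and `bar$` into `^foo|bar$` should behave the same as testing the two lines separately.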

overcast07 commented 2 years ago

For restarting an aborted session, the script would need to be modified to check whether includes.txt and excludes.txt exist; if either of them exists, it should be copied into the new directory, but it would still be overridden if -o or -x is used in the restart command. Additionally, the completed URLs of the original index would have to be retained and appended to the new index, and the same check-if-exists behavior would have to be applied to any other settings for which it would make sense. I don't think any other changes would be needed.
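
The copy-unless-overridden part of this could be sketched roughly as follows. The function name `carry_over_pattern` and its arguments are hypothetical, and the sketch only covers the pattern files, not the index handling.

```bash
# Hypothetical helper: populate a pattern file in the new session folder.
#   $1 = file name (e.g. includes.txt or excludes.txt)
#   $2 = folder of the aborted session
#   $3 = folder of the new session
#   $4 = regex given via -o or -x on restart (empty if the flag was not used)
carry_over_pattern() {
    local name="$1" old_dir="$2" new_dir="$3" override="$4"
    if [[ -n "$override" ]]; then
        # A regex given on the restart command line takes priority.
        printf '%s\n' "$override" > "$new_dir/$name"
    elif [[ -f "$old_dir/$name" ]]; then
        # Otherwise keep the settings of the aborted session.
        cp -- "$old_dir/$name" "$new_dir/$name"
    fi
}
```

For example, `carry_over_pattern excludes.txt "$old_dir" "$new_dir" "$exclude_regex"` would keep the old exclusions only when `$exclude_regex` is empty.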

AlexanderRitter02 commented 2 years ago

Hi, thanks for responding. Detailed analysis, I agree with everything you said.

And yes, loading only one regex from the file works; the alternation character (|) is perfectly fine for that. The only benefits of multiple lines I see are that the regex won't get too long to edit in the file, and that I was used to it from the other program, grab-site. It depends very much on what you think makes more sense and what is easier to implement. Keeping outlink settings when resuming would definitely be the expected behaviour.

I don't think I have anything else to add, except maybe a thank you for sharing these useful scripts :-). I appreciate the effort.

overcast07 commented 2 years ago

I have now implemented the suggested behavior; the files are called include_pattern.txt and exclude_pattern.txt instead of the suggested names, newlines are not treated specially, and I didn't modify the behavior of any flags other than -o, -x and -r.