hartator / wayback-machine-downloader

Download an entire website from the Wayback Machine.

Trouble excluding URLs from download #231

Open eggplantedd opened 1 year ago

eggplantedd commented 1 year ago

I am downloading PDFs from websites that make photographic equipment.

Downloading only PDFs is easy enough, but I would like to exclude any URL that contains any of the following keywords, case-insensitively:

printer|manual|investor|environment|report|form|fax|certificate|medical|instruction|rebate

so Manual, Investors, and www.website.com/environment2004/ would all be caught.
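For reference, that requirement can be restated as a small Ruby check (Ruby only because the downloader is a Ruby gem; the pattern here just illustrates the intent and is not a tested --exclude value):

```ruby
# Restating the requirement: a URL should be excluded when it contains any of
# the keywords, case-insensitively, even inside a longer word.
keywords = /printer|manual|investor|environment|report|form|fax|certificate|medical|instruction|rebate/i

["Manual", "Investors", "www.website.com/environment2004/"].each do |example|
  puts "#{example}: #{keywords.match?(example) ? 'caught' : 'missed'}"
end
```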

At the moment I have tried this:

/\b(?<!@)(print|manual|investor|environment|report|form|fax|certificate|medical|instruction|rebate)\b/gi

but it continues to download URLs such as:

http://www.olympusamerica.com:80/files/FE200BasicManual.pdf
http://www.olympusamerica.com/files/Stylus%20740_750%20Instruction%20Manual%20Spanish.pdf
http://olympusamerica.com:80/files/Stylus740_750InstructionManual.pdf
http://www.olympusamerica.com:80/seg_section/seg_download_mb_file.asp?f=/files/FV300_usersmanual_e.pdf

Can anyone point me to what I'm doing wrong?

The full command is:

wayback_machine_downloader http://www.olympusamerica.com* --exclude "/\b(?<!@)(printer|manual|investor|environment|report|form|fax|certificate|medical|instruction|rebate)\b/gi" --only "/\.(pdf)$/i"
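For anyone trying to reproduce this locally, here is a minimal check of that exclude pattern against the four URLs that slipped through, assuming the //-delimited filter is evaluated as a Ruby regex (Ruby regexes have no g flag, so only the i flag is kept):

```ruby
# Checking the posted exclude pattern against the URLs that slipped through.
exclude = /\b(?<!@)(printer|manual|investor|environment|report|form|fax|certificate|medical|instruction|rebate)\b/i

urls = [
  "http://www.olympusamerica.com:80/files/FE200BasicManual.pdf",
  "http://www.olympusamerica.com/files/Stylus%20740_750%20Instruction%20Manual%20Spanish.pdf",
  "http://olympusamerica.com:80/files/Stylus740_750InstructionManual.pdf",
  "http://www.olympusamerica.com:80/seg_section/seg_download_mb_file.asp?f=/files/FV300_usersmanual_e.pdf"
]

urls.each do |url|
  # The \b boundaries require the keyword to stand alone, so "Manual" inside
  # "FE200BasicManual" never matches and none of these URLs are excluded.
  puts "#{exclude.match?(url) ? 'EXCLUDED' : 'kept'}  #{url}"
end
```

All four URLs print as kept, which is consistent with the downloader still fetching them.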

giovanni-cutri commented 1 year ago

Your regex only matches the keywords as standalone words (because of the \b boundaries), so it never matches URLs where the keyword is embedded in a longer filename like FE200BasicManual.pdf. You need something like this:

wayback_machine_downloader http://www.olympusamerica.com --exclude "/\b(?<!@)^.*(printer|manual|investor|environment|report|form|fax|certificate|medical|instruction|rebate).*?\b/i" --only "/\.(pdf)$/i"

I have tested it with the four URLs you provided and it works.
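As a quick sanity check, the suggested pattern can be verified against one of those URLs in plain Ruby (same assumption that the //-delimited filter is evaluated as a Ruby regex):

```ruby
# The suggested pattern, checked against one of the URLs from the issue.
# With no \b directly in front of the keyword group, an embedded "Manual"
# now matches and the URL is excluded.
exclude = /\b(?<!@)^.*(printer|manual|investor|environment|report|form|fax|certificate|medical|instruction|rebate).*?\b/i

url = "http://www.olympusamerica.com:80/files/FE200BasicManual.pdf"
puts exclude.match?(url) ? "EXCLUDED" : "kept"   # => EXCLUDED
```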