Open eggplantedd opened 1 year ago
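Your regex does not check whether the words are contained in the URL. The \b on each side only matches at a word boundary, so a keyword is found only when it stands alone; it is missed when it sits inside a longer run of letters, digits or underscores, as in BasicManual or usersmanual_e. You need something like this: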
wayback_machine_downloader http://www.olympusamerica.com --exclude "/\b(?<!@)^.*(printer|manual|investor|environment|report|form|fax|certificate|medical|instruction|rebate).*?\b/i" --only "/\.(pdf)$/i"
I have tested it with the four URLs you provided and it works.
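For anyone who wants to reproduce that check without running a download, here is a minimal sketch. It uses Python's re module rather than the Ruby regex engine the downloader itself runs on, on the assumption that \b, the (?<!@) lookbehind and case-insensitive matching behave the same for these patterns. The names KEYWORDS, bounded, unbounded, answer and urls are made up for the illustration, and the unbounded variant is simply the question's alternation with the \b anchors removed, not something taken from the thread.

```python
import re

KEYWORDS = ("printer|manual|investor|environment|report|form|fax|"
            "certificate|medical|instruction|rebate")

# Pattern from the question: \b on both sides demands a standalone word.
bounded = re.compile(r"\b(?<!@)(" + KEYWORDS + r")\b", re.IGNORECASE)

# Hypothetical simplification: same alternation, no word boundaries,
# so a keyword matches anywhere inside the URL.
unbounded = re.compile(r"(?<!@)(" + KEYWORDS + r")", re.IGNORECASE)

# Pattern suggested in the answer, copied as-is.
answer = re.compile(r"\b(?<!@)^.*(" + KEYWORDS + r").*?\b", re.IGNORECASE)

# The four URLs from the question that were still being downloaded.
urls = [
    "http://www.olympusamerica.com:80/files/FE200BasicManual.pdf",
    "http://www.olympusamerica.com/files/Stylus%20740_750%20Instruction%20Manual%20Spanish.pdf",
    "http://olympusamerica.com:80/files/Stylus740_750InstructionManual.pdf",
    "http://www.olympusamerica.com:80/seg_section/seg_download_mb_file.asp?f=/files/FV300_usersmanual_e.pdf",
]

for url in urls:
    # One row per URL: would each pattern exclude it?
    print(
        "bounded:", bool(bounded.search(url)),
        "unbounded:", bool(unbounded.search(url)),
        "answer:", bool(answer.search(url)),
        url,
    )
```

In this sketch the question's pattern matches none of the four URLs, because every keyword in them is glued to a neighbouring letter, digit or underscore (the 0 of %20 counts as a word character too), while the other two patterns match all four.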
I am downloading PDFs from the websites of companies that make photographic equipment.
Downloading only PDFs is easy enough, but I would like to exclude any URL which contains any of the following, case-insensitively:
printer|manual|investor|environment|report|form|fax|certificate|medical|instruction|rebate
So Manual, Investors and www.website.com/environment2004/ would all be caught.
At the moment I have tried this:
/\b(?<!@)(printer|manual|investor|environment|report|form|fax|certificate|medical|instruction|rebate)\b/gi
but it continues to download URLs such as:
http://www.olympusamerica.com:80/files/FE200BasicManual.pdf
http://www.olympusamerica.com/files/Stylus%20740_750%20Instruction%20Manual%20Spanish.pdf
http://olympusamerica.com:80/files/Stylus740_750InstructionManual.pdf
http://www.olympusamerica.com:80/seg_section/seg_download_mb_file.asp?f=/files/FV300_usersmanual_e.pdf
Can anyone point me to what I'm doing wrong?
The full command is:
wayback_machine_downloader http://www.olympusamerica.com* --exclude "/\b(?<!@)(printer|manual|investor|environment|report|form|fax|certificate|medical|instruction|rebate)\b/gi" --only "/\.(pdf)$/i"