overcast07 / wayback-machine-spn-scripts

Bash scripts which interact with Internet Archive Wayback Machine's Save Page Now
MIT License

Not working on PDF urls #15

Closed by ElDubsNZ 1 year ago

ElDubsNZ commented 1 year ago

I've got this running on a list of URLs and it's working great, except for any URLs in that list that point to a PDF. However, if I use another service like the Wayback Machine extension, PDF files are saved perfectly fine. Any particular reason this script might not work with them? I've tried several times now.

As an example, I tested "spn.sh https://www.parliament.nz/media/9655/corrections-victim-protection-amendment-bill.pdf", which comes back saying [Job completed], but it's not completed. Testing this on other PDFs I get the same thing, but again if I use the Chrome/Edge extension, it works fine.

Within the txt file that I run on a list of PDF webpages, it changes it to have a "$" at the end of every URL, and some lines in between with only "$".

overcast07 commented 1 year ago

What operating system are you using the script on, and what shell are you using? I'm not able to reproduce the issue in bash/zsh on macOS, and I haven't had that issue before. I have not tested the script in other Unix shells.

When archiving that URL using the command prompt input that you provided, I got this in the file success-json.log (which presumably corresponds to your capture from earlier):

{"job_id":"spn2-710f366e875a7b05339ee2933ce73423abb4ee66","duration_sec":0.178,"original_url":"https://www.parliament.nz/media/9655/corrections-victim-protection-amendment-bill.pdf","status":"success","counters":{"outlinks":0,"embeds":0},"timestamp":"20220923140652","http_status":200,"first_archive":true,"resources":[]}

This indicates to me that the script worked, since the original_url value is correct.
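
For reference, a log entry like the one above can be turned into a direct capture address to open in a browser. This is only a minimal sketch, assuming each entry sits on a single line of success-json.log as shown, and that the usual `https://web.archive.org/web/<timestamp>/<url>` address form applies. The function name `capture_url` is made up for illustration and is not part of the script, and the sed-based extraction is a shortcut rather than a real JSON parser.

```shell
# Hypothetical helper (not part of spn.sh): build a Wayback Machine
# capture URL from one line of success-json.log.
# Assumes the "timestamp" and "original_url" keys look like the entry
# above; this is not a full JSON parser.
capture_url() {
    local line="$1" ts url
    ts=$(printf '%s\n' "$line" | sed -n 's/.*"timestamp":"\([0-9]*\)".*/\1/p')
    url=$(printf '%s\n' "$line" | sed -n 's/.*"original_url":"\([^"]*\)".*/\1/p')
    # Fail if either field could not be extracted
    [ -n "$ts" ] && [ -n "$url" ] || return 1
    printf 'https://web.archive.org/web/%s/%s\n' "$ts" "$url"
}
```

Calling `capture_url "$(tail -n 1 success-json.log)"` would then print the address for the most recent entry, assuming the log is in the current directory.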

ElDubsNZ commented 1 year ago

I've got it running as a scheduled crontab on Ubuntu Server 22.04.

I have it run every 12 hours. However, after each run, it fills the txt file with those $ symbols, and checking the calendar on the link provided shows it is not archiving at that frequency. All non-PDF links work absolutely perfectly, though. Not a single issue.

overcast07 commented 1 year ago

What's the name of the txt file that's being filled?

ElDubsNZ commented 1 year ago

pdfs.txt

It does recognise the URLs in it, as I see it run through each URL.

overcast07 commented 1 year ago

At what point do the $ characters get added, and are they added to both the input file (i.e. pdfs.txt) and the index.txt file in the folder used/created by the running script? Does this happen immediately upon the script starting?

The script shouldn't be modifying the input file or the input data, obviously, so I'm inclined to assume that it's something with the environment rather than something in the script itself. I don't really know what could be causing this, though. Sorry if this isn't helpful.
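
One way to narrow this down would be to check whether the `$` characters are really stored in the file, or whether they are only end-of-line markers drawn by a viewer (for example, `cat -A` and `cat -E` in GNU coreutils print a `$` at the end of every line, including blank ones). The following is a rough sketch under that assumption; `count_dollar_lines` is a made-up name, not something the script provides.

```shell
# Hypothetical check: count the lines in a URL list that end with a
# literal "$". A non-zero count means the characters are actually in
# the file rather than being end-of-line markers shown by a viewer.
count_dollar_lines() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so the exit status is ignored here.
    grep -c '\$$' "$1" || true
}
```

Running `count_dollar_lines pdfs.txt` before and after a scheduled run would show whether the file itself is being changed.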

ElDubsNZ commented 1 year ago

I'll check into that the next time the script is due to run and come back to you. It's no problem! I appreciate the effort.

EDIT: I have a theory and unfortunately it will take a few days to test, so I'll come back with an update then!

ElDubsNZ commented 1 year ago

I've resolved this issue. The "$" issue was my own fault: crontab was running my script as the root user, whereas when I ran it manually as my own user, it worked fine. I've adjusted my script to solve that.
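
The exact change isn't described here, so the following is only an illustration of one way a wrapper script could guard against running as the wrong user under cron. The function name `require_user` and the idea of passing in an expected account name are assumptions, not details from this thread.

```shell
# Hypothetical guard for a cron wrapper script: check which user the
# job is running as and refuse to continue as an unexpected one.
require_user() {
    local expected="$1" actual
    actual=$(id -un)    # name of the effective user
    if [ "$actual" != "$expected" ]; then
        printf 'running as %s, expected %s\n' "$actual" "$expected" >&2
        return 1
    fi
}
```

A wrapper could call something like `require_user myuser || exit 1` before invoking spn.sh, where `myuser` is a placeholder for the account that owns the URL list. Note that a job in root's crontab runs as root, while a job added with `crontab -e` under a normal account runs as that account.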

Interestingly though, saves of the PDF files still were not backing up to the wayback machine. But after a few days, they appear. I noticed this earlier when I was manually doing them, so thought I'd test again, and turns out that's right. Every few days, the last few days of PDF backups will appear.

Script is now running correctly, nothing was wrong with your script, but I really appreciate you looking into it.

TheTechRobo commented 1 year ago

> Interestingly though, saves of the PDF files still were not backing up to the wayback machine. But after a few days, they appear. I noticed this earlier when I was manually doing them, so thought I'd test again, and turns out that's right. Every few days, the last few days of PDF backups will appear.

The WBM sometimes takes about 12 hours to index captures. That helps reduce load on their systems.
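
To see whether a capture has been indexed yet, one option is the Wayback Machine availability API at `https://archive.org/wayback/available`, which reports the closest indexed snapshot for a URL. Below is a minimal sketch, assuming curl is installed and that the response is a single line of JSON containing a `"timestamp"` field for the closest snapshot. The function names are made up for illustration, the sed extraction is not a real JSON parser, and the API's own view may lag behind recent captures as well.

```shell
# Sketch: ask the availability API for the closest indexed capture.
# Requires curl; -G with --data-urlencode appends an encoded query string.
fetch_availability() {
    curl -sS -G 'https://archive.org/wayback/available' --data-urlencode "url=$1"
}

# Read an API response on stdin and print the snapshot timestamp,
# or return 1 if the response lists no snapshot.
closest_timestamp() {
    local ts
    ts=$(sed -n 's/.*"timestamp" *: *"\([0-9]*\)".*/\1/p' | head -n 1)
    [ -n "$ts" ] || return 1
    printf '%s\n' "$ts"
}
```

For example, `fetch_availability 'https://www.parliament.nz/media/9655/corrections-victim-protection-amendment-bill.pdf' | closest_timestamp` would print the timestamp of the closest indexed capture, which can be compared with the `timestamp` value recorded in success-json.log.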