Closed: ElDubsNZ closed this issue 2 years ago.
What operating system are you using the script on, and what shell are you using? I'm not able to reproduce the issue in bash/zsh on macOS, and I haven't had that issue before. I have not tested the script in other Unix shells.
When archiving that URL using the command-line input that you provided, I got this in the file success-json.log (which presumably corresponds to your capture from earlier):
{"job_id":"spn2-710f366e875a7b05339ee2933ce73423abb4ee66","duration_sec":0.178,"original_url":"https://www.parliament.nz/media/9655/corrections-victim-protection-amendment-bill.pdf","status":"success","counters":{"outlinks":0,"embeds":0},"timestamp":"20220923140652","http_status":200,"first_archive":true,"resources":[]}
This indicates to me that the script worked, since the original_url value is correct.
I've got it running as a scheduled cron job on Ubuntu Server 22.04.
I have it run every 12 hours. However, after each run it fills the txt file with those $ symbols, and checking the calendar on the link I provided shows that it is not archiving at that frequency. All non-PDF links, on the other hand, work absolutely perfectly. Not a single issue.
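In case it's useful, the cron entry is along these lines (the paths, and the exact way the list file is passed, are placeholders here rather than my real setup):

    # run the archiving script every 12 hours (at 00:00 and 12:00)
    0 */12 * * * /home/myuser/spn.sh /home/myuser/pdfs.txt >> /home/myuser/spn-cron.log 2>&1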
What's the name of the txt file that's being filled?
pdfs.txt
It does recognise the URLs in it, as I can see it run through each one.
At what point do the $ characters get added, and are they added to both the input file (i.e. pdfs.txt) and the index.txt file in the folder used/created by the running script? Does this happen immediately upon the script starting?
The script shouldn't be modifying the input file or the input data, obviously, so I'm inclined to assume that it's something with the environment rather than something in the script itself. I don't really know what could be causing this, though. Sorry if this isn't helpful.
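One thing that might be worth checking is whether the $ characters are literally in the file or just an artifact of how the file is being displayed. For example (assuming GNU coreutils), cat -A marks the end of every line with a $, so its output can look like what you're describing even when the file hasn't changed:

    # display non-printing characters; the trailing $ here is only cat marking each line ending
    cat -A pdfs.txt

    # count lines that actually contain a literal $ character in the file itself
    grep -cF '$' pdfs.txt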
I'll check into that the next time the script is due to run and come back to you. It's no problem! I appreciate the effort.
EDIT: I have a theory and unfortunately it will take a few days to test, so I'll come back with an update then!
I've resolved this issue. The "$" problem was a fault of my own: crontab was running my script as the root user, whereas when I ran it manually as my own user it worked fine. I've adjusted my script to solve that.
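For anyone hitting something similar, one way to deal with this (not necessarily exactly what I did) is to make sure the job runs as the intended user rather than root; roughly like this, with the username and paths as placeholders:

    # as root, edit the crontab of the user the script should run as
    crontab -u myuser -e

    # or test a one-off run the same way cron would execute it, as that user
    sudo -u myuser /home/myuser/spn.sh /home/myuser/pdfs.txt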
Interestingly, though, saves of the PDF files still weren't appearing in the Wayback Machine, but after a few days they show up. I noticed this earlier when I was capturing them manually, so I thought I'd test again, and it turns out that's right: every few days, the last few days of PDF backups appear.
The script is now running correctly; nothing was wrong with your script, and I really appreciate you looking into it.
The WBM sometimes takes about 12 hours to index captures. That helps reduce load on their systems.
I've got this running on a list of URLs and it's working great, except for any URLs in that list that are a PDF. However, if I use another service, such as the Wayback Machine extension, PDF files work perfectly fine. Is there any particular reason this script might not work with them? I've tried several times now.
As an example, I tested "spn.sh https://www.parliament.nz/media/9655/corrections-victim-protection-amendment-bill.pdf", which comes back saying [Job completed], but it is not actually completed. Testing other PDFs, I get the same thing, but again, if I use the Chrome/Edge extension, it works fine.
Within the txt file containing the list of PDF pages that I run the script on, every URL ends up with a "$" appended, and there are some lines in between containing only "$".