Open ghiathkamel opened 3 months ago
+1
Came here to see if anyone else had the same issues. Did web.archive.org nuke the endpoint recently?
I think they blocked the downloader
This tool is listed on the Archive Wiki so I'd be interested to hear if this was an intended blocking of the tool.
EDIT: The tool hasn't been blocked; rather, I think it hasn't been updated to reflect changes on the Wayback Machine. This other issue provides instructions on how to use a fork that has the necessary fixes applied.
It appears that the Wayback server has been overwhelmed by download activity and they are actively attempting to reduce traffic. I have made the following modifications to wayback_machine_downloader to slow it down, which significantly reduces but does not eliminate the problem. So that all pages can still be downloaded, a retry was added for the few connection refused errors that still occur. I suspect that slowing it down even more would also eliminate the errors, but this is a compromise between speed and playing nice.
First locate the ruby file by running the following command:
gem env
The source file "wayback_machine_downloader.rb" should be located in one of the GEM PATHS.
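If scanning the `gem env` output by hand is tedious, the same lookup can be sketched in Ruby. This is only an illustration: `find_gem_source` is a hypothetical helper name, and it assumes the usual RubyGems layout of `<gem path>/gems/<name>-<version>/lib/<file>`.

```ruby
require "rubygems"

# Hypothetical helper (not part of the gem): searches each gem path for
# a library file, assuming the layout <gem path>/gems/<name>-<version>/lib/.
def find_gem_source(file_name, gem_paths = Gem.path)
  gem_paths.flat_map do |root|
    Dir.glob(File.join(root, "gems", "*", "lib", file_name))
  end
end

if __FILE__ == $PROGRAM_NAME
  # Prints every installed copy, one path per line (nothing if not installed).
  puts find_gem_source("wayback_machine_downloader.rb")
end
```

If several versions of the gem are installed, more than one path is printed, and the one belonging to the version you actually run is the one to edit.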
With your editor of choice, open wayback_machine_downloader.rb and add the lines marked <<< INSERT:
  unless File.exist? file_path
    begin
      structure_dir_path dir_path
      open(file_path, "wb") do |file|
        begin
          URI("https://web.archive.org/web/#{file_timestamp}id_/#{file_url}").open("Accept-Encoding" => "plain") do |uri|
            file.write(uri.read)
          end
        rescue OpenURI::HTTPError => e
          puts "(1) - #{file_url} # #{e}"
          if @all
            file.write(e.io.read)
            puts "(2) - #{file_path} saved anyway."
          end
        rescue StandardError => e
          puts "(3) - #{file_url} # #{e}"
          sleep(30) # <<< INSERT
          retry     # <<< INSERT
        end
      end
    rescue StandardError => e
      puts "(4) - #{file_url} # #{e}"
    ensure
      if not @all and File.exist?(file_path) and File.size(file_path) == 0
        File.delete(file_path)
        puts "(5) - #{file_path} was empty and was removed."
      end
    end
    semaphore.synchronize do
      @processed_file_count += 1
      puts "(6) - #{file_url} -> #{file_path} (#{@processed_file_count}/#{file_list_by_timestamp.size})"
    end
    sleep(2) # <<< INSERT
  else
    semaphore.synchronize do
      @processed_file_count += 1
      puts "(7) - #{file_url} # #{file_path} already exists. (#{@processed_file_count}/#{file_list_by_timestamp.size})"
    end
  end
end
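One caveat with the patch: a bare retry loops forever if the server never recovers. A bounded variant might look like the sketch below. It is not part of wayback_machine_downloader; `with_retries`, its parameters, and the default values are all assumptions made for illustration, and the wait simply grows linearly with each attempt.

```ruby
# Hypothetical helper, not part of the gem: retries the block a limited
# number of times, sleeping a little longer after each failed attempt.
def with_retries(max_attempts: 5, base_delay: 30,
                 on: [Errno::ECONNREFUSED],
                 sleeper: Kernel.method(:sleep))
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue *on => e
    # Give up and re-raise the original error once the budget is spent.
    raise if attempts >= max_attempts
    puts "attempt #{attempts} failed (#{e.class}), retrying"
    sleeper.call(base_delay * attempts)
    retry
  end
end
```

In the patched code, the `URI(...).open` call would go inside a `with_retries do ... end` block in place of the `sleep(30)` / `retry` pair. The `sleeper` parameter exists only so the waiting can be stubbed out when trying the helper.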
Hello
The tool is not working anymore:
Getting snapshot pages....................
C:/Ruby26-x64/lib/ruby/2.6.0/net/http.rb:949:in `rescue in block in connect': Failed to open TCP connection to web.archive.org:443 (No connection could be made because the target machine actively refused it. - connect(2) for "web.archive.org" port 443) (Errno::ECONNREFUSED)
        from C:/Ruby26-x64/lib/ruby/2.6.0/net/http.rb:946:in `block in connect'
        from C:/Ruby26-x64/lib/ruby/2.6.0/timeout.rb:93:in `block in timeout'
        from C:/Ruby26-x64/lib/ruby/2.6.0/timeout.rb:103:in `timeout'
        from C:/Ruby26-x64/lib/ruby/2.6.0/net/http.rb:945:in `connect'
        from C:/Ruby26-x64/lib/ruby/2.6.0/net/http.rb:930:in `do_start'
        from C:/Ruby26-x64/lib/ruby/2.6.0/net/http.rb:919:in `start'
        from C:/Ruby26-x64/lib/ruby/2.6.0/open-uri.rb:337:in `open_http'
        from C:/Ruby26-x64/lib/ruby/2.6.0/open-uri.rb:756:in `buffer_open'
        from C:/Ruby26-x64/lib/ruby/2.6.0/open-uri.rb:226:in `block in open_loop'
        from C:/Ruby26-x64/lib/ruby/2.6.0/open-uri.rb:224:in `catch'
        from C:/Ruby26-x64/lib/ruby/2.6.0/open-uri.rb:224:in `open_loop'
        from C:/Ruby26-x64/lib/ruby/2.6.0/open-uri.rb:165:in `open_uri'
        from C:/Ruby26-x64/lib/ruby/2.6.0/open-uri.rb:736:in `open'
        from C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/wayback_machine_downloader-2.3.1/lib/wayback_machine_downloader/archive_api.rb:13:in `get_raw_list_from_api'
        from C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/wayback_machine_downloader-2.3.1/lib/wayback_machine_downloader.rb:92:in `block in get_all_snapshots_to_consider'
        from C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/wayback_machine_downloader-2.3.1/lib/wayback_machine_downloader.rb:91:in `times'
        from C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/wayback_machine_downloader-2.3.1/lib/wayback_machine_downloader.rb:91:in `get_all_snapshots_to_consider'
        from C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/wayback_machine_downloader-2.3.1/lib/wayback_machine_downloader.rb:131:in `get_file_list_all_timestamps'
        from C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/wayback_machine_downloader-2.3.1/lib/wayback_machine_downloader.rb:158:in `get_file_list_by_timestamp'
        from C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/wayback_machine_downloader-2.3.1/lib/wayback_machine_downloader.rb:309:in `file_list_by_timestamp'