hartator / wayback-machine-downloader

Download an entire website from the Wayback Machine.

Failed to open TCP connection to web.archive.org:443 (No connection could be made because the target machine actively refused it. - connect(2) for "web.archive.org" port 443) (Errno::ECONNREFUSED) #304

Open ghiathkamel opened 3 months ago

ghiathkamel commented 3 months ago

Hello

The tool is not working anymore.

    Getting snapshot pages....................
    C:/Ruby26-x64/lib/ruby/2.6.0/net/http.rb:949:in `rescue in block in connect': Failed to open TCP connection to web.archive.org:443 (No connection could be made because the target machine actively refused it. - connect(2) for "web.archive.org" port 443) (Errno::ECONNREFUSED)
      from C:/Ruby26-x64/lib/ruby/2.6.0/net/http.rb:946:in `block in connect'
      from C:/Ruby26-x64/lib/ruby/2.6.0/timeout.rb:93:in `block in timeout'
      from C:/Ruby26-x64/lib/ruby/2.6.0/timeout.rb:103:in `timeout'
      from C:/Ruby26-x64/lib/ruby/2.6.0/net/http.rb:945:in `connect'
      from C:/Ruby26-x64/lib/ruby/2.6.0/net/http.rb:930:in `do_start'
      from C:/Ruby26-x64/lib/ruby/2.6.0/net/http.rb:919:in `start'
      from C:/Ruby26-x64/lib/ruby/2.6.0/open-uri.rb:337:in `open_http'
      from C:/Ruby26-x64/lib/ruby/2.6.0/open-uri.rb:756:in `buffer_open'
      from C:/Ruby26-x64/lib/ruby/2.6.0/open-uri.rb:226:in `block in open_loop'
      from C:/Ruby26-x64/lib/ruby/2.6.0/open-uri.rb:224:in `catch'
      from C:/Ruby26-x64/lib/ruby/2.6.0/open-uri.rb:224:in `open_loop'
      from C:/Ruby26-x64/lib/ruby/2.6.0/open-uri.rb:165:in `open_uri'
      from C:/Ruby26-x64/lib/ruby/2.6.0/open-uri.rb:736:in `open'
      from C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/wayback_machine_downloader-2.3.1/lib/wayback_machine_downloader/archive_api.rb:13:in `get_raw_list_from_api'
      from C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/wayback_machine_downloader-2.3.1/lib/wayback_machine_downloader.rb:92:in `block in get_all_snapshots_to_consider'
      from C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/wayback_machine_downloader-2.3.1/lib/wayback_machine_downloader.rb:91:in `times'
      from C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/wayback_machine_downloader-2.3.1/lib/wayback_machine_downloader.rb:91:in `get_all_snapshots_to_consider'
      from C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/wayback_machine_downloader-2.3.1/lib/wayback_machine_downloader.rb:131:in `get_file_list_all_timestamps'
      from C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/wayback_machine_downloader-2.3.1/lib/wayback_machine_downloader.rb:158:in `get_file_list_by_timestamp'
      from C:/Ruby26-x64/lib/ruby/gems/2.6.0/gems/wayback_machine_downloader-2.3.1/lib/wayback_machine_downloader.rb:309:in `file_list_by_timestamp'
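
The trace shows the failure happening at the TCP connect step, before any HTTP request is sent. One way to check whether the refusal also occurs outside the gem is a plain socket probe. The following is only a sketch: `tcp_status` is a hypothetical helper, not part of wayback_machine_downloader, and the 5-second timeout is an arbitrary assumption.

```ruby
require "socket"

# Hypothetical helper (not part of wayback_machine_downloader): reports how a
# plain TCP connection attempt to host:port ends.
def tcp_status(host, port, timeout: 5)
  # Socket.tcp yields the connected socket and closes it when the block returns.
  Socket.tcp(host, port, connect_timeout: timeout) { :open }
rescue Errno::ECONNREFUSED
  :refused      # the same error class that appears in the trace above
rescue Errno::ETIMEDOUT
  :timed_out    # no answer within the timeout
rescue SocketError
  :dns_failure  # the host name could not be resolved
end

# Example call (performs a real network connection):
#   puts tcp_status("web.archive.org", 443)
```

If the probe reports `:refused` as well, the problem sits between the machine and web.archive.org rather than in the gem's own code.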

phil-hudson commented 3 months ago

+1

niclake commented 2 months ago

Came here to see if anyone else had the same issues. Did web.archive.org nuke the endpoint recently?

ghiathkamel commented 2 months ago

> Came here to see if anyone else had the same issues. Did web.archive.org nuke the endpoint recently?

I think they blocked the downloader

SHiLLySiT commented 2 months ago

This tool is listed on the Archive Wiki so I'd be interested to hear if this was an intended blocking of the tool.

EDIT: The tool hasn't been blocked; rather, I think it hasn't been updated to reflect changes on the Wayback Machine. This other issue provides instructions on how to use a fork that has the necessary fixes applied.

GregLeonhardt commented 2 months ago

It appears that the Wayback Machine server has been overwhelmed by download activity and is actively trying to reduce traffic. I made the following modifications to wayback_machine_downloader to slow it down, which significantly reduces but does not eliminate the problem. So that every page still gets downloaded, I also added a retry for the few connection-refused errors that still occur. I suspect that slowing it down even more would eliminate the errors entirely, but this is a compromise between speed and playing nice.

First, locate the Ruby file by running the following command: gem env

The source file "wayback_machine_downloader.rb" should be located in one of the GEM PATHS.
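
If scanning the `gem env` output by hand is tedious, the same lookup can be sketched in Ruby. `find_gem_source` is a hypothetical helper, and the `gems/<name>-<version>/lib/` layout it searches is an assumption based on the paths in the stack trace at the top of this issue.

```ruby
require "rubygems"

# Hypothetical helper: search each gem path for an installed copy of a file
# under gems/<gem-name-version>/lib/. Gem.path lists the same directories
# that `gem env` prints under GEM PATHS.
def find_gem_source(filename, gem_paths = Gem.path)
  gem_paths.flat_map do |root|
    Dir.glob(File.join(root, "gems", "*", "lib", filename))
  end
end

# Prints one line per installed copy (possibly several versions).
puts find_gem_source("wayback_machine_downloader.rb")
```

If more than one path is printed, edit the copy whose version matches the one you actually run.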

With your editor of choice, open wayback_machine_downloader.rb and add the lines marked <<< INSERT:

    unless File.exist? file_path
      begin
        structure_dir_path dir_path
        open(file_path, "wb") do |file|
          begin
            URI("https://web.archive.org/web/#{file_timestamp}id_/#{file_url}").open("Accept-Encoding" => "plain") do |uri|
              file.write(uri.read)
            end
          rescue OpenURI::HTTPError => e
            puts "(1) - #{file_url} # #{e}"
            if @all
              file.write(e.io.read)
              puts "(2) - #{file_path} saved anyway."
            end
          rescue StandardError => e
            puts "(3) - #{file_url} # #{e}"
            sleep(30)  # <<< INSERT: wait before retrying a failed download
            retry      # <<< INSERT: re-run the enclosing begin block (no retry limit)
          end
        end
      rescue StandardError => e
        puts "(4) - #{file_url} # #{e}"
      ensure
        if not @all and File.exist?(file_path) and File.size(file_path) == 0
          File.delete(file_path)
          puts "(5) - #{file_path} was empty and was removed."
        end
      end
      semaphore.synchronize do
        @processed_file_count += 1
        puts "(6) - #{file_url} -> #{file_path} (#{@processed_file_count}/#{file_list_by_timestamp.size})"
      end
      sleep(2)  # <<< INSERT: pause between downloads to reduce load on the server
    else
      semaphore.synchronize do
        @processed_file_count += 1
        puts "(7) - #{file_url} # #{file_path} already exists. (#{@processed_file_count}/#{file_list_by_timestamp.size})"
      end
    end
  end
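
Note that the bare `retry` above loops forever if the server never accepts the connection. A bounded variant could look like the sketch below. `with_retries` is a hypothetical helper, not part of wayback_machine_downloader; the limit of 5 attempts and the linearly growing delay are assumptions, not values tested against the Wayback Machine.

```ruby
# Hypothetical helper, not part of wayback_machine_downloader: runs the block,
# retrying on connection-refused errors up to max_attempts times in total.
# The wait grows linearly (base_delay, 2 * base_delay, ...). `sleeper` is
# injectable so the waiting can be stubbed out.
def with_retries(max_attempts: 5, base_delay: 30,
                 sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue Errno::ECONNREFUSED => e
    # Give up and re-raise the original error once the limit is reached.
    raise if attempt >= max_attempts

    delay = base_delay * attempt
    warn "attempt #{attempt} failed (#{e.message}); retrying in #{delay}s"
    sleeper.call(delay)
    retry
  end
end
```

In the patched method, the `URI(...).open` call would go inside a `with_retries do ... end` block in place of the `sleep(30)` / `retry` pair. After the final failed attempt the error propagates to the outer `rescue StandardError`, so the file is reported as failed instead of blocking the run.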