hartator / wayback-machine-downloader

Download an entire website from the Wayback Machine.

When file exists, the program exits #8

Closed SGudbrandsson closed 8 years ago

SGudbrandsson commented 9 years ago

Hey there,

Nice piece of software!! :)

I found a bug, though. While downloading item 4053, a path that needed to become a folder already existed as a regular file, so the folder could not be created.

Here's the error:

http://REDACTED.com/uncategorized/reflective-thoughts-on-marriage/ -> websites/REDACTED.com/uncategorized/reflective-thoughts-on-marriage/index.html (4052/48177)
http://REDACTED.com/uncategorized/doing-the-important-stuff/ -> websites/REDACTED.com/uncategorized/doing-the-important-stuff/index.html (4053/48177)

File exists - websites/REDACTED.com/www.REDACTED2.com

/usr/lib/ruby/1.9.1/fileutils.rb:1515:in `stat': No such file or directory - File exists - websites/REDACTED.com/www.REDACTED2.com (Errno::ENOENT)
    from /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `block in fu_each_src_dest'
    from /usr/lib/ruby/1.9.1/fileutils.rb:1531:in `fu_each_src_dest0'
    from /usr/lib/ruby/1.9.1/fileutils.rb:1513:in `fu_each_src_dest'
    from /usr/lib/ruby/1.9.1/fileutils.rb:508:in `mv'
    from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.15/lib/wayback_machine_downloader.rb:116:in `rescue in structure_dir_path'
    from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.15/lib/wayback_machine_downloader.rb:109:in `structure_dir_path'
    from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.15/lib/wayback_machine_downloader.rb:83:in `block in download_files'
    from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.15/lib/wayback_machine_downloader.rb:66:in `each'
    from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.15/lib/wayback_machine_downloader.rb:66:in `download_files'
    from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.15/bin/wayback_machine_downloader:27:in `<top (required)>'
    from /usr/local/bin/wayback_machine_downloader:23:in `load'
    from /usr/local/bin/wayback_machine_downloader:23:in `<main>'

Here's an ls of the file:

ubuntu@ip-172-30-0-198:~$ ll websites/REDACTED.com/www.REDACTED2.com
-rw-rw-r-- 1 ubuntu ubuntu 27286 Sep 4 07:44 websites/REDACTED.com/www.REDACTED2.com

hartator commented 9 years ago

That's a bit odd. It should actually turn the duplicate file into a directory and keep the file inside it as index.html.
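A minimal sketch of that intended behaviour, for illustration only. It is not the gem's actual code: `ensure_directory` is a hypothetical helper name, and the `.temp` suffix used while the file is parked is an assumption. Instead of reacting to an exception, it walks the path one component at a time and converts any regular file it finds in the way.

```ruby
require "fileutils"

# Hypothetical helper (not the gem's code): make sure dir_path exists as a
# directory, converting any regular file that blocks it into
# <blocking path>/index.html.
def ensure_directory(dir_path)
  current = nil
  # Walk the path one component at a time, e.g. "a", "a/b", "a/b/c".
  dir_path.split(File::SEPARATOR).each do |part|
    current = current.nil? ? part : File.join(current, part)
    # Skip the empty leading component of an absolute path and anything
    # that is already a directory.
    next if current.empty? || File.directory?(current)

    if File.file?(current)
      # A regular file is in the way: park it, create the directory,
      # then move the file back in as index.html.
      parked = "#{current}.temp"
      FileUtils.mv(current, parked)
      FileUtils.mkdir_p(current)
      FileUtils.mv(parked, File.join(current, "index.html"))
    else
      FileUtils.mkdir_p(current)
    end
  end
  dir_path
end
```

With this sketch, calling `ensure_directory("websites/example.com/page/sub")` when `websites/example.com/page` is a regular file would leave `websites/example.com/page/index.html` holding the old contents and `websites/example.com/page/sub` as an empty directory.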

Do you mind sharing the problematic website with me? This way I can try to replicate it on my own computer.

My email: replace_by_github_username@gmail.com

hartator commented 9 years ago

Do you mind sending me the non-redacted backtrace as well?

Just copy/paste it without any editing; I am not able to reproduce the issue.

SGudbrandsson commented 9 years ago

http://REDACTED.com/uncategorized/this-weeks-goals/ # websites/REDACTED.com/uncategorized/this-weeks-goals/index.html already exists. (4048/48177)
http://www.REDACTED.com/79/affiliate-internet-marketing-campaign-kicks-off-great-bonuses/ # websites/REDACTED.com/79/affiliate-internet-marketing-campaign-kicks-off-great-bonuses/index.html already exists. (4049/48177)
http://www.REDACTED.com/78/at-last-a-bloggers-path-to-making-internet-marketing-money/ # websites/REDACTED.com/78/at-last-a-bloggers-path-to-making-internet-marketing-money/index.html already exists. (4050/48177)
http://www.REDACTED.com/72/ewen-chia-the-internet-marketing-and-affiliate-marketing-guru/ # websites/REDACTED.com/72/ewen-chia-the-internet-marketing-and-affiliate-marketing-guru/index.html already exists. (4051/48177)
http://REDACTED.com/uncategorized/reflective-thoughts-on-marriage/ # websites/REDACTED.com/uncategorized/reflective-thoughts-on-marriage/index.html already exists. (4052/48177)
http://REDACTED.com/uncategorized/doing-the-important-stuff/ # websites/REDACTED.com/uncategorized/doing-the-important-stuff/index.html already exists. (4053/48177)

File exists - websites/REDACTED.com/www.REDACTED2.com

/usr/lib/ruby/1.9.1/fileutils.rb:1515:in `stat': No such file or directory

I tried restarting the process, as you suggested in another thread; however, I got the same output as before.

The server and software information:

ubuntu@ip-172-30-0-198:~$ wayback_machine_downloader -v
0.1.15
ubuntu@ip-172-30-0-198:~$ ruby -v
ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux]
ubuntu@ip-172-30-0-198:~$ uname -a
Linux ip-172-30-0-198 3.13.0-48-generic #80-Ubuntu SMP Thu Mar 12 11:16:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@ip-172-30-0-198:~$ cat /etc/
Display all 178 possibilities? (y or n)
ubuntu@ip-172-30-0-198:~$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.2 LTS"

All the best, Siggy


SGudbrandsson commented 9 years ago

I managed to work around the previous error by creating the folder by hand; however, I hit another bug when the download reached robots.txt:

http://www.REDACTED.com/tag/barack-obama/ # websites/REDACTED.com/tag/barack-obama/index.html already exists. (9743/48177)

File exists - websites/REDACTED.com/robots.txt

/usr/lib/ruby/1.9.1/fileutils.rb:1515:in `stat': No such file or directory


SGudbrandsson commented 9 years ago

Found the offending code and fixed it, at least in my case (you might have to add some if/then statements to parse the input string correctly): https://github.com/hartator/wayback-machine-downloader/pull/11
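For context, the backtrace earlier in the thread shows `mv` being handed the string `File exists - websites/...`, that is, the whole exception message rather than just the path. One way to make that parsing more defensive is sketched below. It is not the code from the pull request: `path_from_eexist` and `EEXIST_PATH` are hypothetical names, and the second message format (with an `@ dir_s_mkdir` tag, as printed by newer Ruby versions) is included as an assumption about what else might need handling.

```ruby
# Hypothetical pattern: matches "File exists - <path>" and, as an assumed
# second format, "File exists @ <tag> - <path>".
EEXIST_PATH = /\AFile exists(?: @ \S+)? - (.+)\z/m

# Hypothetical helper: return the path named in an EEXIST-style message,
# or nil when the message has an unexpected shape, so the caller can
# re-raise instead of passing a bogus string to FileUtils.mv.
def path_from_eexist(message)
  match = EEXIST_PATH.match(message)
  match && match[1]
end
```

Returning `nil` for anything unrecognised means an unexpected message format surfaces as the original error rather than as a confusing `No such file or directory` from a later `mv`.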