rockdaboot / wget2

The successor of GNU Wget. Contributions preferred at https://gitlab.com/gnuwget/wget2. But accepted here as well 😍
GNU Lesser General Public License v3.0

Problems downloading Wiki #208

Closed: frankenstein91 closed this issue 4 years ago

frankenstein91 commented 4 years ago

Today I wanted to download a wiki site for my father to use offline. wget2 --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains gramps-project.org --no-parent -e robots=off "https://gramps-project.org/wiki/index.php/De:Handbuch"

Unfortunately the download stops after a short time. The regular wget works through the website more reliably. My version:

GNU Wget2 1.99.2 - multithreaded metalink/file/website downloader

+digest +https +ssl/gnutls +ipv6 +iri +large-file +nls -ntlm -opie +psl -hsts
+iconv +idn2 +zlib +lzma +brotlidec +zstd +bzip2 -lzip +http2 +gpgme

Copyright (C) 2012-2015 Tim Ruehsen
Copyright (C) 2015-2019 Free Software Foundation, Inc.
rockdaboot commented 4 years ago

It looks like --html-extension doesn't work together with --convert-links. I will open an issue at https://gitlab.com/gnuwget/wget2/issues. Could you leave out --html-extension and try again?
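That is, the same command as before, just without --html-extension:

wget2 --recursive --no-clobber --page-requisites --convert-links --restrict-file-names=windows --domains gramps-project.org --no-parent -e robots=off "https://gramps-project.org/wiki/index.php/De:Handbuch"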

rockdaboot commented 4 years ago

Looks like there already is one: https://gitlab.com/gnuwget/wget2/issues/423

frankenstein91 commented 4 years ago

I do not use GitLab. Thank you.

frankenstein91 commented 4 years ago

Hello all,

I have tested the command

wget2 --recursive --no-clobber --page-requisites --convert-links --restrict-file-names=windows --domains gramps-project.org --no-parent -e robots=off https://gramps-project.org/wiki/

I have the same problem: not all pages are downloaded. Most things are missing:

gramps-project.org
├── gramps.jpg
└── wiki
    ├── images
    │   ├── 1
    │   │   └── 16
    │   │       └── Gramps-config-language.png
    │   ├── 9
    │   │   └── 9e
    │   │       └── Gramps-release.png
    │   ├── b
    │   │   └── b1
    │   │       └── Gramps-genea.png
    │   └── d
    │       └── d7
    │           └── Gramps-logo.png
    ├── index.html
    ├── load.php%3Fdebug=false&lang=en&modules=mediawiki.legacy.commonPrint,shared%7Cmediawiki.sectionAnchor%7Cmediawiki.skinning.interface%7Cskins.vector.styles&only=styles&skin=vector
    ├── load.php%3Fdebug=false&lang=en&modules=site.styles&only=styles&skin=vector
    ├── load.php%3Fdebug=false&lang=en&modules=startup&only=scripts&skin=vector
    ├── resources
    │   ├── assets
    │   │   ├── licenses
    │   │   │   └── gnu-fdl.png
    │   │   ├── poweredby_mediawiki_132x47.png
    │   │   ├── poweredby_mediawiki_176x62.png
    │   │   └── poweredby_mediawiki_88x31.png
    │   └── src
    │       ├── mediawiki.legacy
    │       │   └── images
    │       │       ├── ajax-loader.gif%3F57f34
    │       │       └── spinner.gif%3Fca65b
    │       └── mediawiki.skinning
    │           └── images
    │               ├── magnify-clip-ltr.png%3F4f704
    │               └── magnify-clip-rtl.png%3Fa9fb3
    └── skins
        └── Vector
            └── images
                ├── arrow-down.png%3F42edd
                ├── bullet-icon.png%3Fe31f8
                ├── external-link-ltr-icon.png%3F325de
                ├── page-fade.png%3F1d168
                ├── portal-break.png%3F3ea1b
                ├── search-fade.png%3F50f7b
                ├── search-ltr.png%3F39f97
                ├── tab-break.png%3F09d4b
                ├── tab-current-fade.png%3F22887
                ├── tab-normal-fade.png%3F1cc52
                ├── unwatch-icon-hl.png%3Fc4723
                ├── unwatch-icon.png%3Ffccbe
                ├── user-icon.png%3F13155
                ├── watch-icon-hl.png%3Ff4c7e
                ├── watch-icon-loading.png%3F5cb92
                └── watch-icon.png%3Fe1b42

21 directories, 33 files

The regular wget downloads more than 10,000 files.

rockdaboot commented 4 years ago

Just pushed two fixes to Wget2.

But there are also some peculiarities with that website... HTTP/2 returns response code 500 after a while (looks like a bug in the server-side throttling), so use --no-http2. The site also advertises Upgrade: headers, which caused a slowdown due to a bug in wget2; that's fixed as well.

I am currently at ~6,500 files, still downloading. I'll let you know how it works out.

frankenstein91 commented 4 years ago

Currently running wget2 --recursive --no-clobber --page-requisites --convert-links --restrict-file-names=windows --no-http2 --domains gramps-project.org --no-parent -e robots=off https://gramps-project.org/wiki/

Looks good so far. Do you think we will get --html-extension working too?

frankenstein91 commented 4 years ago

Currently the software appears to be stuck.

rockdaboot commented 4 years ago

Sure, but not as fast, I guess. Your command was a good test - I noticed several small issues that I would like to track down first.

rockdaboot commented 4 years ago

Currently the software appears to be stuck.

I saw that once, but couldn't reproduce it (I do have a backtrace from gdb, though - wget2 waits for a lock). There is possibly an issue with --convert-links that might clobber or change a mutex/lock. Or it's a gnulib regression - they changed a lot of the multi-threading wrappers recently. I'll have a closer look tomorrow.

frankenstein91 commented 4 years ago

I had to kill -9 it the hard way. Ctrl+C was not working for me.

darnir commented 4 years ago

That's a known issue in some cases when you have a flaky connection. The main thread sets a flag for all the downloaders to terminate, but then it has to wait until the last downloader thread returns. And one of them could be stuck for minutes before it times out.

frankenstein91 commented 4 years ago

I tried it again over another, faster connection. Unfortunately it gets stuck on the same file.

https://gramps-project.org/wiki/index.php?title=What_to_do_for_a_release&mobileaction=toggle_view_desktop

and i see a lot of errors like Cannot resolve URI 'data:image/jpeg;base64,/9j/4AAQSkZJRgABAgAAZABkAAD/7AARRHVja3kAAQAEAAAAVQAA/+4ADkFkb2JlAGTAAAAAAf/bAIQAAgEBAQEBAgEBAgMCAQIDAwICAgIDAwMDAwMDAwUDBAQEBAMFBQUGBgYFBQcHCAgHBwoKCgoKDAwMDAwMDAwMDAECAgIEAwQHBQUHCggHCAoMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwM/8AAEQgAiAAFAwERAAIRAQMRAf/EAIcAAAIBBQAAAAAAAAAAAAAAAAMECAABBQYHAQEAAQQDAAAAAAAAAAAAAAADAgABBAUGCAkQAAEBAwYMBwAAAAAAAAAAAAATAWESETFRAnIVIXGR0fEiMlJiIwUY1KUGFlYHCBEBAAIAAwcFAAAAAAAAAAAAABESAQMTUZFSotJUF6PTFAUG/9oADAMBAAIRAxEAPwCaUDDzyh3JkZMlA7GEXEoDJtEuOxpJxOGPYykJUFh0hoFYdMWopHTFgciQMEhCUDe9P9MfJfL+leHPQfwf+X7T1M73HTnyn993HJl9Cf8A23/REUXsf0/FtS3N06C0zkzOpyHK/lZnFjvcS0cNmDqlycL5JNW0zNoNfdsNJtV01d182DGYd2x02YTqUAyylyylFKf/2Q=='

rockdaboot commented 4 years ago

These are not really errors; they are embedded images. The messages just say that wget2 found a URI (data:...) but can't do anything with it. It's the same with embedded email addresses like mailto:....

One issue that I spotted in the logs is when wget2 has, for example, downloaded foo/bar and then finds and downloads a file named foo. Since foo already exists as a directory name, we would normally save the file as foo.1. But --no-clobber prevents this (there is an error), and wget2 gets out of sync with the next server response (if the connection is kept alive). Getting stuck might be a symptom of this or of something else.
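Roughly, the clash on the filesystem level looks like this (just a shell illustration of the scenario above, not actual wget2 output):

mkdir foo                 # directory created when foo/bar was downloaded
echo page > foo/bar
echo page > foo           # fails: "foo" is a directory
# wget2 would normally save this second document as foo.1 instead,
# but --no-clobber forbids that, so the download errors out and the
# kept-alive connection gets out of sync with the next response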

I just caught (another) flu, so I'm not sure when I'll be able to work on this.

rockdaboot commented 4 years ago

@frankenstein91 --html-extension should now work too. The website has an issue with HTTP/2 (some kind of weird throttling), so don't forget to use --no-http2.
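That is, with a current build the command from the start of this issue should work roughly like this (--html-extension kept, http/2 disabled):

wget2 --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --no-http2 --domains gramps-project.org --no-parent -e robots=off "https://gramps-project.org/wiki/index.php/De:Handbuch"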