mirror / wget

Wget Git mirror
GNU General Public License v3.0
387 stars 131 forks source link

Add option to download inline links directly after their parent page when downloading recursively #2

Open john-peterson opened 9 years ago

john-peterson commented 9 years ago

because it's more likely to download temporary links before they expire because it's more similar to the browsing experience

difference from browsing experience cause problem

the problem is described in #1

discarded LIFO solution

a convoluted LIFO solution with a similar result (the difference is that LIFO result in a bottom-to-top download order) is in #1

non-inline links aren't prepended

ATTR_HTML links are never prepended and might contain non-html files that aren't downloaded directly after its parent page even with browser queue type

if these cause a problem an option to read the header Content-Type of the link before enqueuing could be added

test

testenv

this check the link order

./Test--spider-r-browser.py

Test Passed.

changing browser to fifo in Test--spider-r-browser.py fail the test

./Test--spider-r-browser.py

Expected: [['HEAD /', 'GET /', 'GET /robots.txt', 'HEAD /image.svg', 'HEAD /secondpage.html', 'GET /secondpage.html', 'HEAD /nonexistent', 'HEAD /thirdpage.html', 'GET /thirdpage.html', 'HEAD /dummy.txt', 'HEAD /againnonexistent']]
Got: [['HEAD /', 'GET /', 'GET /robots.txt', 'HEAD /secondpage.html', 'GET /secondpage.html', 'HEAD /image.svg', 'HEAD /nonexistent', 'HEAD /thirdpage.html', 'GET /thirdpage.html', 'HEAD /dummy.txt', 'HEAD /againnonexistent']]
Error: Not all files were crawled correctly..

download order

this show the download order for the test page in https://github.com/mirror/wget/pull/1 /test

the current code sometimes download links long after its parent page (see https://github.com/mirror/wget/pull/1 /test)

this patch download links directly after its parent page

wget -vdrp -nd --queue-type=browser http://localhost/code/html/test/download/i.html 2>&1 | egrep "^Appending|^Prepending|Dequeuing|Saving to"

Appending http://localhost/code/html/test/download/i.html at depth 0
Dequeuing http://localhost/code/html/test/download/i.html at depth 0
Saving to: ‘i.html’
Appending http://localhost/code/html/test/download/a/a.html at depth 1
Prepending http://localhost/code/html/test/download/x.jpg at depth 1
Appending http://localhost/code/html/test/download/b/b.html at depth 1
Prepending http://localhost/code/html/test/download/y.jpg at depth 1
Dequeuing http://localhost/code/html/test/download/y.jpg at depth 1
Saving to: ‘y.jpg’
Dequeuing http://localhost/code/html/test/download/x.jpg at depth 1
Saving to: ‘x.jpg’
Dequeuing http://localhost/code/html/test/download/a/a.html at depth 1
Saving to: ‘a.html’
Appending http://localhost/code/html/test/download/a/a/a-a.html at depth 2
Prepending http://localhost/code/html/test/download/a/a-x.jpg at depth 2
Appending http://localhost/code/html/test/download/a/b/a-b.html at depth 2
Prepending http://localhost/code/html/test/download/a/a-y.jpg at depth 2
Dequeuing http://localhost/code/html/test/download/a/a-y.jpg at depth 2
Saving to: ‘a-y.jpg’
Dequeuing http://localhost/code/html/test/download/a/a-x.jpg at depth 2
Saving to: ‘a-x.jpg’
Dequeuing http://localhost/code/html/test/download/b/b.html at depth 1
Saving to: ‘b.html’
Appending http://localhost/code/html/test/download/b/a/b-a.html at depth 2
Prepending http://localhost/code/html/test/download/b/b-x.jpg at depth 2
Appending http://localhost/code/html/test/download/b/b/b-b.html at depth 2
Prepending http://localhost/code/html/test/download/b/b-y.jpg at depth 2
Dequeuing http://localhost/code/html/test/download/b/b-y.jpg at depth 2
Saving to: ‘b-y.jpg’
Dequeuing http://localhost/code/html/test/download/b/b-x.jpg at depth 2
Saving to: ‘b-x.jpg’
Dequeuing http://localhost/code/html/test/download/a/a/a-a.html at depth 2
Saving to: ‘a-a.html’
Prepending http://localhost/code/html/test/download/a/a/a-a-x.jpg at depth 3
Prepending http://localhost/code/html/test/download/a/a/a-a-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/a/a/a-a-y.jpg at depth 3
Saving to: ‘a-a-y.jpg’
Dequeuing http://localhost/code/html/test/download/a/a/a-a-x.jpg at depth 3
Saving to: ‘a-a-x.jpg’
Dequeuing http://localhost/code/html/test/download/a/b/a-b.html at depth 2
Saving to: ‘a-b.html’
Prepending http://localhost/code/html/test/download/a/b/a-b-x.jpg at depth 3
Prepending http://localhost/code/html/test/download/a/b/a-b-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/a/b/a-b-y.jpg at depth 3
Saving to: ‘a-b-y.jpg’
Dequeuing http://localhost/code/html/test/download/a/b/a-b-x.jpg at depth 3
Saving to: ‘a-b-x.jpg’
Dequeuing http://localhost/code/html/test/download/b/a/b-a.html at depth 2
Saving to: ‘b-a.html’
Prepending http://localhost/code/html/test/download/b/a/b-a-x.jpg at depth 3
Prepending http://localhost/code/html/test/download/b/a/b-a-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/b/a/b-a-y.jpg at depth 3
Saving to: ‘b-a-y.jpg’
Dequeuing http://localhost/code/html/test/download/b/a/b-a-x.jpg at depth 3
Saving to: ‘b-a-x.jpg’
Dequeuing http://localhost/code/html/test/download/b/b/b-b.html at depth 2
Saving to: ‘b-b.html’
Prepending http://localhost/code/html/test/download/b/b/b-b-x.jpg at depth 3
Prepending http://localhost/code/html/test/download/b/b/b-b-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/b/b/b-b-y.jpg at depth 3
Saving to: ‘b-b-y.jpg’
Dequeuing http://localhost/code/html/test/download/b/b/b-b-x.jpg at depth 3
Saving to: ‘b-b-x.jpg’
john-peterson commented 9 years ago

prepend non-HTML links

I suggest
    while (1)
        url_dequeue

        if (descend)
            for (; child; child = child->next)
                if child is HTML
                  url_enqueue_append
                else
                  url_enqueue_prepend

k i get it. thats a better solution. i pushed it to https://github.com/mirror/wget/pull/2

bandwidth

BTW, since you are talking about timing, people with different bandwidths will/might experience different results when viewing such a page. I use mobile internet with 2G/3G <= 64kbit/s. I wonder how that works ;-)

low bandwidth makes this patch more important because low bandwidth increase the time between parent and child download

and most pages should support low bandwidth browsing. f.e. imagevenue probably have a long enough timeout to allow img.php to load its image with low bandwidth too

john-peterson commented 9 years ago

git patch

in order to add your patch to wget, could you please send the complete patch in git's format-patch format ?

its in https://github.com/mirror/wget/pull/2.patch

john-peterson commented 9 years ago

make it optional

What about the option to switch the behaviour + docs.

i can add that. what do u wanna call the option?

john-peterson commented 9 years ago

make it optional

you already have that in your 'recurse' branch. Take that, undo the changes to src/recur.c and take your changes from 'recurse2' branch.

k it's done https://github.com/mirror/wget/pull/2/files

debug message

you would also have to add the relevant debug statements to ensure that when we look at a debug log, we know exactly what is happening. The current DEBUGP message will have to be changed to reflect the new working.

i changed the DEBUGP that say Enqueuing to Appending/Prepending to tell whats happening

test

I would like to see a test case that shows how the two differ.

there's a test in https://github.com/mirror/wget/pull/2#issue-53548650 /test that compare them

i release the copyright

You will be required to sign the copyright assignment agreement with the FSF before we can merge this patch into our codebase.

sounds like a lot of work. i release the copyright. i dont care

john-peterson commented 9 years ago

git patch

Please share the patch directly with us, in git format-patch style.

its in https://github.com/mirror/wget/pull/2.patch

testenv

No, I meant a live test for the Wget Test Suite. You'll find the test suite in the testenv/ directory. We need to add a test that ensures that the Wget downloads files in the correct order as expected.

i added a test that fail unless the browser queue order is used

john-peterson commented 9 years ago

wget it

Please attach the patch to the email you send to this list. NOT a link to the patch. Not all of us use a browser for looking at emails.

u can download it with wget

wget https://github.com/mirror/wget/pull/2.patch
john-peterson commented 9 years ago

attach it

Still not going to review the patch. Turns out, I don't know how to use the command you gave me. All I do know is, if the patch were attached to the email, I'd be able to read it.

k i attached it