mirror / wget

Wget Git mirror

Add LIFO queue option for recursive download #1

Closed. john-peterson closed this issue 9 years ago

john-peterson commented 11 years ago

basic problem

The basic problem is that the FIFO queue can create a long delay between downloading a page and downloading its links. This differs from the browser experience the page is designed for, and it results in wget failures that a browser user doesn't experience.

savannah link

This patch is also posted at https://savannah.gnu.org/bugs/?37581

making it optional

To get your patch into git please add a command-line option to activate LIFO behavior.

OK, the patch has been updated here: https://github.com/mirror/wget/pull/1

The patch file is https://github.com/mirror/wget/pull/1.patch

reason to place html pages at the top of the queue

If ll_bubblesort isn't used, only the deepest-level links are downloaded directly after their parent page, despite using LIFO.
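
For illustration, here is a rough editorial sketch, not the actual ll_bubblesort from the patch, of a sort over the child list that moves the links expected to be HTML to the front; the struct link type is a made-up stand-in for wget's urlpos list:

struct link
{
  const char *url;
  int expect_html;              /* set when the link is expected to be an HTML page */
  struct link *next;
};

/* Bubble-sort the list in place so that HTML links come first.  With a LIFO
   queue they are then pushed first and end up below the non-HTML links, so a
   page's images are popped (downloaded) right after the page itself, before
   the next HTML page is descended into.  */
static void
sort_html_first (struct link **list)
{
  int swapped = 1;
  while (swapped)
    {
      swapped = 0;
      for (struct link **p = list; *p && (*p)->next; p = &(*p)->next)
        {
          struct link *a = *p, *b = a->next;
          if (!a->expect_html && b->expect_html)
            {
              a->next = b->next;        /* relink so b precedes a */
              b->next = a;
              *p = b;
              swapped = 1;
            }
        }
    }
}

Without such a pass the children are pushed in document order, so only the deepest pages' images end up on top of the stack; the images of intermediate pages get buried under the links of deeper pages and are downloaded much later.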

alternative solution

Enqueuing a child directly after its parent seems difficult.

Another solution is to enqueue the depth n+1 links directly after enqueuing their parent depth-n link, instead of continuing to enqueue depth-n links.

This requires interrupting the depth-n enqueue at HTML links, dequeuing everything (including the HTML link), enqueuing the depth n+1 links, and then continuing the depth-n enqueue. That either requires a big reorganization or doesn't make sense.

A way to do this could be to store the not-yet-enqueued links in a temporary queue and enqueue them after everything else.

The LIFO solution is better than this one because keeping FIFO and enqueuing the HTML links last (with a sort) doesn't solve the problem: all depth-n links are still downloaded before any depth n+1 links.
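
To make that concrete, here is a small editorial simulation (not from the thread) of "FIFO with HTML links enqueued after non-HTML links" on a miniature two-level site; the page names and the is_html helper are made up for illustration and only loosely mirror the local test site shown later in this issue:

#include <stdio.h>
#include <string.h>

struct page { const char *name; const char *children[8]; };

static const struct page site[] = {
  { "i.html",   { "a.html", "x.jpg", "b.html", "y.jpg", NULL } },
  { "a.html",   { "a-a.html", "a-x.jpg", NULL } },
  { "b.html",   { "b-a.html", "b-x.jpg", NULL } },
  { "a-a.html", { "a-a-x.jpg", NULL } },
  { "b-a.html", { "b-a-x.jpg", NULL } },
};

static int is_html (const char *n) { return strstr (n, ".html") != NULL; }

static const struct page *
find (const char *n)
{
  for (size_t i = 0; i < sizeof site / sizeof site[0]; i++)
    if (strcmp (site[i].name, n) == 0)
      return &site[i];
  return NULL;
}

int
main (void)
{
  const char *queue[64];
  int head = 0, tail = 0;

  queue[tail++] = "i.html";
  while (head < tail)
    {
      const char *url = queue[head++];          /* FIFO: take from the front */
      printf ("download %s\n", url);

      const struct page *p = find (url);
      if (!p)
        continue;

      /* enqueue non-HTML children first, HTML children last */
      for (int pass = 0; pass < 2; pass++)
        for (int i = 0; p->children[i]; i++)
          if (is_html (p->children[i]) == pass)
            queue[tail++] = p->children[i];
    }
  return 0;
}

The printed order is i.html, x.jpg, y.jpg, a.html, b.html, a-x.jpg, a-a.html, b-x.jpg, b-a.html, a-a-x.jpg, b-a-x.jpg: every depth-1 link comes before any depth-2 link, so a-a.html's (possibly time-limited) image is only fetched near the end, long after a-a.html itself.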

test case description

I am not sure why you expect that all the resources from 60 "branches" can be downloaded in less than 60s when the "branches" itself can't.

I don't mean that all resources can be downloaded fast, only that they are downloaded directly after the page that contains them.

The example is an image-hosting site (imagevenue.com) where every image has its own HTML page (imagevenue.com/img.php) containing a generated image link that expires a while after the HTML page is generated, to prevent direct links to the image files.

All links can be downloaded with LIFO because each branch page has only one link in this example, and there's more than enough time to download that one link if the download begins directly after the link is generated.

If a branch page (e.g. imagevenue.com/img.php) had many images (links), there could still be a problem, but the problem would be the same for regular users (browsers) who download the resources directly after the page is loaded, so the fault would be the site's rather than wget's.

test

imagevenue fail

This fails to download the imagevenue.com/img.php images because it downloads all the img.php pages before the temporary image links in them, and by the time it gets to the images they have expired:

wget -rHpE -l1 -t2 -T10 -np -nc -nH -nd -e robots=off -D'imagevenue.com' -R'th_*.jpg,th_*.JPG,.gif,.png,.css,.js' http://forum.glam0ur.com/hot-babe-galleries/11956-merilyn-sekova-aka-busty-merilyn.html

This downloads the images directly after an img.php page is downloaded, so they don't have time to expire:

wget -rHpE -l1 -t2 -T10 -np -nc -nH -nd --queue-type=lifo -e robots=off -D'imagevenue.com' -R'th_*.jpg,th_*.JPG,.gif,.png,.css,.js' http://forum.glam0ur.com/hot-babe-galleries/11956-merilyn-sekova-aka-busty-merilyn.html

invalid input

Invalid input is rejected:

wget --queue-type=fiffo

wget: --queue-type: Invalid value ‘fiffo’.
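
For illustration, here is a minimal, self-contained editorial sketch of this kind of table-driven value check; it is not the actual cmd_spec_queue_type from the patch, although the choice names follow the diff quoted further down:

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

enum queue_type { queue_type_fifo, queue_type_lifo };

struct decode_item { const char *name; int code; };

static const struct decode_item choices[] = {
  { "fifo", queue_type_fifo },
  { "lifo", queue_type_lifo },
};

/* Return true and set *result if VAL matches one of the accepted values.  */
static bool
decode_queue_type (const char *val, int *result)
{
  for (size_t i = 0; i < sizeof choices / sizeof choices[0]; i++)
    if (strcmp (val, choices[i].name) == 0)
      {
        *result = choices[i].code;
        return true;
      }
  return false;
}

int
main (int argc, char **argv)
{
  const char *val = argc > 1 ? argv[1] : "fiffo";
  int type;

  if (!decode_queue_type (val, &type))
    {
      fprintf (stderr, "wget: --queue-type: Invalid value '%s'.\n", val);
      return 1;
    }
  printf ("queue type: %s\n", type == queue_type_fifo ? "fifo" : "lifo");
  return 0;
}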

download order

This test shows the FIFO and LIFO download order.

I created this local site:

$ tree
.
├── a
│   ├── a
│   │   ├── a-a-x.jpg
│   │   ├── a-a-y.jpg
│   │   └── a-a.html
│   ├── a-x.jpg
│   ├── a-y.jpg
│   ├── a.html
│   └── b
│       ├── a-b-x.jpg
│       ├── a-b-y.jpg
│       └── a-b.html
├── b
│   ├── a
│   │   ├── b-a-x.jpg
│   │   ├── b-a-y.jpg
│   │   └── b-a.html
│   ├── b
│   │   ├── b-b-x.jpg
│   │   ├── b-b-y.jpg
│   │   └── b-b.html
│   ├── b-x.jpg
│   ├── b-y.jpg
│   └── b.html
├── i.html
├── x.jpg
└── y.jpg

6 directories, 21 files

i.html

<a href="a/a.html"><img src="x.jpg"></a>
<a href="b/b.html"><img src="y.jpg"></a>

a.html

<a href="a/a-a.html"><img src="a-x.jpg"></a>
<a href="b/a-b.html"><img src="a-y.jpg"></a>

a-a.html

<img src="a-a-x.jpg">
<img src="a-a-y.jpg">

a-b.html

<img src="a-b-x.jpg">
<img src="a-b-y.jpg">

b.html

<a href="a/b-a.html"><img src="b-x.jpg"></a>
<a href="b/b-b.html"><img src="b-y.jpg"></a>

b-a.html

<img src="b-a-x.jpg">
<img src="b-a-y.jpg">

b-b.html

<img src="b-b-x.jpg">
<img src="b-b-y.jpg">

FIFO downloads links long after their parent page, especially the deepest-level links:

wget -vdrp -nd http://localhost/code/html/test/download/i.html 2>&1 | egrep "^Enqueuing|Dequeuing|Saving to"

Enqueuing http://localhost/code/html/test/download/i.html at depth 0
Dequeuing http://localhost/code/html/test/download/i.html at depth 0
Saving to: ‘i.html’
Enqueuing http://localhost/code/html/test/download/a/a.html at depth 1
Enqueuing http://localhost/code/html/test/download/x.jpg at depth 1
Enqueuing http://localhost/code/html/test/download/b/b.html at depth 1
Enqueuing http://localhost/code/html/test/download/y.jpg at depth 1
Dequeuing http://localhost/code/html/test/download/a/a.html at depth 1
Saving to: ‘a.html’
Enqueuing http://localhost/code/html/test/download/a/a/a-a.html at depth 2
Enqueuing http://localhost/code/html/test/download/a/a-x.jpg at depth 2
Enqueuing http://localhost/code/html/test/download/a/b/a-b.html at depth 2
Enqueuing http://localhost/code/html/test/download/a/a-y.jpg at depth 2
Dequeuing http://localhost/code/html/test/download/x.jpg at depth 1
Saving to: ‘x.jpg’
Dequeuing http://localhost/code/html/test/download/b/b.html at depth 1
Saving to: ‘b.html’
Enqueuing http://localhost/code/html/test/download/b/a/b-a.html at depth 2
Enqueuing http://localhost/code/html/test/download/b/b-x.jpg at depth 2
Enqueuing http://localhost/code/html/test/download/b/b/b-b.html at depth 2
Enqueuing http://localhost/code/html/test/download/b/b-y.jpg at depth 2
Dequeuing http://localhost/code/html/test/download/y.jpg at depth 1
Saving to: ‘y.jpg’
Dequeuing http://localhost/code/html/test/download/a/a/a-a.html at depth 2
Saving to: ‘a-a.html’
Enqueuing http://localhost/code/html/test/download/a/a/a-a-x.jpg at depth 3
Enqueuing http://localhost/code/html/test/download/a/a/a-a-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/a/a-x.jpg at depth 2
Saving to: ‘a-x.jpg’
Dequeuing http://localhost/code/html/test/download/a/b/a-b.html at depth 2
Saving to: ‘a-b.html’
Enqueuing http://localhost/code/html/test/download/a/b/a-b-x.jpg at depth 3
Enqueuing http://localhost/code/html/test/download/a/b/a-b-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/a/a-y.jpg at depth 2
Saving to: ‘a-y.jpg’
Dequeuing http://localhost/code/html/test/download/b/a/b-a.html at depth 2
Saving to: ‘b-a.html’
Enqueuing http://localhost/code/html/test/download/b/a/b-a-x.jpg at depth 3
Enqueuing http://localhost/code/html/test/download/b/a/b-a-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/b/b-x.jpg at depth 2
Saving to: ‘b-x.jpg’
Dequeuing http://localhost/code/html/test/download/b/b/b-b.html at depth 2
Saving to: ‘b-b.html’
Enqueuing http://localhost/code/html/test/download/b/b/b-b-x.jpg at depth 3
Enqueuing http://localhost/code/html/test/download/b/b/b-b-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/b/b-y.jpg at depth 2
Saving to: ‘b-y.jpg’
Dequeuing http://localhost/code/html/test/download/a/a/a-a-x.jpg at depth 3
Saving to: ‘a-a-x.jpg’
Dequeuing http://localhost/code/html/test/download/a/a/a-a-y.jpg at depth 3
Saving to: ‘a-a-y.jpg’
Dequeuing http://localhost/code/html/test/download/a/b/a-b-x.jpg at depth 3
Saving to: ‘a-b-x.jpg’
Dequeuing http://localhost/code/html/test/download/a/b/a-b-y.jpg at depth 3
Saving to: ‘a-b-y.jpg’
Dequeuing http://localhost/code/html/test/download/b/a/b-a-x.jpg at depth 3
Saving to: ‘b-a-x.jpg’
Dequeuing http://localhost/code/html/test/download/b/a/b-a-y.jpg at depth 3
Saving to: ‘b-a-y.jpg’
Dequeuing http://localhost/code/html/test/download/b/b/b-b-x.jpg at depth 3
Saving to: ‘b-b-x.jpg’
Dequeuing http://localhost/code/html/test/download/b/b/b-b-y.jpg at depth 3
Saving to: ‘b-b-y.jpg’

LIFO downloads links directly after their parent page:

wget -vdrp -nd --queue-type=lifo http://localhost/code/html/test/download/i.html 2>&1 | egrep "^Enqueuing|Dequeuing|Saving to"

Enqueuing http://localhost/code/html/test/download/i.html at depth 0
Dequeuing http://localhost/code/html/test/download/i.html at depth 0
Saving to: ‘i.html’
Enqueuing http://localhost/code/html/test/download/a/a.html at depth 1
Enqueuing http://localhost/code/html/test/download/b/b.html at depth 1
Enqueuing http://localhost/code/html/test/download/x.jpg at depth 1
Enqueuing http://localhost/code/html/test/download/y.jpg at depth 1
Dequeuing http://localhost/code/html/test/download/y.jpg at depth 1
Saving to: ‘y.jpg’
Dequeuing http://localhost/code/html/test/download/x.jpg at depth 1
Saving to: ‘x.jpg’
Dequeuing http://localhost/code/html/test/download/b/b.html at depth 1
Saving to: ‘b.html’
Enqueuing http://localhost/code/html/test/download/b/a/b-a.html at depth 2
Enqueuing http://localhost/code/html/test/download/b/b/b-b.html at depth 2
Enqueuing http://localhost/code/html/test/download/b/b-x.jpg at depth 2
Enqueuing http://localhost/code/html/test/download/b/b-y.jpg at depth 2
Dequeuing http://localhost/code/html/test/download/b/b-y.jpg at depth 2
Saving to: ‘b-y.jpg’
Dequeuing http://localhost/code/html/test/download/b/b-x.jpg at depth 2
Saving to: ‘b-x.jpg’
Dequeuing http://localhost/code/html/test/download/b/b/b-b.html at depth 2
Saving to: ‘b-b.html’
Enqueuing http://localhost/code/html/test/download/b/b/b-b-x.jpg at depth 3
Enqueuing http://localhost/code/html/test/download/b/b/b-b-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/b/b/b-b-y.jpg at depth 3
Saving to: ‘b-b-y.jpg’
Dequeuing http://localhost/code/html/test/download/b/b/b-b-x.jpg at depth 3
Saving to: ‘b-b-x.jpg’
Dequeuing http://localhost/code/html/test/download/b/a/b-a.html at depth 2
Saving to: ‘b-a.html’
Enqueuing http://localhost/code/html/test/download/b/a/b-a-x.jpg at depth 3
Enqueuing http://localhost/code/html/test/download/b/a/b-a-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/b/a/b-a-y.jpg at depth 3
Saving to: ‘b-a-y.jpg’
Dequeuing http://localhost/code/html/test/download/b/a/b-a-x.jpg at depth 3
Saving to: ‘b-a-x.jpg’
Dequeuing http://localhost/code/html/test/download/a/a.html at depth 1
Saving to: ‘a.html’
Enqueuing http://localhost/code/html/test/download/a/a/a-a.html at depth 2
Enqueuing http://localhost/code/html/test/download/a/b/a-b.html at depth 2
Enqueuing http://localhost/code/html/test/download/a/a-x.jpg at depth 2
Enqueuing http://localhost/code/html/test/download/a/a-y.jpg at depth 2
Dequeuing http://localhost/code/html/test/download/a/a-y.jpg at depth 2
Saving to: ‘a-y.jpg’
Dequeuing http://localhost/code/html/test/download/a/a-x.jpg at depth 2
Saving to: ‘a-x.jpg’
Dequeuing http://localhost/code/html/test/download/a/b/a-b.html at depth 2
Saving to: ‘a-b.html’
Enqueuing http://localhost/code/html/test/download/a/b/a-b-x.jpg at depth 3
Enqueuing http://localhost/code/html/test/download/a/b/a-b-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/a/b/a-b-y.jpg at depth 3
Saving to: ‘a-b-y.jpg’
Dequeuing http://localhost/code/html/test/download/a/b/a-b-x.jpg at depth 3
Saving to: ‘a-b-x.jpg’
Dequeuing http://localhost/code/html/test/download/a/a/a-a.html at depth 2
Saving to: ‘a-a.html’
Enqueuing http://localhost/code/html/test/download/a/a/a-a-x.jpg at depth 3
Enqueuing http://localhost/code/html/test/download/a/a/a-a-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/a/a/a-a-y.jpg at depth 3
Saving to: ‘a-a-y.jpg’
Dequeuing http://localhost/code/html/test/download/a/a/a-a-x.jpg at depth 3
Saving to: ‘a-a-x.jpg’
rockdaboot commented 9 years ago

Nice, could you just add these two little changes? (One fixes a warning for me, the other is a show-stopper which prevents generating the docs here.) After amending, could you post your suggestion to the bug-wget@gnu.org mailing list? A short explanation plus a link to this page should be OK. Most people there don't mess with the Savannah bug tracker.

diff --git a/doc/wget.texi b/doc/wget.texi
index 67f74ba..a981fd2 100644
--- a/doc/wget.texi
+++ b/doc/wget.texi
@@ -1916,7 +1916,7 @@ case.
 Turn on recursive retrieving.  @xref{Recursive Download}, for more
 details.  The default maximum depth is 5.
 
-@itemx --queue-type=@var{queuetype}
+@item --queue-type=@var{queuetype}
 Specify the queue type (@pxref{Recursive Download}).  Accepted values
 are @samp{fifo} (the default) and @samp{lifo}.

diff --git a/src/init.c b/src/init.c
index cd17f98..71b1203 100644
--- a/src/init.c
+++ b/src/init.c
@@ -1448,7 +1448,7 @@ cmd_spec_recursive (const char *com, const char *val, void *place_ignored _GL_UNUSED)
 /* Validate --queue-type and set the choice.  */
 
 static bool
-cmd_spec_queue_type (const char *com, const char *val, void *place_ignored)
+cmd_spec_queue_type (const char *com, const char *val, void *place_ignored _GL_UNUSED)
 {
   static const struct decode_item choices[] = {
     { "fifo", queue_type_fifo },

john-peterson commented 9 years ago

the other is a show-stopper which prevents generating the docs here

How do I generate the docs to detect that error? This command doesn't show any error about it:

(cd doc; make)
rockdaboot commented 9 years ago

Normally, this will be done automatically by 'make'. Maybe something is missing on your installation (e.g. pod2man, textinfo, makeinfo) so the creation is skipped? The error was:

wget.texi:1919: @itemx must follow @item
Makefile:1346: recipe for target 'wget.info' failed
make[2]: *** [wget.info] Error 1

john-peterson commented 9 years ago

Normally, this will be done automatically by 'make'.

Is there a make target for that, like 'make docs', that does something different from (cd doc; make)?

Maybe something is missing on your installation (e.g. pod2man, textinfo, makeinfo) so the creation is skipped ?

There's nothing about texinfo in config.log. This is the makeinfo and pod2man output:

configure:38656: checking for makeinfo
configure:38683: result: ${SHELL} /d/repo/wget/build-aux/missing --run makeinfo

configure:38745: checking for pod2man
configure:38763: found /usr/bin/pod2man
configure:38776: result: /usr/bin/pod2man

Am I supposed to run that command to get more info?

$ /d/repo/wget/build-aux/missing --run makeinfo
makeinfo: missing file argument.
Try `makeinfo --help' for more information.

makeinfo is 4.13

$ makeinfo --version
makeinfo (GNU texinfo) 4.13
rockdaboot commented 9 years ago

'textinfo' is a typo, should be texinfo ;-) 'cd doc; make clean; make' should output:

test -z "wget.dvi wget.pdf wget.ps wget.html" \
  || rm -rf wget.dvi wget.pdf wget.ps wget.html
test -z "*~ *.bak *.cat *.pod" || rm -f *~ *.bak *.cat *.pod
rm -rf wget.t2d wget.t2p
rm -f vti.tmp
oms@blitz-lx:~/src/wget/doc$ make
./texi2pod.pl -D VERSION="1.16.1.36-8238-dirty" ./wget.texi wget.pod
/usr/bin/pod2man --center="GNU Wget" --release="GNU Wget 1.16.1.36-8238-dirty" wget.pod > wget.1

So maybe it is this ./texi2pod.pl working differently here (or for you)?

john-peterson commented 9 years ago

I get the error now. Not sure why I didn't get it before; maybe because I didn't do make clean.

(cd doc; make clean; make)

../../doc/wget.texi:1919: @itemx must follow @item
Makefile:1346: recipe for target `../../doc/wget.info' failed
make: *** [../../doc/wget.info] Error 1
john-peterson commented 9 years ago

email

After amending, could you post your suggestion to bug-wget@gnu.org mailing list ? A short explanation + a link to this page should be ok.

OK, email sent.

Feedback is wanted for this patch: https://github.com/mirror/wget/pull/1

john-peterson commented 9 years ago

basic problem

as I understand your aim, you want Wget behave a bit more like a browser in respect to downloading. This means after downloading the first HTML page, first download non-HTML links (mainly images), second HTML pages.

yes

depth doesn't matter

I don't see a reason why the 'deepness' of those HTML pages should matter when queuing. Since a user doesn't know how deep the link is that he clicks on.

Yup, depth doesn't matter.

alternative solution

enqueue html last isn't enough

This leads to a queuing without sorting: put the HTML links at the bottom and the non-HTML links to the top. This would lead to a download order that you documented under 'lifo download links directly after its parent page'.

Keeping FIFO and enqueuing the HTML links last (with a sort) isn't enough, because all depth-n links are still downloaded before any depth n+1 links.

FIFO with HTML enqueued last ≠ LIFO with HTML enqueued first

john-peterson commented 9 years ago

enqueue html last isn't enough

This is not what I said. I said: enqueue html last + enqueue non-html first

This is basically the same as having two queues: one for HTML and one for non-HTML. non-HTML working as LIFO, always picked before HTML. If empty, pick from HTML queue (FIFO).

Show it with code, because I don't understand.

The current FIFO code is:

while (1)
    // FIFO
    url_dequeue

    if (descend)
        for (; child; child = child->next)
            url_enqueue

The LIFO solution is:

while (1)
    // LIFO
    url_dequeue

    if (descend)
        // place html pages on top
        ll_bubblesort(&child);
        for (; child; child = child->next)
            url_enqueue
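
For comparison, here is a rough editorial sketch of the "two queues" idea quoted above, with hypothetical container types rather than wget's actual url_queue: non-HTML links go on a LIFO stack that is always served first, and HTML links go on a FIFO queue that is only served when the stack is empty.

/* hypothetical stand-ins, not wget's real queue structures */
struct url_stack { const char *urls[1024]; int top; };          /* non-HTML, LIFO */
struct url_fifo  { const char *urls[1024]; int head, tail; };   /* HTML, FIFO */

static void
push_non_html (struct url_stack *s, const char *url)
{
  s->urls[s->top++] = url;
}

static void
enqueue_html (struct url_fifo *q, const char *url)
{
  q->urls[q->tail++] = url;
}

/* Pick the next URL to download: the newest non-HTML link if there is one,
   otherwise the oldest pending HTML page, otherwise NULL (done).  */
static const char *
next_url (struct url_stack *s, struct url_fifo *q)
{
  if (s->top > 0)
    return s->urls[--s->top];
  if (q->head < q->tail)
    return q->urls[q->head++];
  return NULL;
}

With this scheme a page's images are still fetched right after the page itself, but the HTML pages are visited in discovery (FIFO) order instead of the depth-first order that the LIFO patch produces.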
john-peterson commented 9 years ago

Closed in favor of #2.