eadmaster closed this issue 10 years ago
Thanks for reporting. I fixed it in the 'develop' branch. It was a stupid uppercase parsing bug, introduced when doing some performance optimizations.
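For context, a hedged sketch of what such an uppercase regression typically looks like (the function and names below are illustrative, not the actual Mget code): a tuned fast path that compares tag or attribute names byte-for-byte stops matching markup written as `<A HREF=...>`, because HTML names are case-insensitive.

```c
#include <stdio.h>
#include <strings.h>  /* strcasecmp */

/* Illustrative only: a case-sensitive fast path misses uppercase HTML. */
static int is_href(const char *attr)
{
    /* Buggy "optimized" version: return !strcmp(attr, "href");
       That no longer matches "HREF" or "Href". */
    return !strcasecmp(attr, "href");  /* HTML names are case-insensitive */
}

int main(void)
{
    printf("%d %d %d\n", is_href("href"), is_href("HREF"), is_href("src"));
    /* prints: 1 1 0 */
    return 0;
}
```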
OK, but now I'm getting a lot of segmentation faults during crawling:
mget -r -np http://www.gamewinners.com/playstation2/index.htm
mget -r -np http://www.gamewinners.com/gamecube/index.htm
mget -r -np http://www.gamewinners.com/GEN/index.htm
mget -r -np http://www.gamewinners.com/DC/index.htm
mget -r -np http://www.gamewinners.com/nes/index.htm
mget -r -np http://www.gamewinners.com/N64/index.htm
Another bug just fixed in the 'develop' branch ;-) Thanks for reporting.
FYI, it was an href="file://blabla" attribute. The library parsed it correctly, but Mget unconditionally checked the 'host' part of that URL, which was of course NULL. Mget is only interested in HTTP and HTTPS URLs, so I added a check for these. The library used to return NULL when parsing unknown URLs; I changed that a while ago and forgot to insert the needed check in Mget.
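To make the failure mode concrete, here is a minimal hedged sketch (the struct and function names are made up, not Mget's real API) of dereferencing a NULL host for a non-network URL, and the scheme check that prevents it:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical parsed-URL structure; field names are illustrative. */
struct parsed_url {
    const char *scheme;  /* "http", "https", "file", ... */
    const char *host;    /* NULL for URLs like file://... */
    const char *path;
};

static void queue_for_download(const struct parsed_url *url)
{
    /* Before the fix, something like strlen(url->host) ran
       unconditionally and segfaulted on href="file://..." links. */
    if (strcmp(url->scheme, "http") != 0 && strcmp(url->scheme, "https") != 0)
        return;  /* only HTTP(S) is downloaded; skip everything else */

    printf("queueing %s://%s%s\n", url->scheme, url->host, url->path);
}

int main(void)
{
    struct parsed_url file_url = { "file", NULL, "/blabla" };
    struct parsed_url http_url = { "http", "www.example.com", "/index.htm" };

    queue_for_download(&file_url);  /* now safely ignored */
    queue_for_download(&http_url);  /* queued */
    return 0;
}
```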
Perfect, now it works correctly! The only little glitch I see now is that it does not exit when finished (I have to press Ctrl+C).
I already tracked that down a while ago and fixed it. Seems it has crept back in, or my fix got lost ... I'll investigate in the next few days.
Use --no-robots (or --num-threads=1) to work around this problem. It seems to be a multi-threading race condition and also a 'Heisenbug', meaning that with valgrind or additional debugging lines I can't reproduce the problem :-(. Not much time right now, but since it reduces to the robots code, I'll find it in the next few days.
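As a hedged illustration of the kind of race that can keep a crawler's worker threads from exiting (generic pthreads code, not the actual Mget robots implementation): if the queue-empty/none-in-flight termination check and the wakeup broadcast are not done under the same mutex as the wait, a worker can miss the final signal and block forever in pthread_cond_wait().

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  work_ready = PTHREAD_COND_INITIALIZER;
static int  jobs_queued = 8;  /* pretend 8 URLs are waiting */
static int  jobs_in_flight;
static bool finished;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!finished) {
        while (jobs_queued == 0 && !finished)
            pthread_cond_wait(&work_ready, &lock);
        if (finished)
            break;
        jobs_queued--;
        jobs_in_flight++;
        pthread_mutex_unlock(&lock);
        /* ... download one URL, possibly enqueueing follow-up jobs ... */
        pthread_mutex_lock(&lock);
        jobs_in_flight--;
        /* The "crawl is over" test and the broadcast must run under the
           same mutex as the wait above; otherwise a sleeping worker can
           miss the last wakeup and the process never exits. */
        if (jobs_queued == 0 && jobs_in_flight == 0) {
            finished = true;
            pthread_cond_broadcast(&work_ready);
        }
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    puts("all workers exited");
    return 0;
}
```

This also explains why --num-threads=1 sidesteps the hang: a single worker never has to wake anyone else, so the interleaving cannot occur.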
The fix is in the 'develop' branch. Thanks for reporting.
This is one of them:
I've tried both release v0.1.5 and the latest 'master' branch; I always get only one level of recursion. With vanilla wget it works fine, but very slowly...