eadmaster closed this issue 10 years ago
Thanks for reporting. I fixed it in the 'develop' branch. It was a stupid uppercase parsing bug, introduced when doing some performance optimizations.
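For context, a hedged sketch of what such an uppercase regression typically looks like (the function and names below are illustrative, not the actual Mget code): a tuned fast path that compares tag or attribute names byte-for-byte stops matching markup written as `<A HREF=...>`, because HTML names are case-insensitive.

```c
#include <stdio.h>
#include <strings.h>  /* strcasecmp */

/* Illustrative only: a case-sensitive fast path misses uppercase HTML. */
static int is_href(const char *attr)
{
    /* Buggy "optimized" version: return !strcmp(attr, "href");
       That no longer matches "HREF" or "Href". */
    return !strcasecmp(attr, "href");  /* HTML names are case-insensitive */
}

int main(void)
{
    printf("%d %d %d\n", is_href("href"), is_href("HREF"), is_href("src"));
    /* prints: 1 1 0 */
    return 0;
}
```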
OK, but now I'm getting a lot of segmentation faults during crawling:
mget -r -np http://www.gamewinners.com/playstation2/index.htm
mget -r -np http://www.gamewinners.com/gamecube/index.htm
mget -r -np http://www.gamewinners.com/GEN/index.htm
mget -r -np http://www.gamewinners.com/DC/index.htm
mget -r -np http://www.gamewinners.com/nes/index.htm
mget -r -np http://www.gamewinners.com/N64/index.htm
Another bug just fixed in the 'develop' branch ;-) Thanks for reporting.
FYI, it was an href="file://blabla" attribute. The library parsed it correctly, but Mget unconditionally checked the 'host' part of that URL, which was of course NULL. Mget is only interested in HTTP and HTTPS URLs, so I added a check for these. The library used to return NULL when parsing unknown URLs; I changed that a while ago and forgot to insert the needed check in Mget.
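To make the failure mode concrete, here is a minimal hedged sketch (the struct and function names are made up, not Mget's real API) of dereferencing a NULL host for a non-network URL, and the scheme check that prevents it:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical parsed-URL structure; field names are illustrative. */
struct parsed_url {
    const char *scheme;  /* "http", "https", "file", ... */
    const char *host;    /* NULL for URLs like file://... */
    const char *path;
};

static void queue_for_download(const struct parsed_url *url)
{
    /* Before the fix, something like strlen(url->host) ran
       unconditionally and segfaulted on href="file://..." links. */
    if (strcmp(url->scheme, "http") != 0 && strcmp(url->scheme, "https") != 0)
        return;  /* only HTTP(S) is downloaded; skip everything else */

    printf("queueing %s://%s%s\n", url->scheme, url->host, url->path);
}

int main(void)
{
    struct parsed_url file_url = { "file", NULL, "/blabla" };
    struct parsed_url http_url = { "http", "www.example.com", "/index.htm" };

    queue_for_download(&file_url);  /* now safely ignored */
    queue_for_download(&http_url);  /* queued */
    return 0;
}
```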
Perfect, now it works correctly! The only little glitch I see now is that it does not exit when finished (I have to press Ctrl+C).
I already tracked that down a while ago and fixed it. Seems it has crept back in, or my fix got lost ... I'll investigate in the next few days.
Use --no-robots (or --num-threads=1) to work around this problem. It seems to be a multi-threading race condition and also a 'Heisenbug', meaning that with valgrind or additional debugging lines I can't reproduce the problem :-(. Not much time right now, but since it reduces to the robots code, I'll find it in the next few days.
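As a hedged illustration of the kind of race that can keep a crawler's worker threads from exiting (generic pthreads code, not the actual Mget robots implementation): if the queue-empty/none-in-flight termination check and the wakeup broadcast are not done under the same mutex as the wait, a worker can miss the final signal and block forever in pthread_cond_wait().

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  work_ready = PTHREAD_COND_INITIALIZER;
static int  jobs_queued = 8;  /* pretend 8 URLs are waiting */
static int  jobs_in_flight;
static bool finished;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!finished) {
        while (jobs_queued == 0 && !finished)
            pthread_cond_wait(&work_ready, &lock);
        if (finished)
            break;
        jobs_queued--;
        jobs_in_flight++;
        pthread_mutex_unlock(&lock);
        /* ... download one URL, possibly enqueueing follow-up jobs ... */
        pthread_mutex_lock(&lock);
        jobs_in_flight--;
        /* The "crawl is over" test and the broadcast must run under the
           same mutex as the wait above; otherwise a sleeping worker can
           miss the last wakeup and the process never exits. */
        if (jobs_queued == 0 && jobs_in_flight == 0) {
            finished = true;
            pthread_cond_broadcast(&work_ready);
        }
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    puts("all workers exited");
    return 0;
}
```

This also explains why --num-threads=1 sidesteps the hang: a single worker never has to wake anyone else, so the interleaving cannot occur.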
The fix is in the 'develop' branch. Thanks for reporting.
This is one of them:
I've tried both release v0.1.5 and the latest 'master' branch; I always get only one level of recursion. With vanilla wget it works fine, but very slowly...