rivermont / spidy

The simple, easy to use command line web crawler.
GNU General Public License v3.0

Fails to crawl certain sites #70

Closed: PtrMan closed this issue 3 years ago

PtrMan commented 6 years ago

Expected Behavior

crawl like hell

Actual Behavior

dies with an unknown error

Steps to Reproduce the Problem

echo "https://www.golem.de/"> ./crawler_todo.txt spidy

What I've tried so far:

Using spidy

rivermont commented 6 years ago

Hi @PtrMan, can you provide some more details about how you were using the crawler? If you tried to cat a link into crawler_todo.txt before running the crawler and didn't point the crawler at that file, it might have been overwritten. If you started the crawler and put the link in the file after it had run for a bit, it's likely that there were already other links queued and the crawler never got to yours. Also, can you please provide the error that you encountered?
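
A minimal sketch of the overwrite scenario, assuming only general Python file-mode behaviour (not spidy's actual implementation): if the crawler reopens crawler_todo.txt in write mode, any URL seeded into it beforehand is truncated away, whereas append mode would preserve it.

from pathlib import Path

todo = Path("crawler_todo.txt")
todo.write_text("https://www.golem.de/\n")   # user seeds one URL

# Write mode truncates the file, so the seeded URL is lost:
with open(todo, "w") as f:
    f.write("https://example.com/\n")        # hypothetical queue dump
print(todo.read_text())                      # only https://example.com/ remains

# Append mode would have kept it:
todo.write_text("https://www.golem.de/\n")
with open(todo, "a") as f:
    f.write("https://example.com/\n")
print(todo.read_text())                      # both URLs present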

PtrMan commented 6 years ago

To make it clearer:

echo "https://www.golem.de/" >> ./crawler_todo.txt

(it was the only URL)

rivermont commented 6 years ago

Hmm, it seems to be a problem with the robots.txt parser.

[10:45:48] [reppy] [WORKER #0] [ROBOTS] [INFO]: Reading robots.txt file at: /robots.txth/robots.txtt/robots.txtt/robots.txtp/robots.txts/robots.txt:/robots.txt//robots.txt//robots.txtg/robots.txto/robots.txtl/robots.txte/robots.txtm/robots.txt./robots.txtd/robots.txte/robots.txt

It should just find 'https://golem.de/robots.txt'.

I'll have to look into this.
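
The mangled path in the log has roughly the shape you get when a single URL string is handed to code that expects a sequence of URLs: iterating over the string yields characters, so "/robots.txt" gets attached to every character instead of to the whole URL. A minimal sketch of that general failure mode, purely illustrative and not spidy's or reppy's actual code:

def robots_urls(urls):
    # Intended to receive an iterable of URL strings.
    return [u + "/robots.txt" for u in urls]

# Correct call: a one-element list.
print(robots_urls(["https://golem.de"]))
# ['https://golem.de/robots.txt']

# Buggy call: a bare string. Iterating a string yields its characters,
# producing output much like the log line above:
print("".join(robots_urls("https://golem.de")))
# h/robots.txtt/robots.txtt/robots.txtp/robots.txts/robots.txt: ... e/robots.txt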

rivermont commented 3 years ago

Resolved by #77