rivermont / spidy

The simple, easy to use command line web crawler.
GNU General Public License v3.0

Failed crawl for http://www.frankshospitalworkshop.com/ #75

Closed: brainstorm closed this issue 2 years ago

brainstorm commented 4 years ago
$ docker run --rm -it -v $PWD:/data spidy
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Starting spidy Web Crawler version 1.6.5
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Report any problems to GitHub at https://github.com/rivermont/spidy
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Creating classes...
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Creating functions...
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Creating variables...
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Should spidy load settings from an available config file? (y/n):
n
[01:01:40] [spidy] [WORKER #0] [INIT] [INFO]: Please enter the following arguments. Leave blank to use the default values.
[01:01:40] [spidy] [WORKER #0] [INIT] [INPUT]: How many parallel threads should be used for crawler? (Default: 1):

[01:01:47] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy load from existing save files? (y/n) (Default: Yes):

[01:01:54] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy raise NEW errors and stop crawling? (y/n) (Default: No):

[01:01:55] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy save the pages it scrapes to the saved folder? (y/n) (Default: Yes):

[01:01:55] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy zip saved documents when autosaving? (y/n) (Default: No):

[01:01:57] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy download documents larger than 500 MB? (y/n) (Default: No):

[01:01:58] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy scrape words and save them? (y/n) (Default: Yes):

[01:01:59] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy restrict crawling to a specific domain only? (y/n) (Default: No):
y
[01:02:02] [spidy] [WORKER #0] [INIT] [INPUT]: What domain should crawling be limited to? Can be subdomains, http/https, etc.
http://www.frankshospitalworkshop.com/
[01:02:07] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy respect sites' robots.txt? (y/n) (Default: Yes):
y
[01:02:13] [spidy] [WORKER #0] [INIT] [INPUT]: What HTTP browser headers should spidy imitate?
[01:02:13] [spidy] [WORKER #0] [INIT] [INPUT]: Choices: spidy (default), Chrome, Firefox, IE, Edge, Custom:

[01:02:14] [spidy] [WORKER #0] [INIT] [INPUT]: Location of the TODO save file (Default: crawler_todo.txt):
/data/crawler_todo.txt
[01:02:24] [spidy] [WORKER #0] [INIT] [INPUT]: Location of the DONE save file (Default: crawler_done.txt):
/data/crawler_done.txt
[01:02:31] [spidy] [WORKER #0] [INIT] [INPUT]: Location of the words save file (Default: crawler_words.txt):
/data/crawler_words.txt
[01:02:38] [spidy] [WORKER #0] [INIT] [INPUT]: After how many queried links should the crawler autosave? (Default: 100):

[01:02:39] [spidy] [WORKER #0] [INIT] [INPUT]: After how many new errors should spidy stop? (Default: 5):

[01:02:40] [spidy] [WORKER #0] [INIT] [INPUT]: After how many known errors should spidy stop? (Default: 10):

[01:02:41] [spidy] [WORKER #0] [INIT] [INPUT]: After how many HTTP errors should spidy stop? (Default: 20):

[01:02:42] [spidy] [WORKER #0] [INIT] [INPUT]: After encountering how many new MIME types should spidy stop? (Default: 20):

[01:02:43] [spidy] [WORKER #0] [INIT] [INFO]: Loading save files...
[01:02:43] [spidy] [WORKER #0] [INIT] [INFO]: Successfully started spidy Web Crawler version 1.6.5...
[01:02:43] [spidy] [WORKER #0] [INIT] [INFO]: Using headers: {'User-Agent': 'spidy Web Crawler (Mozilla/5.0; bot; +https://github.com/rivermont/spidy/)', 'Accept-Language': 'en_US, en-US, en', 'Accept-Encoding': 'gzip', 'Connection': 'keep-alive'}
[01:02:43] [spidy] [WORKER #0] [INIT] [INFO]: Spawning 1 worker threads...
[01:02:43] [spidy] [WORKER #1] [INIT] [INFO]: Starting crawl...
[01:02:43] [reppy] [WORKER #0] [ROBOTS] [INFO]: Reading robots.txt file at: http://www.frankshospitalworkshop.com/robots.txt
[01:02:45] [spidy] [WORKER #1] [CRAWL] [ERROR]: An error was raised trying to process http://www.frankshospitalworkshop.com/equipment.html
[01:02:45] [spidy] [WORKER #1] [ERROR] [INFO]: An XMLSyntaxError occurred. A web dev screwed up somewhere.
[01:02:45] [spidy] [WORKER #1] [LOG] [INFO]: Saved error message and timestamp to error log file
[01:02:45] [spidy] [WORKER #0] [CRAWL] [INFO]: Stopping all threads...
[01:02:45] [spidy] [WORKER #0] [CRAWL] [INFO]: I think you've managed to download the entire internet. I guess you'll want to save your files...
[01:02:45] [spidy] [WORKER #0] [SAVE] [INFO]: Saved TODO list to /data/crawler_todo.txt
[01:02:45] [spidy] [WORKER #0] [SAVE] [INFO]: Saved DONE list to /data/crawler_todo.txt
[01:02:45] [spidy] [WORKER #0] [SAVE] [INFO]: Saved 0 words to /data/crawler_words.txt
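For reference, the "XMLSyntaxError occurred" message above corresponds to lxml failing to parse the fetched page. Below is a minimal, hypothetical sketch (not spidy's actual code) of how that error can surface while extracting links with lxml, and how catching it would let a crawler skip the offending page instead of stopping the whole run:

    # Hypothetical link extraction that tolerates pages lxml cannot parse.
    import requests
    from lxml import etree, html

    def extract_links(url):
        """Return href values found at url, or [] if the markup cannot be parsed."""
        response = requests.get(url, timeout=10)
        try:
            # lxml raises etree.XMLSyntaxError on empty or badly malformed markup.
            tree = html.fromstring(response.content)
        except etree.XMLSyntaxError:
            # Log and skip the page rather than aborting the crawl.
            print(f"Skipping unparseable page: {url}")
            return []
        return tree.xpath('//a/@href')

    # e.g. extract_links('http://www.frankshospitalworkshop.com/equipment.html')

lxml's HTML parser is already fairly lenient, so when XMLSyntaxError does appear it usually means the response was empty or not HTML at all; whether the page in this report still triggers it may depend on the site's current markup.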
rivermont commented 3 years ago

I get no errors crawling the site, but I'm not using the Docker container. I'll try with Docker at some later point, but this may have been fixed by #77 or by a change in the site itself.

rivermont commented 2 years ago

This was resolved by #77.