rivermont / spidy

The simple, easy to use command line web crawler.
GNU General Public License v3.0
334 stars, 69 forks

unusable #81

Closed T35R6braPwgDJKq closed 1 year ago

T35R6braPwgDJKq commented 3 years ago

Hi, I tried to use spidy because it looked promising. Is it dead?

First: sudo pip install -r requirements.txt doesn't work; reppy is not installable (Python 3.9).

Second: Docker is a PITA... Please look into ConfigArgParse if you need config files, BUT make sure that arguments can be used as well. With Docker, there is no error log... I ended up with docker run --rm -it -v $PWD:/data -w /data --entrypoint /src/app/spidy/crawler.py spidy so that the error log is accessible (why is there no config option?!)

Why is a suffix on the config file enforced? What is that? Windows?

Third: my config contained either an IP or a hostname (resolved via /etc/hosts). Spidy did not spider either. For the hostname option it gave:

ERROR: OSError EXT: HTTPConnectionPool(host='example.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4ae176ecc0>: Failed to establish a new connection: [Errno -2] Name or service not known',))

Seems that it doesn't respect /etc/hosts?! But the IP option didn't work either, e.g. '192.168.1.55/wiki/'

rivermont commented 3 years ago

Until Reppy is updated on PyPI with support for Python 3.9, we will need to remove it as a robots.txt parser, at least for versions >3.8 (see discussion at scrapy/scrapy/issues/5230, possibly fixed in scrapy/scrapy/pull/4759?). I am currently looking into possible replacements; in the meantime I may create a separate branch without Reppy.
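For illustration only, one candidate replacement is Python's standard-library urllib.robotparser, which needs no compiled extension and so installs on any Python 3 version. This is a minimal sketch, not what spidy actually adopted: the build_parser helper, the SAMPLE rules, and the "spidy" user-agent string are all assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Hypothetical helper: parse robots.txt text that was already downloaded."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access happens here.
    parser.parse(robots_txt.splitlines())
    return parser


# Made-up rules, used only to exercise the parser.
SAMPLE = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = build_parser(SAMPLE)

# "spidy" is an assumed user-agent name, not taken from the project's code.
print(parser.can_fetch("spidy", "http://example.com/wiki/"))
print(parser.can_fetch("spidy", "http://example.com/private/page"))
print(parser.crawl_delay("spidy"))
```

The standard-library parser is slower and less feature-rich than Reppy, so whether it is an acceptable trade-off for a crawler depends on the workload.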

The Docker problem is a known issue, see #72. I am hoping to get it looked into soon.
The config process could definitely be updated; if a rewrite happens in the future, that is one thing that needs to be done.
I'm not sure what config suffix you're referring to; do you mean the file path?

I'm not sure what the issue regarding your hosts file is; spidy doesn't make requests in some unusual way that would bypass the hosts file. It looks like your local wiki was not connecting for some reason, or maybe there is a separate connection issue?
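A quick way to narrow down this kind of problem outside of spidy is to check the two things the report touches on: whether the hostname resolves through the system resolver (which normally consults /etc/hosts on Linux), and whether the start URL carries a scheme. The sketch below is an assumption-laden diagnostic, not part of spidy; resolves and has_scheme are hypothetical helper names, and whether spidy adds a missing scheme itself is not established in this thread.

```python
import socket
from urllib.parse import urlsplit


def has_scheme(url: str) -> bool:
    """Return True if the URL starts with an explicit http or https scheme."""
    return urlsplit(url).scheme in ("http", "https")


def resolves(hostname: str, port: int = 80) -> bool:
    """Return True if the system resolver can map hostname to an address.

    socket.getaddrinfo goes through the OS resolver, so an entry in
    /etc/hosts is normally honored here if the system is set up to use it.
    """
    try:
        socket.getaddrinfo(hostname, port)
    except socket.gaierror:
        return False
    return True


# The start URL from the report has no scheme; the second form adds one.
print(has_scheme("192.168.1.55/wiki/"))
print(has_scheme("http://192.168.1.55/wiki/"))

# An IP literal needs no lookup; swap in the real hostname to test /etc/hosts.
print(resolves("127.0.0.1"))
```

If resolves() returns False for the hostname inside the Docker container but True on the host, the container is probably not seeing the host's /etc/hosts, since containers get their own copy of that file.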

rivermont commented 1 year ago

Closing, as the first two issues are covered by #89 and #72, and the third looks like a personal environment issue.