khpeek / scraper-compose

Scrapy example project using Tor (through Privoxy) in a Docker Compose multi-container application

scrapercompose_scraper_1 exited with code 1 #2

Open olegario96 opened 6 years ago

olegario96 commented 6 years ago

I cloned the repository and tried to follow the two steps from the README.md. The problem is that when I run docker-compose up, the following messages are shown:

tor_1      | Sep 14 00:57:54.000 [notice] Bootstrapped 50%: Loading relay descriptors
tor_1      | Sep 14 00:57:57.000 [notice] Bootstrapped 56%: Loading relay descriptors
tor_1      | Sep 14 00:57:58.000 [notice] Bootstrapped 66%: Loading relay descriptors
tor_1      | Sep 14 00:57:59.000 [notice] Bootstrapped 72%: Loading relay descriptors
tor_1      | Sep 14 00:57:59.000 [notice] Bootstrapped 77%: Loading relay descriptors
tor_1      | Sep 14 00:57:59.000 [notice] Bootstrapped 80%: Connecting to the Tor network
tor_1      | Sep 14 00:58:00.000 [notice] Bootstrapped 85%: Finishing handshake with first hop
scraper_1  | Operation timed out
tor_1      | Sep 14 00:58:01.000 [notice] Bootstrapped 90%: Establishing a Tor circuit
scrapercompose_scraper_1 exited with code 1
tor_1      | Sep 14 00:58:02.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
tor_1      | Sep 14 00:58:02.000 [notice] Bootstrapped 100%: Done

What could be causing this?

khpeek commented 6 years ago

Hi Olegario,

I tried to just run the scrapy crawl quotes command in the /bin/ash shell, and got this error:

Kurts-MacBook-Pro:tutorial kurtpeek$ docker run -it scraper-compose_scraper /bin/ash
/scraper/tutorial # scrapy crawl quotes
2018-09-14 03:19:58 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: tutorial)
2018-09-14 03:19:58 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.7.0 (default, Sep 12 2018, 02:07:16) - [GCC 6.4.0], pyOpenSSL 18.0.0 (OpenSSL 1.0.2o  27 Mar 2018), cryptography 2.3.1, Platform Linux-4.9.93-linuxkit-aufs-x86_64-with
2018-09-14 03:19:58 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tutorial.spiders']}
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 150, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
    func(*a, **kw)
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 157, in _run_command
    cmd.run(args, opts)
  File "/usr/local/lib/python3.7/site-packages/scrapy/commands/crawl.py", line 57, in run
    self.crawler_process.crawl(spname, **opts.spargs)
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 170, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 198, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 203, in _create_crawler
    return Crawler(spidercls, self.settings)
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 55, in __init__
    self.extensions = ExtensionManager.from_crawler(self)
  File "/usr/local/lib/python3.7/site-packages/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python3.7/site-packages/scrapy/middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "/usr/local/lib/python3.7/site-packages/scrapy/utils/misc.py", line 44, in load_object
    mod = import_module(module)
  File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.7/site-packages/scrapy/extensions/telnet.py", line 12, in <module>
    from twisted.conch import manhole, telnet
  File "/usr/local/lib/python3.7/site-packages/twisted/conch/manhole.py", line 154
    def write(self, data, async=False):
                              ^
SyntaxError: invalid syntax

It seems from https://github.com/scrapy/scrapy/issues/3143 that this is an issue with Scrapy itself under Python 3.7, in which async became a reserved keyword; the traceback shows the failure in the Twisted module that Scrapy's telnet extension imports. You might want to try choosing a different base image to downgrade the version of Python; feel free to submit a PR if that works!
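To make the SyntaxError concrete, here is a minimal sketch that checks whether the running interpreter accepts the function signature from the traceback above. The helper name `compiles` is made up for this illustration; only the built-in `compile()` is used.

```python
import sys

# The signature from twisted/conch/manhole.py, as shown in the traceback.
SOURCE = "def write(self, data, async=False):\n    pass\n"


def compiles(source: str) -> bool:
    """Return True if `source` is valid syntax for the running interpreter."""
    try:
        compile(source, "<manhole-snippet>", "exec")
        return True
    except SyntaxError:
        return False


# On Python 3.7 and later, `async` is a keyword and cannot be a parameter name.
print(sys.version_info[:2], "accepts the signature:", compiles(SOURCE))
```

Running this inside the scraper container should tell you quickly whether the Python version in the image is affected, without starting Scrapy at all.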

By the way, this is a fairly 'special' implementation of anonymous scraping which uses the Tor control port to periodically change your apparent IP address. If you don't need this functionality, you could use a simpler image like docker-tor-privoxy-alpine.
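For context, asking Tor for a new circuit over the control port amounts to sending two text commands. The sketch below uses a raw socket rather than the stem library that the project installs, so it is not the repository's code; the function name, the password value, and control port 9051 in the usage comment are assumptions for illustration.

```python
import socket


def request_new_identity(host: str, port: int, password: str,
                         timeout: float = 10.0) -> bool:
    """Authenticate to a Tor control port and send SIGNAL NEWNYM.

    Returns True only if Tor answers both commands with a 250 status line.
    Assumes password authentication and a password without double quotes.
    """
    commands = ('AUTHENTICATE "{}"'.format(password), "SIGNAL NEWNYM")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with sock.makefile("rwb") as stream:
            for command in commands:
                stream.write(command.encode("ascii") + b"\r\n")
                stream.flush()
                reply = stream.readline().decode("ascii").strip()
                if not reply.startswith("250"):
                    return False
    return True


# Hypothetical usage from the scraper container:
# request_new_identity("tor", 9051, "my-control-password")
```

Tor rate-limits NEWNYM requests, so calling this in a tight loop will not give a new exit address on every call.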

olegario96 commented 6 years ago

How can I downgrade to version 3.6?

olegario96 commented 6 years ago

I managed to change the Python version using this Dockerfile:

# Adapted from trcook/docker-scrapy
FROM python:3.6-alpine
RUN apk --update add python3
RUN echo 'alias python=python3.6' >> ~/.bashrc
RUN apk --update add libxml2-dev libxslt-dev libffi-dev gcc musl-dev libgcc openssl-dev curl
RUN pip install scrapy scrapy-fake-useragent stem pyparsing python-dateutil requests
COPY tutorial /scraper/tutorial
COPY wait-for/wait-for /scraper/tutorial
WORKDIR /scraper/tutorial
CMD ["./wait-for", "tor:9050", "--", "scrapy", "crawl", "quotes"]

But the problem persists.

I removed the --silent flag from the curl command, and it now reports:

Received HTTP code 500 from proxy after CONNECT

argalasjr commented 1 year ago

The torrc file needs to be updated for this to work.

Add this line:

SOCKSport 0.0.0.0:9050
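
By default Tor binds its SOCKS port to localhost only, so other containers cannot reach it; listening on 0.0.0.0 exposes it on the Compose network. A minimal sketch for checking this from the scraper container is below. The helper name is hypothetical, and the host "tor" and port 9050 are taken from the Dockerfile CMD earlier in this thread.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed name lookups.
        return False


# Hypothetical usage from inside the scraper container:
# print("tor:9050 reachable:", port_is_open("tor", 9050))
```

If this still returns False after the torrc change, the service name or port in docker-compose.yml is the next thing to check.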