istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

Docker compose : kafka and crawler containers restarting forever #206

Closed llermaly closed 5 years ago

llermaly commented 5 years ago

Hello, I'm trying to get the quickstart working, but after pulling the images, the wurstmeister/kafka and scrapy-cluster:crawler containers stay stuck in the restarting status.

From the error logs:

kafka:

 ERROR: No listener or advertised hostname configuration provided in environment.
2018-11-19T22:46:22.884011669Z        Please define KAFKA_LISTENERS / (deprecated) KAFKA_ADVERTISED_HOST_NAME

crawler:

ERROR: Unable to connect to Kafka in Pipeline, raising exit flag.
2018-11-19T22:47:41.710214263Z Unhandled error in Deferred:

Tests pass on the other containers, but after passing they start throwing Kafka-related errors.

Thanks
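A crawler error like the one above can be narrowed down by checking whether the broker address accepts TCP connections at all. The following is a minimal Python 3 sketch of such a check, not part of scrapy-cluster; the function name `can_connect` is hypothetical, and the host and port you pass in depend on your own compose file.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds.

    Hypothetical helper for illustration only. A successful TCP connection
    shows that something is listening, not that Kafka is healthy.
    """
    try:
        # create_connection resolves the host name and opens the socket;
        # the context manager closes it again right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Run from inside the crawler container, a call such as `can_connect("kafka", 9092)` (service name and port assumed here) would return False while the Kafka container is still restarting.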

llermaly commented 5 years ago

Solved :)

Step 1 :

Modify the docker-compose.yml file inside the scrapy-cluster folder and add some environment variables to the kafka service:

environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_ADVERTISED_HOST_NAME: "kafka"
      KAFKA_ADVERTISED_PORT: "9092"

KAFKA_ADVERTISED_HOST_NAME and KAFKA_ADVERTISED_PORT were missing.

Run docker-compose up -d

Now the tests are passing.
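The Kafka error message asks for KAFKA_LISTENERS or the deprecated KAFKA_ADVERTISED_HOST_NAME, so a compose environment can be checked for that condition before starting the containers. This is a rough Python sketch based only on the wording of that error message; `has_listener_config` is a hypothetical helper, and the two dictionaries are hand-copied from the snippet above rather than parsed from docker-compose.yml.

```python
def has_listener_config(env):
    """Return True if `env` sets at least one variable named in the Kafka error message.

    Hypothetical check for illustration; the real startup script of the
    image may apply additional rules.
    """
    required_any = ("KAFKA_LISTENERS", "KAFKA_ADVERTISED_HOST_NAME")
    # A variable counts as set only if it has a non-empty value.
    return any(str(env.get(name, "")).strip() for name in required_any)


# Environment before the fix (assumed: only the two variables that were already present).
release_env = {
    "KAFKA_ZOOKEEPER_CONNECT": "zookeeper:2181",
    "KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR": 1,
}

# Environment after adding the two missing variables.
fixed_env = dict(
    release_env,
    KAFKA_ADVERTISED_HOST_NAME="kafka",
    KAFKA_ADVERTISED_PORT="9092",
)

print(has_listener_config(release_env))  # False
print(has_listener_config(fixed_env))    # True
```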

llermaly commented 5 years ago

Not so easy... I'm getting this error from the rest service:

Traceback (most recent call last):
  File "rest_service.py", line 727, in <module>
    rest_service.run()
  File "rest_service.py", line 449, in run
    self.app.run(host='0.0.0.0', port=self.settings['FLASK_PORT'])
  File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 841, in run
    run_simple(host, port, self, **options)
  File "/usr/local/lib/python2.7/site-packages/werkzeug/serving.py", line 736, in run_simple
    inner()
  File "/usr/local/lib/python2.7/site-packages/werkzeug/serving.py", line 696, in inner
    fd=fd)
  File "/usr/local/lib/python2.7/site-packages/werkzeug/serving.py", line 590, in make_server
    passthrough_errors, ssl_context, fd=fd)
  File "/usr/local/lib/python2.7/site-packages/werkzeug/serving.py", line 501, in __init__
    HTTPServer.__init__(self, (host, int(port)), handler)
  File "/usr/local/lib/python2.7/SocketServer.py", line 417, in __init__
    self.server_bind()
  File "/usr/local/lib/python2.7/BaseHTTPServer.py", line 108, in server_bind
    SocketServer.TCPServer.server_bind(self)
  File "/usr/local/lib/python2.7/SocketServer.py", line 431, in server_bind
    self.socket.bind(self.server_address)
  File "/usr/local/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 98] Address already in use

Found this:

root@7ed8a515ded0:/usr/src/app# lsof -i:5343
COMMAND PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
python    1 root    8u  IPv4 846730      0t0  TCP *:5343 (LISTEN)

llermaly commented 5 years ago

Solved :)

Go to localsettings.py for the rest service and change the Flask port (FLASK_PORT) to 5344. Then go to docker-compose.yml and change the rest port to match, e.g. 5344.

Not sure if this could cause problems later; hopefully not.
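Before picking a replacement port, it can help to confirm whether a port is really taken. Below is a minimal Python 3 sketch that tries the same `bind` call that failed in the traceback; `port_is_free` is a hypothetical helper, and the thread's containers run Python 2.7, where the exception would be `socket.error` instead of `OSError`.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port) right now.

    Hypothetical helper for illustration. The result is only a snapshot:
    another process may grab the port after this function returns.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True
    except OSError:
        # Errno 98 (Address already in use) on Linux ends up here.
        return False
    finally:
        sock.close()
```

Inside the rest container, `port_is_free(5343)` returning False would match the lsof output above, while `port_is_free(5344)` would show whether the new port is available.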

regards

madisonb commented 5 years ago

I am not sure what docker-compose file you were using, but https://github.com/istresearch/scrapy-cluster/blob/master/docker-compose.yml#L49 has the KAFKA_ADVERTISED_HOST_NAME already there.

As to your second comment, it looks like you already have something running on that port or there is some other issue. A port conflict within the container shouldn't be happening as there is only one process running.

This project is mostly maintained on the dev branch, so if you can give me steps to reproduce the above on that branch I would be happy to look into it.

Otherwise, I am closing this for now.

llermaly commented 5 years ago

I downloaded and unzipped the release from here, as the tutorial says:

https://github.com/istresearch/scrapy-cluster/releases
https://github.com/istresearch/scrapy-cluster/archive/v1.2.1.zip

And KAFKA_ADVERTISED_HOST_NAME is not there.

Regarding the second issue, I'm not running anything before launching rest in Docker, so I'm probably using the wrong Docker containers.

Thanks!

madisonb commented 5 years ago

Ah gotcha. Looking at the diff between master and that release, master indeed has the hostname fix (https://github.com/istresearch/scrapy-cluster/compare/v1.2.1...master), but I didn't deem it worthy of a release. Feel free to move over to the Gitter chat for more informal questions not directly related to bugs.