istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License
1.18k stars 323 forks

No output when dumping incoming or outbound_firehose #176

Closed cliff-km closed 6 years ago

cliff-km commented 6 years ago

I'm attempting to get started with 1.2.1 in Docker. I've downloaded the project and followed the Docker instructions in Getting Started. When doing the first scrape, I can dump and get output from the crawl, but not from demo.incoming or demo.outbound_firehose.

I don't think this is related, but I ran into compatibility issues with the latest Kafka image, so I pinned the version to 1.0.0 in docker-compose.yml, which seemed to be the latest when 1.2.1 was released. That got me past the issue, and it's the only change I've made to the project.

Also, all the tests pass in the docker images. However, in the redis monitor on the first run I get:

OK
test_process_item (__main__.TestRedisMonitor) ... No handlers could be found for logger "redis_lock"
ok

My steps are:

  1. docker-compose up -d
  2. [terminal 1] docker exec -i scrapycluster121_kafka_monitor_1 python kafkadump.py dump -t demo.crawled_firehose
  3. [terminal 2] docker exec -i scrapycluster121_kafka_monitor_1 python kafkadump.py dump -t demo.incoming
  4. [terminal 3] docker exec -i scrapycluster121_kafka_monitor_1 python kafkadump.py dump -t demo.outbound_firehose
  5. [terminal 4] docker exec -i scrapycluster121_kafka_monitor_1 python kafka_monitor.py feed '{"url": "http://dmoztools.net", "appid":"testapp", "crawlid":"abc1234", "maxdepth":1}'
  6. [terminal 4] docker exec -i scrapycluster121_kafka_monitor_1 python kafka_monitor.py feed '{"action":"info", "appid":"testapp", "uuid":"someuuid", "crawlid":"abc1234", "spiderid":"link"}'

After step 5 I start getting Scrapy output on terminal 1. I never get output on terminal 2 or 3.

madisonb commented 6 years ago

So your issue is that the redis-monitor doesn't seem to be connecting to your cluster?

What are the logs from the redis monitor container? The redis_lock logger warning is okay at this point; I don't think it is an issue (it comes from an internal library). Do the logs look like the process connects to the cluster?

You can also exec into the container and run ./run_docker_tests.sh to ensure that both the unit tests and integration tests pass. If that script hangs or doesn't complete within 5 minutes, you should double-check your configuration. If the redis monitor weren't working, it would make sense that you receive no output in terminal 3.

cliff-km commented 6 years ago

After setting the LOG_LEVELs to DEBUG and looking back through the logs, I found that the first couple of consumer and producer initializations fail. Maybe a race condition?
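For what it's worth, the pattern in those logs (create, fail, back off, retry) boils down to connect-with-backoff. A minimal sketch of that pattern; the `factory` callable standing in for the real consumer/producer construction is my own illustration, not the project's API:

```python
import time

def connect_with_backoff(factory, retries=5, base_delay=1.0, sleep=time.sleep):
    """Call factory() until it succeeds, doubling the delay between
    attempts -- roughly what the kafka-monitor log timestamps suggest."""
    delay = base_delay
    for attempt in range(retries):
        try:
            return factory()
        except Exception:
            if attempt == retries - 1:
                raise  # give up after the final attempt
            sleep(delay)
            delay *= 2
```

With the kafka-python client this would be used as something like `connect_with_backoff(lambda: KafkaConsumer(topic, bootstrap_servers=['kafka:9092']))`.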

Kafka monitor:

2018-05-02 19:15:27,847 [kafka-monitor] DEBUG: Logging to stdout
2018-05-02 19:15:27,848 [kafka-monitor] DEBUG: Creating new kafka consumer using brokers: ['kafka:9092'] and topic demo.incoming
2018-05-02 19:15:27,952 [kafka-monitor] ERROR: Couldn't initialize kafka consumer for topic
2018-05-02 19:15:28,953 [kafka-monitor] DEBUG: Creating new kafka consumer using brokers: ['kafka:9092'] and topic demo.incoming
2018-05-02 19:15:30,863 [kafka-monitor] ERROR: Couldn't initialize kafka consumer for topic
2018-05-02 19:15:32,866 [kafka-monitor] DEBUG: Creating new kafka consumer using brokers: ['kafka:9092'] and topic demo.incoming
2018-05-02 19:15:32,974 [kafka-monitor] DEBUG: Successfully connected to Kafka
2018-05-02 19:15:32,974 [kafka-monitor] DEBUG: Trying to load plugin plugins.zookeeper_handler.ZookeeperHandler
2018-05-02 19:15:33,039 [kafka-monitor] DEBUG: Connected to Redis in ZookeeperHandler
2018-05-02 19:15:33,039 [kafka-monitor] DEBUG: Successfully loaded plugin plugins.zookeeper_handler.ZookeeperHandler
2018-05-02 19:15:33,040 [kafka-monitor] DEBUG: Trying to load plugin plugins.stats_handler.StatsHandler
2018-05-02 19:15:33,041 [kafka-monitor] DEBUG: Connected to Redis in StatsHandler
2018-05-02 19:15:33,041 [kafka-monitor] DEBUG: Successfully loaded plugin plugins.stats_handler.StatsHandler
2018-05-02 19:15:33,041 [kafka-monitor] DEBUG: Trying to load plugin plugins.scraper_handler.ScraperHandler
2018-05-02 19:15:33,043 [kafka-monitor] DEBUG: Connected to Redis in ScraperHandler
2018-05-02 19:15:33,043 [kafka-monitor] DEBUG: Successfully loaded plugin plugins.scraper_handler.ScraperHandler
2018-05-02 19:15:33,043 [kafka-monitor] DEBUG: Trying to load plugin plugins.action_handler.ActionHandler
2018-05-02 19:15:33,045 [kafka-monitor] DEBUG: Connected to Redis in ActionHandler
2018-05-02 19:15:33,045 [kafka-monitor] DEBUG: Successfully loaded plugin plugins.action_handler.ActionHandler
2018-05-02 19:15:33,046 [kafka-monitor] DEBUG: Connected to Redis in StatsCollector Setup

Redis monitor:

2018-05-02 19:15:26,380 [redis-monitor] DEBUG: Logging to stdout
2018-05-02 19:15:26,384 [redis-monitor] DEBUG: Successfully connected to Redis
2018-05-02 19:15:26,385 [redis-monitor] DEBUG: Trying to load plugin plugins.stop_monitor.StopMonitor
2018-05-02 19:15:26,527 [redis-monitor] DEBUG: Creating new kafka producer using brokers: ['kafka:9092']
2018-05-02 19:15:26,633 [redis-monitor] ERROR: Couldn't initialize kafka producer in plugin.
2018-05-02 19:15:27,634 [redis-monitor] DEBUG: Creating new kafka producer using brokers: ['kafka:9092']
2018-05-02 19:15:27,737 [redis-monitor] ERROR: Couldn't initialize kafka producer in plugin.
2018-05-02 19:15:29,739 [redis-monitor] DEBUG: Creating new kafka producer using brokers: ['kafka:9092']
2018-05-02 19:15:30,944 [redis-monitor] DEBUG: Successfully connected to Kafka in StopMonitor
2018-05-02 19:15:30,945 [redis-monitor] DEBUG: Trying to load plugin plugins.stats_monitor.StatsMonitor
2018-05-02 19:15:30,951 [redis-monitor] DEBUG: Creating new kafka producer using brokers: ['kafka:9092']
2018-05-02 19:15:31,071 [redis-monitor] DEBUG: Successfully connected to Kafka in StatsMonitor
2018-05-02 19:15:31,071 [redis-monitor] DEBUG: Trying to load plugin plugins.zookeeper_monitor.ZookeeperMonitor
2018-05-02 19:15:31,106 [redis-monitor] DEBUG: Creating new kafka producer using brokers: ['kafka:9092']
2018-05-02 19:15:31,219 [redis-monitor] DEBUG: Successfully connected to Kafka in ZookeeperMonitor
2018-05-02 19:15:31,246 [redis-monitor] DEBUG: Trying to load plugin plugins.info_monitor.InfoMonitor
2018-05-02 19:15:31,249 [redis-monitor] DEBUG: Creating new kafka producer using brokers: ['kafka:9092']
2018-05-02 19:15:31,360 [redis-monitor] DEBUG: Successfully connected to Kafka in InfoMonitor
2018-05-02 19:15:31,361 [redis-monitor] DEBUG: Trying to load plugin plugins.expire_monitor.ExpireMonitor
2018-05-02 19:15:31,361 [redis-monitor] DEBUG: Creating new kafka producer using brokers: ['kafka:9092']
2018-05-02 19:15:31,471 [redis-monitor] DEBUG: Successfully connected to Kafka in ExpireMonitor

Yet, all the tests pass. Here are the results I get in the Kafka Monitor:

root@f3bad221b240:/usr/src/app# ./run_docker_tests.sh
test_load_plugins (tests.test_kafka_monitor.TestKafkaMonitor) ... ok
test_load_stats_plugins (tests.test_kafka_monitor.TestKafkaMonitor) ... ok
test_load_stats_total (tests.test_kafka_monitor.TestKafkaMonitor) ... ok
test_process_messages (tests.test_kafka_monitor.TestKafkaMonitor) ... ok
test_action_handler (tests.test_plugins.TestPlugins) ... ok
test_bad_plugins (tests.test_plugins.TestPlugins) ... ok
test_default_handler (tests.test_plugins.TestPlugins) ... ok
test_scrape_handler (tests.test_plugins.TestPlugins) ... ok
test_stats_handler (tests.test_plugins.TestPlugins) ... ok
test_zookeeper_handler (tests.test_plugins.TestPlugins) ... ok

----------------------------------------------------------------------
Ran 10 tests in 0.710s

OK
test_feed (__main__.TestKafkaMonitor) ... 2018-05-02 18:53:25,052 [kafka-monitor] INFO: Feeding JSON into demo.incoming_test
{
    "action": "info",
    "spiderid": "link",
    "uuid": "mytestid",
    "appid": "testapp"
}
2018-05-02 18:53:25,083 [kafka-monitor] INFO: Successfully fed item to Kafka
ok
test_run (__main__.TestKafkaMonitor) ... ok

----------------------------------------------------------------------
Ran 2 tests in 12.831s

OK

Here are results I get in the Redis Monitor:

root@99c21f593d96:/usr/src/app# ./run_docker_tests.sh
test_bad_plugins (tests.test_plugins.TestBasePlugins) ... ok
test_default_monitor (tests.test_plugins.TestBasePlugins) ... ok
test_expire_monitor_time (tests.test_plugins.TestExpirePlugin) ... ok
test_stop_regex (tests.test_plugins.TestExpirePlugin) ... ok
test_info_get_appid (tests.test_plugins.TestInfoPlugin) ... ok
test_info_get_bin (tests.test_plugins.TestInfoPlugin) ... ok
test_info_get_crawlid (tests.test_plugins.TestInfoPlugin) ... ok
test_info_regex (tests.test_plugins.TestInfoPlugin) ... ok
test_stats_get_machine (tests.test_plugins.TestStatsPlugin) ... ok
test_stats_get_plugin (tests.test_plugins.TestStatsPlugin) ... ok
test_stats_get_queue (tests.test_plugins.TestStatsPlugin) ... ok
test_stats_get_spider (tests.test_plugins.TestStatsPlugin) ... ok
test_stats_handle (tests.test_plugins.TestStatsPlugin) ... ok
test_stats_regex (tests.test_plugins.TestStatsPlugin) ... ok
test_stop_monitor_mini_purge (tests.test_plugins.TestStopPlugin) ... ok
test_stop_regex (tests.test_plugins.TestStopPlugin) ... ok
test_zk_errors (tests.test_plugins.TestZookeeperPlugin) ... ok
test_zk_handle_br (tests.test_plugins.TestZookeeperPlugin) ... ok
test_zk_handle_bu (tests.test_plugins.TestZookeeperPlugin) ... ok
test_zk_handle_dr (tests.test_plugins.TestZookeeperPlugin) ... ok
test_zk_handle_du (tests.test_plugins.TestZookeeperPlugin) ... ok
test_zk_regex (tests.test_plugins.TestZookeeperPlugin) ... ok
test_active_plugins (tests.test_redis_monitor.TestRedisMonitor) ... ok
test_get_fail_key (tests.test_redis_monitor.TestRedisMonitor) ... ok
test_load_plugins (tests.test_redis_monitor.TestRedisMonitor) ... ok
test_load_stats_plugins (tests.test_redis_monitor.TestRedisMonitor) ... ok
test_locking_actions (tests.test_redis_monitor.TestRedisMonitor) ... ok
test_main_loop (tests.test_redis_monitor.TestRedisMonitor) ... ok
test_precondition (tests.test_redis_monitor.TestRedisMonitor) ... ok
test_process_failures (tests.test_redis_monitor.TestRedisMonitor) ... ok
test_report_self (tests.test_redis_monitor.TestRedisMonitor) ... ok

----------------------------------------------------------------------
Ran 31 tests in 0.209s

OK
test_process_item (__main__.TestRedisMonitor) ... No handlers could be found for logger "redis_lock"
ok

----------------------------------------------------------------------
Ran 1 test in 16.616s

OK
madisonb commented 6 years ago

Those logs all make sense; Kafka takes maybe 10-15 seconds to boot in a container, and if you bring everything up with `up -d` there will be some connection errors for the first couple of seconds.
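If you want to sidestep those first-boot errors entirely, you can block until the broker port answers before feeding requests. A small sketch; the `kafka`/`9092` values match the compose file here, but the helper itself is illustrative, not part of the project:

```python
import socket
import time

def wait_for_broker(host, port, timeout=60.0, interval=2.0):
    """Block until a TCP connection to host:port succeeds, or raise
    TimeoutError. Kafka typically needs 10-15 seconds after
    `docker-compose up -d` before it accepts connections."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return
        except OSError:
            if time.monotonic() >= deadline:
                raise TimeoutError("%s:%s not reachable" % (host, port))
            time.sleep(interval)
```

For example, `wait_for_broker('kafka', 9092)` before running `kafka_monitor.py feed`.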

Given that the integration tests pass in your existing setup, to me it looks like everything should be able to talk to each other.

If you have the debug logs turned on, I would walk through the following flow

  1. Send the kafka monitor info action to the cluster (step 6 above)
  2. Use the kafkadump.py script to verify the message was actually written to Kafka
  3. Check the KM logs to verify it received the request; you should see a pass/fail here. If the schema fails validation, the data will not go on (so double-check the request you sent).
  4. If the data is valid, you should see a key pop up in Redis; this is what the redis monitor looks for. The key is info:<spiderid>:<appid>:<crawlid>
  5. Verify in the redis monitor logs that it sees the key and begins to act on it
  6. The plugin should then send a message to the demo.outbound_firehose topic once it is done processing. Verify the logs say the data was sent.
  7. Verify the topic from step 6 via kafkadump.py to ensure everything is working.
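Step 4 can be checked straight from the Redis container. The container name below matches this compose project, and the key fields come from the info request above; adjust both for your setup:

```shell
# Look for the key the redis-monitor picks up
docker exec -it scrapycluster121_redis_1 redis-cli KEYS 'info:*'

# Or fetch the exact key for the request above; the redis-monitor
# deletes it once processed, so it may be short-lived
docker exec -it scrapycluster121_redis_1 \
    redis-cli GET info:link:testapp:abc1234
```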

Hopefully that helps track your issue down.

cliff-km commented 6 years ago

All of that checks out except (inconsistently) the final output. With LOG_LEVEL at DEBUG just now, everything worked out of the gate after missing the first attempt. That is, the output missed the first action or was lagging behind by one. However, after messing with this all day, I've noticed it often doesn't start working for a long time (see below).

Input:

> docker exec -i scrapycluster121_kafka_monitor_1 python kafka_monitor.py feed '{"action":"info", "appid":"testapp", "uuid":"someuuid", "crawlid":"abc1234", "spiderid":"link"}'
2018-05-02 23:07:09,848 [kafka-monitor] DEBUG: Logging to stdout
2018-05-02 23:07:09,848 [kafka-monitor] DEBUG: Creating new kafka producer using brokers: ['kafka:9092']
2018-05-02 23:07:09,958 [kafka-monitor] INFO: Feeding JSON into demo.incoming
{
    "action": "info",
    "crawlid": "abc1234",
    "spiderid": "link",
    "uuid": "someuuid",
    "appid": "testapp"
}
2018-05-02 23:07:10,025 [kafka-monitor] INFO: Successfully fed item to Kafka
> docker exec -i scrapycluster121_kafka_monitor_1 python kafka_monitor.py feed '{"action":"info", "appid":"testapp", "uuid":"someuuid", "crawlid":"abc1234", "spiderid":"link"}'
2018-05-02 23:07:16,122 [kafka-monitor] DEBUG: Logging to stdout
2018-05-02 23:07:16,122 [kafka-monitor] DEBUG: Creating new kafka producer using brokers: ['kafka:9092']
2018-05-02 23:07:16,233 [kafka-monitor] INFO: Feeding JSON into demo.incoming
{
    "action": "info",
    "crawlid": "abc1234",
    "spiderid": "link",
    "uuid": "someuuid",
    "appid": "testapp"
}
2018-05-02 23:07:16,245 [kafka-monitor] INFO: Successfully fed item to Kafka
> docker exec -i scrapycluster121_kafka_monitor_1 python kafka_monitor.py feed '{"action":"info", "appid":"testapp", "uuid":"someuuid", "crawlid":"abc1234", "spiderid":"link"}'
2018-05-02 23:07:30,739 [kafka-monitor] DEBUG: Logging to stdout
2018-05-02 23:07:30,739 [kafka-monitor] DEBUG: Creating new kafka producer using brokers: ['kafka:9092']
2018-05-02 23:07:30,850 [kafka-monitor] INFO: Feeding JSON into demo.incoming
{
    "action": "info",
    "crawlid": "abc1234",
    "spiderid": "link",
    "uuid": "someuuid",
    "appid": "testapp"
}
2018-05-02 23:07:30,877 [kafka-monitor] INFO: Successfully fed item to Kafka
> docker exec -i scrapycluster121_kafka_monitor_1 python kafka_monitor.py feed '{"action":"info", "appid":"testapp", "uuid":"someuuid", "crawlid":"abc1234", "spiderid":"link"}'
2018-05-02 23:07:33,006 [kafka-monitor] DEBUG: Logging to stdout
2018-05-02 23:07:33,006 [kafka-monitor] DEBUG: Creating new kafka producer using brokers: ['kafka:9092']
2018-05-02 23:07:33,118 [kafka-monitor] INFO: Feeding JSON into demo.incoming
{
    "action": "info",
    "crawlid": "abc1234",
    "spiderid": "link",
    "uuid": "someuuid",
    "appid": "testapp"
}
2018-05-02 23:07:33,127 [kafka-monitor] INFO: Successfully fed item to Kafka

Incoming:

> docker exec -i scrapycluster121_kafka_monitor_1 python kafkadump.py dump -t demo.incoming
2018-05-02 23:06:55,918 [kafkadump] DEBUG: Logging to stdout
2018-05-02 23:06:55,918 [kafkadump] DEBUG: Running dump command
2018-05-02 23:06:55,918 [kafkadump] DEBUG: Getting Kafka consumer
2018-05-02 23:07:10,013 [kafkadump] DEBUG: Received message
{u'action': u'info', u'appid': u'testapp', u'spiderid': u'link', u'crawlid': u'abc1234', u'uuid': u'someuuid'}
2018-05-02 23:07:16,239 [kafkadump] DEBUG: Received message
{u'action': u'info', u'appid': u'testapp', u'spiderid': u'link', u'crawlid': u'abc1234', u'uuid': u'someuuid'}
2018-05-02 23:07:30,864 [kafkadump] DEBUG: Received message
{u'action': u'info', u'appid': u'testapp', u'spiderid': u'link', u'crawlid': u'abc1234', u'uuid': u'someuuid'}
2018-05-02 23:07:33,124 [kafkadump] DEBUG: Received message

Outbound:

> docker exec -i scrapycluster121_kafka_monitor_1 python kafkadump.py dump -t demo.outbound_firehose
2018-05-02 23:06:52,587 [kafkadump] DEBUG: Logging to stdout
2018-05-02 23:06:52,587 [kafkadump] DEBUG: Running dump command
2018-05-02 23:06:52,587 [kafkadump] DEBUG: Getting Kafka consumer
2018-05-02 23:07:10,118 [kafkadump] DEBUG: Received message
{u'server_time': 1525302430, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
2018-05-02 23:07:16,321 [kafkadump] DEBUG: Received message
{u'server_time': 1525302436, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
2018-05-02 23:07:31,004 [kafkadump] DEBUG: Received message
{u'server_time': 1525302450, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
2018-05-02 23:07:33,232 [kafkadump] DEBUG: Received message

Redis monitor:

> docker exec -it scrapycluster121_redis_1 redis-cli monitor | grep "info:link:testapp:abc1234"
1525302430.028180 [0 172.18.0.6:48408] "SET" "info:link:testapp:abc1234" "someuuid"
1525302430.071071 [0 172.18.0.7:40058] "GET" "lock:info:link:testapp:abc1234"
1525302430.071462 [0 172.18.0.7:40058] "SET" "lock:info:link:testapp:abc1234" "\x83\x91\x12\xe7\xe5\x00Ue\xd8\x19\t\xd4\xb3\xc2\x85\x17" "EX" "6" "NX"
1525302430.072478 [0 172.18.0.7:40058] "GET" "info:link:testapp:abc1234"
1525302430.075588 [0 172.18.0.7:40058] "PFADD" "stats:redis-monitor:total:lifetime" "info:link:testapp:abc1234:someuuid1525302430.07"
1525302430.078017 [0 172.18.0.7:40058] "PFADD" "stats:redis-monitor:InfoMonitor:lifetime" "info:link:testapp:abc1234:someuuid1525302430.08"
1525302430.084827 [0 172.18.0.7:40058] "DEL" "info:link:testapp:abc1234"
1525302430.085241 [0 172.18.0.7:40058] "EXISTS" "lock:info:link:testapp:abc1234:failures"
1525302430.085411 [0 172.18.0.7:40058] "GET" "lock:info:link:testapp:abc1234"
1525302430.090082 [0 172.18.0.7:40058] "EVALSHA" "dc1e6ea4fc20d70a60aa3ab10780c696e74d541e" "2" "lock:info:link:testapp:abc1234" "lock-signal:info:link:testapp:abc1234" "\x83\x91\x12\xe7\xe5\x00Ue\xd8\x19\t\xd4\xb3\xc2\x85\x17"
1525302430.090451 [0 172.18.0.7:40058] "EVAL" "\n    if redis.call(\"get\", KEYS[1]) ~= ARGV[1] then\n        return 1\n    else\n        redis.call(\"del\", KEYS[2])\n        redis.call(\"lpush\", KEYS[2], 1)\n        redis.call(\"del\", KEYS[1])\n        return 0\n    end\n" "2" "lock:info:link:testapp:abc1234" "lock-signal:info:link:testapp:abc1234" "\x83\x91\x12\xe7\xe5\x00Ue\xd8\x19\t\xd4\xb3\xc2\x85\x17"
1525302430.090654 [0 lua] "get" "lock:info:link:testapp:abc1234"
1525302430.090662 [0 lua] "del" "lock-signal:info:link:testapp:abc1234"
1525302430.090674 [0 lua] "lpush" "lock-signal:info:link:testapp:abc1234" "1"
1525302430.090703 [0 lua] "del" "lock:info:link:testapp:abc1234"
1525302430.091003 [0 172.18.0.7:40058] "DEL" "lock-signal:info:link:testapp:abc1234"
1525302436.258436 [0 172.18.0.6:48408] "SET" "info:link:testapp:abc1234" "someuuid"
1525302436.279744 [0 172.18.0.7:40102] "GET" "lock:info:link:testapp:abc1234"
1525302436.279945 [0 172.18.0.7:40102] "SET" "lock:info:link:testapp:abc1234" "pN\xff\x89\x94\x97\xc8\xc1\xc8HJ\x1d\x13\x8a[E" "EX" "6" "NX"
1525302436.280729 [0 172.18.0.7:40102] "GET" "info:link:testapp:abc1234"
1525302436.283699 [0 172.18.0.7:40102] "PFADD" "stats:redis-monitor:total:lifetime" "info:link:testapp:abc1234:someuuid1525302436.28"
1525302436.286512 [0 172.18.0.7:40102] "PFADD" "stats:redis-monitor:InfoMonitor:lifetime" "info:link:testapp:abc1234:someuuid1525302436.28"
1525302436.289171 [0 172.18.0.7:40102] "DEL" "info:link:testapp:abc1234"
1525302436.289602 [0 172.18.0.7:40102] "EXISTS" "lock:info:link:testapp:abc1234:failures"
1525302436.289730 [0 172.18.0.7:40102] "GET" "lock:info:link:testapp:abc1234"
1525302436.290257 [0 172.18.0.7:40102] "EVALSHA" "dc1e6ea4fc20d70a60aa3ab10780c696e74d541e" "2" "lock:info:link:testapp:abc1234" "lock-signal:info:link:testapp:abc1234" "pN\xff\x89\x94\x97\xc8\xc1\xc8HJ\x1d\x13\x8a[E"
1525302436.290290 [0 lua] "get" "lock:info:link:testapp:abc1234"
1525302436.290301 [0 lua] "del" "lock-signal:info:link:testapp:abc1234"
1525302436.290312 [0 lua] "lpush" "lock-signal:info:link:testapp:abc1234" "1"
1525302436.290323 [0 lua] "del" "lock:info:link:testapp:abc1234"
1525302436.290747 [0 172.18.0.7:40102] "DEL" "lock-signal:info:link:testapp:abc1234"
1525302450.875736 [0 172.18.0.6:48408] "SET" "info:link:testapp:abc1234" "someuuid"
1525302450.967470 [0 172.18.0.7:40056] "GET" "lock:info:link:testapp:abc1234"
1525302450.967620 [0 172.18.0.7:40056] "SET" "lock:info:link:testapp:abc1234" "\xff\x86R\xe1\x1d\xd8L\xf76Ap:\x90\xef\xebL" "EX" "6" "NX"
1525302450.968269 [0 172.18.0.7:40056] "GET" "info:link:testapp:abc1234"
1525302450.970288 [0 172.18.0.7:40056] "PFADD" "stats:redis-monitor:total:lifetime" "info:link:testapp:abc1234:someuuid1525302450.97"
1525302450.972387 [0 172.18.0.7:40056] "PFADD" "stats:redis-monitor:InfoMonitor:lifetime" "info:link:testapp:abc1234:someuuid1525302450.97"
1525302450.974913 [0 172.18.0.7:40056] "DEL" "info:link:testapp:abc1234"
1525302450.975115 [0 172.18.0.7:40056] "EXISTS" "lock:info:link:testapp:abc1234:failures"
1525302450.975371 [0 172.18.0.7:40056] "GET" "lock:info:link:testapp:abc1234"
1525302450.983991 [0 172.18.0.7:40056] "EVALSHA" "dc1e6ea4fc20d70a60aa3ab10780c696e74d541e" "2" "lock:info:link:testapp:abc1234" "lock-signal:info:link:testapp:abc1234" "\xff\x86R\xe1\x1d\xd8L\xf76Ap:\x90\xef\xebL"
1525302450.984020 [0 lua] "get" "lock:info:link:testapp:abc1234"
1525302450.984027 [0 lua] "del" "lock-signal:info:link:testapp:abc1234"
1525302450.984034 [0 lua] "lpush" "lock-signal:info:link:testapp:abc1234" "1"
1525302450.984040 [0 lua] "del" "lock:info:link:testapp:abc1234"
1525302450.984173 [0 172.18.0.7:40056] "DEL" "lock-signal:info:link:testapp:abc1234"
1525302453.140384 [0 172.18.0.6:48408] "SET" "info:link:testapp:abc1234" "someuuid"
1525302453.197155 [0 172.18.0.7:40056] "GET" "lock:info:link:testapp:abc1234"
1525302453.197333 [0 172.18.0.7:40056] "SET" "lock:info:link:testapp:abc1234" "\x9a@k\a\xd2\x1f\xbd<\xb5\x80\xd7\x1d\x99\x1d\xc6\xc3" "EX" "6" "NX"
1525302453.197813 [0 172.18.0.7:40056] "GET" "info:link:testapp:abc1234"
1525302453.199627 [0 172.18.0.7:40056] "PFADD" "stats:redis-monitor:total:lifetime" "info:link:testapp:abc1234:someuuid1525302453.2"
1525302453.201712 [0 172.18.0.7:40056] "PFADD" "stats:redis-monitor:InfoMonitor:lifetime" "info:link:testapp:abc1234:someuuid1525302453.2"
1525302453.204274 [0 172.18.0.7:40056] "DEL" "info:link:testapp:abc1234"
1525302453.204523 [0 172.18.0.7:40056] "EXISTS" "lock:info:link:testapp:abc1234:failures"
1525302453.204734 [0 172.18.0.7:40056] "GET" "lock:info:link:testapp:abc1234"
1525302453.205492 [0 172.18.0.7:40056] "EVALSHA" "dc1e6ea4fc20d70a60aa3ab10780c696e74d541e" "2" "lock:info:link:testapp:abc1234" "lock-signal:info:link:testapp:abc1234" "\x9a@k\a\xd2\x1f\xbd<\xb5\x80\xd7\x1d\x99\x1d\xc6\xc3"
1525302453.205534 [0 lua] "get" "lock:info:link:testapp:abc1234"
1525302453.205549 [0 lua] "del" "lock-signal:info:link:testapp:abc1234"
1525302453.205571 [0 lua] "lpush" "lock-signal:info:link:testapp:abc1234" "1"
1525302453.205581 [0 lua] "del" "lock:info:link:testapp:abc1234"
1525302453.205798 [0 172.18.0.7:40056] "DEL" "lock-signal:info:link:testapp:abc1234"

This seems like a bit of a heisenbug. I've noticed it takes a lot longer with LOG_LEVEL at the default (INFO). I just tried for 20 minutes so I could put a number on the delay. Outbound dumped some of its logs (all at once) at about the 20-minute mark, and I wasn't willing to keep waiting to see what incoming would do:

First and last request:

> docker exec -i scrapycluster121_kafka_monitor_1 python kafka_monitor.py feed '{"action":"info", "appid":"testapp", "uuid":"someuuid", "crawlid":"abc1234", "spiderid":"link"}'
2018-05-02 22:38:38,059 [kafka-monitor] INFO: Feeding JSON into demo.incoming
{
    "action": "info",
    "crawlid": "abc1234",
    "spiderid": "link",
    "uuid": "someuuid",
    "appid": "testapp"
}
2018-05-02 22:38:38,099 [kafka-monitor] INFO: Successfully fed item to Kafka
...
> docker exec -i scrapycluster121_kafka_monitor_1 python kafka_monitor.py feed '{"action":"info", "appid":"testapp", "uuid":"someuuid", "crawlid":"abc1234", "spiderid":"link"}'
2018-05-02 22:59:36,448 [kafka-monitor] INFO: Feeding JSON into demo.incoming
{
    "action": "info",
    "crawlid": "abc1234",
    "spiderid": "link",
    "uuid": "someuuid",
    "appid": "testapp"
}
2018-05-02 22:59:36,459 [kafka-monitor] INFO: Successfully fed item to Kafka

Suddenly at ~20 minutes:

{u'server_time': 1525300718, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525300723, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525300789, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525300792, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525300956, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525300961, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525301113, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525301117, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525301117, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525301203, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525301205, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525301206, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525301252, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525301254, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525301255, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525301475, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525301479, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525301541, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525301542, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525301571, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525301634, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525301757, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 1525301794, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
{u'server_time': 15253017

*Note that the last output is cut off.

Incoming still empty:

> docker exec -i scrapycluster121_kafka_monitor_1 python kafkadump.py dump -t demo.incoming

I'll try to reproduce this delay with DEBUG on to be sure it's not entirely a LOG_LEVEL-related issue. I'm fairly certain I've seen it with DEBUG on.

madisonb commented 6 years ago

Keep in mind that Redis != Redis Monitor; they are two different applications and do very different things. It looks like you either have corrupted topics within your Kafka cluster, throttled containers that are running inconsistently, an incorrect consumer implementation, or some other issue to be determined.

For kicks, on the dev branch you can use the latest kafka container version; you just need to add KAFKA_ADVERTISED_HOST_NAME: kafka, like so:

  kafka:
    image: wurstmeister/kafka
    ports:
      - "9092:9092"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_ADVERTISED_HOST_NAME: kafka
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    depends_on:
      - zookeeper
    restart: always

When I run docker-compose up -d I don't see any issues, nor do I experience your kafka problem. Your kafka topic at the end there shows that the Redis Monitor (not Redis) was servicing requests from Wednesday, May 2, 2018 10:38:38 PM to Wednesday, May 2, 2018 10:56:34 PM GMT, which makes sense and is roughly 20 minutes.

I would suspect that your Kafka consumer group, or the Kafka node itself, is not working right for some reason or another. Perhaps you should look at the kafka container logs; however, debugging Kafka is not something this repo can help with - it just uses Kafka as the data bus.
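To inspect the Kafka side, the CLI tools bundled with the wurstmeister image can list topics and consumer groups. The script paths and container name below are assumptions about that image; adjust them as needed:

```shell
# List topics -- demo.incoming, demo.outbound_firehose, etc. should exist
docker exec -it scrapycluster121_kafka_1 \
    /opt/kafka/bin/kafka-topics.sh --zookeeper zookeeper:2181 --list

# List consumer groups to see whether the dump consumers registered
docker exec -it scrapycluster121_kafka_1 \
    /opt/kafka/bin/kafka-consumer-groups.sh \
    --bootstrap-server kafka:9092 --list
```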

cliff-km commented 6 years ago

I just spun this up on a whole new machine on a different service and hit the same issue. So I guess at this point what I'm telling you is that a new user who comes to your current docs and tries the Docker quick start instructions is going to 1. run into the Kafka image version issue and, if they fall back to what seems to be the right earlier image, 2. run into this issue. I don't know what else to say at this point.

madisonb commented 6 years ago

You mentioned earlier that the integration tests work fine - what Kafka consumer script are you using? The kafkadump.py script? The cluster seems to be able to communicate with itself, yet your consumer shows no logs, then randomly spits out data 20 minutes later. Everything you have provided so far shows that the data is being sent to Kafka (note that you haven't provided the redis monitor logs from your issue), and the redis commands above and the kafka output show it is operating on them in due time.

Here is my series of commands from the master branch of the repo, with the container versions pinned and the hostname fix I mentioned prior.

madisonb@macbook scrapy-cluster [master] $ git pull
Already up-to-date.

madisonb@macbook scrapy-cluster [master] $ git diff
diff --git a/docker-compose.yml b/docker-compose.yml
index b6855b1..cf36181 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -30,22 +30,23 @@ services:
     ports:
       - "5343:5343"
   redis:
-    image: redis
+    image: redis:4.0.9
     ports:
       - "6379"
     restart: always
   zookeeper:
-    image: zookeeper
+    image: zookeeper:3.4.11
     ports:
       - "2181:2181"
     restart: always
   kafka:
-    image: wurstmeister/kafka
+    image: wurstmeister/kafka:1.1.0
     ports:
       - "9092:9092"
     environment:
       KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
       KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
+      KAFKA_ADVERTISED_HOST_NAME: kafka
     volumes:
       - /var/run/docker.sock:/var/run/docker.sock
     depends_on:
# ends there

madisonb@macbook scrapy-cluster [master] $  docker-compose up -d redis kafka redis_monitor kafka_monitor rest
# stands up fine

madisonb@macbook scrapy-cluster [master] $ docker-compose ps
            Name                           Command              State               Ports
-----------------------------------------------------------------------------------------------------
scrapycluster_kafka_1           start-kafka.sh                  Up      0.0.0.0:9092->9092/tcp
scrapycluster_kafka_monitor_1   python kafka_monitor.py run     Up
scrapycluster_redis_1           docker-entrypoint.sh redis      Up      0.0.0.0:32768->6379/tcp
                                ...
scrapycluster_redis_monitor_1   python redis_monitor.py         Up
scrapycluster_rest_1            python rest_service.py          Up      0.0.0.0:5343->5343/tcp
scrapycluster_zookeeper_1       /docker-entrypoint.sh zkSe      Up      0.0.0.0:2181->2181/tcp,
                                ...                                     2888/tcp, 3888/tcp

madisonb@macbook scrapy-cluster [master] $ curl localhost:5343
{
  "kafka_connected": true,
  "my_id": "a912864756f7",
  "node_health": "GREEN",
  "redis_connected": true,
  "uptime_sec": 45
}

madisonb@macbook scrapy-cluster [master] $ curl localhost:5343/feed -H "Content-Type: application/json" -d '{"action":"info", "appid":"testapp", "uuid":"someuuid", "crawlid":"abc1234", "spiderid":"link"}'
{
  "data": {
    "appid": "testapp",
    "crawlid": "abc1234",
    "domains": {},
    "server_time": 1525365341,
    "spiderid": "link",
    "total_domains": 0,
    "total_pending": 0,
    "uuid": "someuuid"
  },
  "error": null,
  "status": "SUCCESS"
}

The last command shows the info request round-trip working.

This is a more advanced form of your issue above, and seems to be working fine. Can you apply my git diff to your compose file and report the results? Here are the relevant logs to compare yours against (using a vanilla setup, no json)

Rest:

madisonb@macbook scrapy-cluster [master] $ docker-compose logs rest
Attaching to scrapycluster_rest_1
...
rest_1           | 2018-05-03 16:33:16,628 [rest-service] INFO: Connected successfully to Kafka
rest_1           | 2018-05-03 16:33:58,475 [rest-service] INFO: 'index' endpoint called
rest_1           | 172.21.0.1 - - [03/May/2018 16:33:58] "GET / HTTP/1.1" 200 -
rest_1           | 2018-05-03 16:35:41,349 [rest-service] INFO: 'feed' endpoint called
rest_1           | 172.21.0.1 - - [03/May/2018 16:35:41] "POST /feed HTTP/1.1" 200 -

Kafka Monitor:

...
kafka_monitor_1  | 2018-05-03 16:35:41,392 [kafka-monitor] INFO: Added action to Redis
...

Redis Monitor:

...
redis_monitor_1  | 2018-05-03 16:35:41,477 [redis-monitor] INFO: Received info request
redis_monitor_1  | 2018-05-03 16:35:41,480 [redis-monitor] INFO: Sent info to kafka
...

Notice that this whole chain of events happens within 1 second, requiring 3 different apps to work, and Kafka, Redis, and Zookeeper all to be running. All of this, on my box, indicates a healthy setup and that all the containers are working as intended.

I am happy to add the fix to the master branch (which is to add the kafka hostname to the docker-compose file), but I suspect if you cannot replicate the above steps, I am not sure how much help I can provide unless you give me further steps to replicate.

cliff-km commented 6 years ago

OK, I set up a whole new machine and tried your diff. The process you show works as expected. And to be clear, I'm fairly certain the rest endpoint will work as expected, as I had it set up before.

However, I still get no output from kafkadump.py for demo.outbound_firehose and demo.incoming. That is the original issue I submitted, and it still exists. It seems to me that you expect users to use the rest endpoint. That's a fair expectation, since in the end that's what I would use.

Yet the quickstart guide's instructions show use of kafkadump.py. If kafkadump.py isn't expected to be used and doesn't work consistently, maybe the guide should be changed to use only the rest endpoint.

madisonb commented 6 years ago

So you have a number of issues you are trying to solve simultaneously. The rest service uses Kafka behind the scenes, and submits the request you feed it into the topic. We can do the same thing with our current setup to submit and check on crawls. The rest service is not included in the quickstart guide on purpose; my point here is that you have a working cluster, because the rest service requires more components to be working than what the quickstart needs.

With that, let's look at the crawling setup:

madisonb@macbook scrapy-cluster [master] $ docker-compose up -d redis kafka redis_monitor kafka_monitor rest crawler
# at this point you can just run `up -d`

From here, let's test the crawling setup:

Terminal 1

madisonb@macbook scrapy-cluster [master] $ docker-compose exec kafka_monitor python kafkadump.py dump -t demo.crawled_firehose -p
# let this run

Terminal 2

madisonb@macbook scrapy-cluster [master] $ docker-compose exec kafka_monitor python kafkadump.py dump -t demo.incoming -p
# let this run

Terminal 3

madisonb@macbook scrapy-cluster [master] $ docker-compose exec kafka_monitor python kafka_monitor.py feed '{"url": "http://dmoztools.net", "appid":"testapp", "crawlid":"abc1234", "maxdepth":1}'
2018-05-03 19:54:49,731 [kafka-monitor] INFO: Feeding JSON into demo.incoming
{
    "url": "http://dmoztools.net",
    "maxdepth": 1,
    "crawlid": "abc1234",
    "appid": "testapp"
}
2018-05-03 19:54:49,741 [kafka-monitor] INFO: Successfully fed item to Kafka

Immediately, you see in Terminal 2

{
    "url": "http://dmoztools.net",
    "appid": "testapp",
    "maxdepth": 1,
    "crawlid": "abc1234"
}

In Terminal 1, once the spider is done crawling

{
    "body": "#truncated",
    "crawlid": "abc1234",
    "links": [
        {
            "url": "http://dmoztools.net/",
            "text": "DMOZ"
        },
        {
            "url": "https://www.resource-zone.com/",
            "text": "resource-zone"
        },
       # truncated
    ],
    "response_url": "http://dmoztools.net",
    "url": "http://dmoztools.net",
    "status_code": 200,
    "status_msg": "OK",
    "request_headers": {
        "Accept-Language": [
            "en"
        ],
        "Accept-Encoding": [
            "gzip,deflate"
        ],
        "Accept": [
            "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
        ],
        "User-Agent": [
            "Scrapy/1.3.3 (+http://scrapy.org)"
        ]
    },
    "response_headers": {
        "Via": [
            "1.1 cdb65f6a8787eeb98b9ec160b614ee4a.cloudfront.net (CloudFront)"
        ],
          # truncated
    },
    "timestamp": "2018-05-03T19:54:57.232421",
    "attrs": null,
    "appid": "testapp"
}
# ... truncated, lots more documents

This completes validation that the cluster can crawl.

In a similar manner, you can stop and reuse terminal 1 and change the command to

$ docker-compose exec kafka_monitor python kafkadump.py dump -t demo.outbound_firehose -p
# let this run

And in Terminal 3 feed

madisonb@macbook scrapy-cluster [master] $ docker-compose exec kafka_monitor python kafka_monitor.py feed '{"action":"info", "appid":"testapp", "uuid":"someuuid", "crawlid":"abc1234", "spiderid":"link"}'
2018-05-03 19:59:38,588 [kafka-monitor] INFO: Feeding JSON into demo.incoming
{
    "action": "info",
    "crawlid": "abc1234",
    "spiderid": "link",
    "uuid": "someuuid",
    "appid": "testapp"
}
2018-05-03 19:59:38,600 [kafka-monitor] INFO: Successfully fed item to Kafka

And you get in Terminal 2

{
    "action": "info",
    "appid": "testapp",
    "spiderid": "link",
    "crawlid": "abc1234",
    "uuid": "someuuid"
}

And in Terminal 1 (now dumping demo.outbound_firehose)

{
    "server_time": 1525377578,
    "crawlid": "abc1234",
    "total_pending": 45,
    "total_domains": 1,
    "spiderid": "link",
    "appid": "testapp",
    "domains": {
        "dmoztools.net": {
            "low_priority": -9,
            "high_priority": -9,
            "total": 45
        }
    },
    "uuid": "someuuid"
}

This completes validation that the cluster can accept info/action style requests.

If you would like, you can also look at the logs using docker-compose logs -f XXX. At this point I have demonstrated that, with the additional environment variable for Kafka from earlier, the 1.2.1 cluster operates as expected and can work with the requests you posted in your original post.

This is a complex project with lots of moving parts, so hopefully you can replicate the above and get back on track with actually running your crawls, rather than debugging configuration issues.

cliff-km commented 6 years ago

Good news: I found the missing output issue. It does exist, but it may be in docker rather than your cluster.

root@scrapy-test:~/scrapy-cluster# docker exec scrapycluster_kafka_monitor_1 python kafkadump.py dump -t demo.incoming -p
# nothing
root@scrapy-test:~/scrapy-cluster# docker-compose exec kafka_monitor python kafkadump.py dump -t demo.incoming -p
{
    "url": "http://dmoztools.net",
    "appid": "testapp",
    "maxdepth": 1,
    "crawlid": "abc1234"
}

As far as I understand those should be roughly equivalent. Clearly they aren't in this case.

This works out the same way for both incoming and outbound. If it were in docker/docker-compose generally, though, I'd expect it also wouldn't work for the crawled firehose, so it may be something with how the output is being written. This difference has been consistently reproducible for me: I can run it with docker-compose exec and it works 100% of the time, then I switch to docker exec and it fails 100% of the time.

In case it's useful:

root@scrapy-test:~/scrapy-cluster# docker version
Client:
 Version:      1.12.6-cs13
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   0ee24d4
 Built:        Mon Jul 24 18:27:43 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.6-cs13
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   0ee24d4
 Built:        Mon Jul 24 18:27:43 2017
 OS/Arch:      linux/amd64
root@scrapy-test:~/scrapy-cluster# docker-compose version
docker-compose version 1.18.0, build 8dd22a9
docker-py version: 2.6.1
CPython version: 2.7.13
OpenSSL version: OpenSSL 1.0.1t  3 May 2016
madisonb commented 6 years ago

I can replicate the issue with docker exec running your same command; however, I think we need the -it flag. The following works for me:

madisonb@macbook scrapy-cluster [master] $ docker exec -it scrapycluster_kafka_monitor_1 python kafkadump.py dump -t demo.incoming -p
{
    "action": "info",
    "appid": "testapp",
    "spiderid": "link",
    "crawlid": "abc1234",
    "uuid": "someuuid"
}

This is something I normally use when I run docker containers on the fly, but I had forgotten about it since I almost always use compose. According to the help for docker exec:

madisonb@macbook scrapy-cluster [master] $ docker exec --help

Usage:  docker exec [OPTIONS] CONTAINER COMMAND [ARG...]

Run a command in a running container

Options:
  -d, --detach               Detached mode: run command in the background
      --detach-keys string   Override the key sequence for detaching a container
  -e, --env list             Set environment variables
  -i, --interactive          Keep STDIN open even if not attached
      --privileged           Give extended privileges to the command
  -t, --tty                  Allocate a pseudo-TTY
  -u, --user string          Username or UID (format: <name|uid>[:<group|gid>])
  -w, --workdir string       Working directory inside the container

So my guess is that docker exec doesn't keep STDIN open or allocate a pseudo-TTY by default, hence no data coming through (both appear to be required). Note that docker-compose exec allocates a TTY by default, which would explain the difference you saw between the two commands.
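One likely mechanism (an assumption on my part, not verified against kafkadump.py itself): when stdout is not a TTY, Python switches from line buffering to block buffering, so printed output can sit in the buffer indefinitely instead of streaming out. A minimal sketch showing how a child process detects this:

```python
import subprocess
import sys
import textwrap

# Child script: report whether stdout is a TTY. When run through a pipe
# (similar to `docker exec -i` without `-t`), isatty() returns False and
# Python falls back to block buffering instead of line buffering.
child = textwrap.dedent("""
    import sys
    print("tty" if sys.stdout.isatty() else "pipe")
    sys.stdout.flush()  # explicit flush so output escapes the block buffer
""")

result = subprocess.run(
    [sys.executable, "-c", child],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # → pipe
```

Running the dumper with an unbuffered interpreter (`python -u`) or forcing flushes would be another way around this, independent of the `-t` flag.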

If this resolves your issues once and for all, please close the ticket. I will make a note here to review the docker quickstart so this is less confusing for new users.

cliff-km commented 6 years ago

This does fix it; I think the key here is the TTY. I was originally using -i, which I thought would be fine, especially since it works for crawled_firehose.

As a side note, I feel like if you'd tried my steps as written from the beginning, we would have saved ourselves a lot of time here. When trying to reproduce issues, being exact down to the letter is important.

arshadolx commented 5 years ago

Hi,

I am facing the same issue in a cluster setup.

(venv) root@kafka:/home/osiuser/scrapy-cluster/redis-monitor# python tests/online.py -v
test_process_item (__main__.TestRedisMonitor) ... ok

Ran 1 test in 16.296s

OK

I can run the link spider, but custom crawlers are not running.

(venv) root@kafka:/home/osiuser/scrapy-cluster/kafka-monitor# python kafka_monitor.py feed '{"action":"info", "appid":"testapp", "uuid":"someuuid", "crawlid":"abc1234", "spiderid":"link"}'
2018-11-06 17:59:42,313 [kafka-monitor] DEBUG: Logging to stdout
2018-11-06 17:59:42,313 [kafka-monitor] DEBUG: Creating new kafka producer using brokers: localhost:9092
2018-11-06 17:59:42,419 [kafka-monitor] INFO: Feeding JSON into demo.incoming
{
    "action": "info",
    "crawlid": "abc1234",
    "spiderid": "link",
    "uuid": "someuuid",
    "appid": "testapp"
}
2018-11-06 17:59:42,655 [kafka-monitor] INFO: Successfully fed item to Kafka

(venv) root@kafka:/home/osiuser/scrapy-cluster/crawler# scrapy runspider crawling/spiders/test_spider.py
2018-11-06 17:58:12,724 [sc-crawler] INFO: Changed Public IP: None -> b'111.93.8.98'

Thanks, Arshad

madisonb commented 5 years ago

Custom crawlers will need additional unit/integration tests to run the way the test_spider.py test does. If you do not want to write unit or integration tests to ensure things are working as expected, you can feed your crawler the normal way via the kafka monitor API and watch the resulting logs.
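For reference, the feed requests used throughout this thread are plain JSON objects. A minimal sketch of assembling one in Python before handing it to the kafka monitor (the build_crawl_request helper is hypothetical, but the field names match the feed examples above):

```python
import json


def build_crawl_request(url, appid, crawlid, maxdepth=1, spiderid=None):
    """Assemble a crawl request in the shape fed to kafka_monitor.py."""
    request = {
        "url": url,
        "appid": appid,
        "crawlid": crawlid,
        "maxdepth": maxdepth,
    }
    if spiderid is not None:
        # Target a specific spider, e.g. a custom crawler's name,
        # instead of the default link spider.
        request["spiderid"] = spiderid
    return json.dumps(request)


payload = build_crawl_request("http://dmoztools.net", "testapp", "abc1234")
print(payload)
```

The resulting string is what would be passed as the argument to `python kafka_monitor.py feed '<payload>'`, as shown in the commands earlier in this thread.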