istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

[Middleware Settings] Integrate scrapy-splash into scrapy-cluster #94

Closed: hustshawn closed this issue 7 years ago

hustshawn commented 7 years ago

In a single-node Scrapy project, the settings below, as the scrapy-splash documentation indicates, work well.

# ====== Splash settings ======
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

However, when I integrate with scrapy-cluster using the settings below, requests made with SplashRequest never seem to reach Splash, so Splash does not respond. Splash itself works fine when I access it directly with a URL constructed for the render.html endpoint.

SPIDER_MIDDLEWARES = {
    # disable built-in DepthMiddleware, since we do our own
    # depth management per crawl request
    'scrapy.spidermiddlewares.depth.DepthMiddleware': None,
    'crawling.meta_passthrough_middleware.MetaPassthroughMiddleware': 100,
    'crawling.redis_stats_middleware.RedisStatsMiddleware': 105,
    # The original priority of 100 conflicts with MetaPassthroughMiddleware, so it is changed to 101
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 101,
}

DOWNLOADER_MIDDLEWARES = {
    # Handle timeout retries with the redis scheduler and logger
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'crawling.redis_retry_middleware.RedisRetryMiddleware': 510,
    # exceptions processed in reverse order
    'crawling.log_retry_middleware.LogRetryMiddleware': 520,
    # custom cookies to not persist across crawl requests
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
    # 'crawling.custom_cookies.CustomCookiesMiddleware': 700,
    # Scrapy-splash DOWNLOADER_MIDDLEWARES
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Scrapy-splash settings
SPLASH_URL = 'scrapy_splash:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Does anyone know what is going wrong with these settings?

madisonb commented 7 years ago

I don't have in-depth knowledge of the scrapy-splash integration, but I do see you are rolling your own DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter', which is different from the one Scrapy Cluster comes with. This may be your issue, since you are overriding a built-in filter.

Also, upon looking at your settings, I hope this is only a subset of your settings file, given that the Scrapy Cluster settings file is much larger than what you have displayed here and requires a lot more configuration.

I would point you to the offline testing documentation here (to see if everything is functioning). Run those tests and make sure everything passes. If they all pass, you can move on to your integration tests here, which test each component's functionality given a working Scrapy Cluster setup. If those all pass, your cluster appears to be working as expected.
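(On a standard checkout, the offline tests can typically be run with ./run_offline_tests.sh from the repository root, and the online tests with ./run_online_tests.sh.)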

Lastly, given that this is a custom configuration issue, I point you to the documentation here, which asks that you refrain from raising issues about custom configurations, since we cannot support all the different kinds of setups or custom enhancements/changes you may wish to make to the project. The goal of this project is to be generic so you may fork and customize it to your whim.

You can reach out for support on Gitter for feedback on custom setups, but as a general rule we are not modifying anything within Scrapy except the scheduler, so all normal Scrapy functions should work as expected, including your middlewares.

I would like to close this and move the conversation elsewhere if that is acceptable.

madisonb commented 7 years ago

I will also note that we run a number of custom middlewares in production setups, including ones that reach out to other services, and the only problems we have had ended up being of our own doing. Closing this ticket tomorrow if I don't hear anything back.

hustshawn commented 7 years ago

@madisonb yes, you are right. The settings I posted are only part of the whole settings file; I believe these are the only ones that are relevant.

I tried your suggestion about the DUPEFILTER_CLASS, and it made no difference.

I really have no idea what may be causing the issue. What I want to achieve is to be able to replace scrapy.http.Request with scrapy_splash.SplashRequest wherever I need it. The problem is that the SplashRequest fails to yield a request without any error message. Incidentally, debugging in the Docker environment is really painful, and it does not show where the error/exception occurs.
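For reference, the kind of substitution I want looks roughly like this (a minimal sketch with placeholder names, not my real spider):

import scrapy
from scrapy_splash import SplashRequest

class ExampleSpider(scrapy.Spider):
    # Placeholder spider; in scrapy-cluster the spider base class and
    # request flow differ. This only illustrates the Request -> SplashRequest swap.
    name = 'example'

    def parse(self, response):
        # Previously: yield scrapy.http.Request(url, callback=self.parse_page)
        yield SplashRequest(
            response.urljoin('/js-page'),   # placeholder URL
            callback=self.parse_page,
            endpoint='render.html',         # default Splash endpoint
            args={'wait': 0.5},             # let the page's JavaScript settle
        )

    def parse_page(self, response):
        # response.body now contains the Splash-rendered HTML
        title = response.css('title::text').extract_first()
        self.logger.info('Rendered title: %s', title)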

I have run the offline tests; the results are below.

test_critical_log (test_log_factory.TestLogFactory) ... ok
test_debug_log (test_log_factory.TestLogFactory) ... ok
test_error_log (test_log_factory.TestLogFactory) ... ok
test_info_log (test_log_factory.TestLogFactory) ... ok
test_name_as_property (test_log_factory.TestLogFactory) ... FAIL
test_warn_log (test_log_factory.TestLogFactory) ... ok
test_log_file_json (test_log_factory.TestLogJSONFile) ... ok
test_preserve_extra (test_log_factory.TestLogJSONFile) ... ok
test_over (test_method_timer.TestMethodTimer) ... ok
test_params (test_method_timer.TestMethodTimer) ... ok
test_under (test_method_timer.TestMethodTimer) ... ok
test_clear (test_redis_queue.TestBase) ... ok
test_decode (test_redis_queue.TestBase) ... ERROR
test_encode (test_redis_queue.TestBase) ... ERROR
test_init (test_redis_queue.TestBase) ... ERROR
test_len (test_redis_queue.TestBase) ... ok
test_pop (test_redis_queue.TestBase) ... ok
test_push (test_redis_queue.TestBase) ... ok
test_len (test_redis_queue.TestRedisPriorityQueue) ... ERROR
test_pop (test_redis_queue.TestRedisPriorityQueue) ... ERROR
test_push (test_redis_queue.TestRedisPriorityQueue) ... ERROR
test_len (test_redis_queue.TestRedisQueue) ... ERROR
test_pop (test_redis_queue.TestRedisQueue) ... ERROR
test_push (test_redis_queue.TestRedisQueue) ... ERROR
test_len (test_redis_queue.TestRedisStack) ... ERROR
test_pop (test_redis_queue.TestRedisStack) ... ERROR
test_push (test_redis_queue.TestRedisStack) ... ERROR
test_moderated (test_redis_throttled_queue.TestModeratedRedisThrottledQueue) ... ok
test_unmoderated (test_redis_throttled_queue.TestUnmoderatedRedisThrottledQueue) ... ok
test_load_default (test_settings_wrapper.TestSettingsWrapper) ... ok
test_load_string (test_settings_wrapper.TestSettingsWrapper) ... ok
test_no_defaults (test_settings_wrapper.TestSettingsWrapper) ... ok
test_no_override (test_settings_wrapper.TestSettingsWrapper) ... ok
test_override_default (test_settings_wrapper.TestSettingsWrapper) ... ok
test_default_key (test_stats_collector.TestStatsAbstract) ... ok
test_not_implemented (test_stats_collector.TestStatsAbstract) ... ok
test_overloaded_key (test_stats_collector.TestStatsAbstract) ... ok

======================================================================
ERROR: test_decode (test_redis_queue.TestBase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 38, in test_decode
    q = Base(MagicMock(), 'key', pickle)
TypeError: __init__() takes exactly 3 arguments (4 given)

======================================================================
ERROR: test_encode (test_redis_queue.TestBase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 30, in test_encode
    q = Base(MagicMock(), 'key', pickle)
TypeError: __init__() takes exactly 3 arguments (4 given)

======================================================================
ERROR: test_init (test_redis_queue.TestBase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 21, in test_init
    q1 = Base(MagicMock(), 'key', pickle)
TypeError: __init__() takes exactly 3 arguments (4 given)

======================================================================
ERROR: test_len (test_redis_queue.TestRedisPriorityQueue)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 64, in setUp
    self.queue = self.queue_class(MagicMock(), 'key', ujson)
TypeError: __init__() takes exactly 3 arguments (4 given)

======================================================================
ERROR: test_pop (test_redis_queue.TestRedisPriorityQueue)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 64, in setUp
    self.queue = self.queue_class(MagicMock(), 'key', ujson)
TypeError: __init__() takes exactly 3 arguments (4 given)

======================================================================
ERROR: test_push (test_redis_queue.TestRedisPriorityQueue)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 64, in setUp
    self.queue = self.queue_class(MagicMock(), 'key', ujson)
TypeError: __init__() takes exactly 3 arguments (4 given)

======================================================================
ERROR: test_len (test_redis_queue.TestRedisQueue)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 64, in setUp
    self.queue = self.queue_class(MagicMock(), 'key', ujson)
TypeError: __init__() takes exactly 3 arguments (4 given)

======================================================================
ERROR: test_pop (test_redis_queue.TestRedisQueue)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 64, in setUp
    self.queue = self.queue_class(MagicMock(), 'key', ujson)
TypeError: __init__() takes exactly 3 arguments (4 given)

======================================================================
ERROR: test_push (test_redis_queue.TestRedisQueue)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 64, in setUp
    self.queue = self.queue_class(MagicMock(), 'key', ujson)
TypeError: __init__() takes exactly 3 arguments (4 given)

======================================================================
ERROR: test_len (test_redis_queue.TestRedisStack)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 64, in setUp
    self.queue = self.queue_class(MagicMock(), 'key', ujson)
TypeError: __init__() takes exactly 3 arguments (4 given)

======================================================================
ERROR: test_pop (test_redis_queue.TestRedisStack)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 64, in setUp
    self.queue = self.queue_class(MagicMock(), 'key', ujson)
TypeError: __init__() takes exactly 3 arguments (4 given)

======================================================================
ERROR: test_push (test_redis_queue.TestRedisStack)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 64, in setUp
    self.queue = self.queue_class(MagicMock(), 'key', ujson)
TypeError: __init__() takes exactly 3 arguments (4 given)

======================================================================
FAIL: test_name_as_property (test_log_factory.TestLogFactory)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_log_factory.py", line 24, in test_name_as_property
    self.assertEqual('test', self.logger.name)
AssertionError: 'test' != <bound method LogObject.name of <scutils.log_factory.LogObject object at 0x1015c4c10>>

Name                               Stmts   Miss  Cover   Missing
----------------------------------------------------------------
scutils.py                             0      0   100%   
scutils/log_factory.py               104      4    96%   83, 103-104, 211
scutils/method_timer.py               22      1    95%   59
scutils/redis_queue.py                61     32    48%   3-4, 27, 33, 69, 76, 82-89, 99, 107-109, 117-122, 134, 140, 146-154
scutils/redis_throttled_queue.py      79     18    77%   44, 53, 59, 65-67, 73, 84-87, 139, 149-159
scutils/settings_wrapper.py           61      8    87%   21, 31-34, 48-49, 53-54
scutils/stats_collector.py           199    108    46%   62-66, 83-86, 109-113, 135-139, 162-166, 189-193, 209-218, 242, 248, 275-297, 307-310, 316-319, 325-326, 332-339, 345, 351-357, 363-365, 371-376, 379, 394, 399-401, 404, 407-409, 423-426, 429-430, 433, 436-437, 449-454, 457-460, 463, 466, 476, 486, 489, 492, 503, 513, 516, 519, 531, 539, 542, 545
utils/scutils.py                       8      4    50%   5-8
----------------------------------------------------------------
TOTAL                                534    175    67%   
----------------------------------------------------------------------
Ran 37 tests in 3.101s

FAILED (errors=12, failures=1)
utils tests failed

hustshawn commented 7 years ago

Or is there any other way for scrapy-cluster to handle JavaScript-powered dynamic pages?

madisonb commented 7 years ago

From your unit tests above it looks like you are on the dev branch but do not have the dev scutils package installed. Check out the README, uninstall 1.1, and install 1.2dev; that should fix your problem with the unit tests.

The Docker containers themselves also have the tests built in; you can run ./run_docker_tests.sh inside the container to check both the offline and online unit tests. You can still use the Vagrant machine to run/debug your code as well; you don't have to use Docker.

Lastly, given that Scrapy Cluster is a wrapper around Scrapy with a modified scheduler, it is up to Scrapy and its associated middlewares to crawl JavaScript/dynamic pages. Perhaps you should look at the difference between a SplashRequest and a normal Scrapy Request, and use that information to modify your own setup.
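To illustrate the difference (a simplified sketch, not the actual scrapy_splash source): SplashMiddleware ultimately rewrites a SplashRequest into a plain request against the Splash HTTP API, carrying its arguments in meta, which is why middleware ordering and meta passthrough matter here.

import json
import scrapy

# Rough sketch of what scrapy_splash.SplashMiddleware produces from a
# SplashRequest. The real middleware also handles cookies, argument
# deduplication, and response wrapping.
def to_splash_http_request(target_url, splash_url='http://localhost:8050'):
    splash_args = {'url': target_url, 'wait': 0.5}
    return scrapy.Request(
        url=splash_url.rstrip('/') + '/render.html',  # request goes to Splash, not the target site
        method='POST',
        headers={'Content-Type': 'application/json'},
        body=json.dumps(splash_args),
        # the Splash arguments ride along in meta and must survive any
        # middleware that rewrites request meta
        meta={'splash': {'args': splash_args}},
    )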

hustshawn commented 7 years ago

Thanks for your suggestion. scutils was missing when I tried to use LogFactory for logging, so I just installed scutils from pip, and it works. Since I do not know how to uninstall the current version (say 1.1), I just used the above approach to fix the missing scutils. May I know what the differences between 1.1 and 1.2dev are? That will help me decide whether to uninstall the current version, and if I really have to, could you please give me a guide on how to uninstall and reinstall the cluster?

I also found that requests for new URLs come from the next_request() function in the scheduler, while in Scrapy there is a start_requests function that defines how the spider's initial requests are made. Is that the only way for scrapy_cluster to handle the initial request?

madisonb commented 7 years ago

You won't be able to run dev branch code without the dev version of scutils; your unit tests will fail and the cluster will not work. I would uninstall scutils 1.1 and install the dev version.
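If you installed it with pip, something like pip uninstall scutils followed by pip install . from inside the utils directory of a dev branch checkout should do it.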

Scrapy Cluster uses the Kafka Monitor to feed crawl requests into the cluster; the spiders are always either idle or processing crawl requests. If you use the start_requests() function I can't help support you, given that it goes against the paradigm this project is focused around.
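For example, a crawl request is fed to the cluster as a JSON blob on the Kafka Monitor's incoming topic. A minimal sketch with kafka-python (the topic name and fields shown here follow the default configuration; adjust for your setup):

import json
from kafka import KafkaProducer

# Feed one crawl request to the Kafka Monitor's incoming topic
# ('demo.incoming' is the default topic name).
producer = KafkaProducer(bootstrap_servers='localhost:9092')
crawl_request = {
    'url': 'http://example.com',  # seed URL to crawl
    'appid': 'testapp',           # identifies the requesting application
    'crawlid': 'abc123',          # identifies this particular crawl job
}
producer.send('demo.incoming', json.dumps(crawl_request).encode('utf-8'))
producer.flush()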

I refer you to the overall documentation here or the Kafka Monitor docs here for more information.