hustshawn closed this issue 7 years ago
I don't have in-depth knowledge of the scrapy-splash integration, but I do see you are rolling your own `DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'`, which is different from the one Scrapy Cluster comes with. This may be your issue, since you are changing/overwriting a built-in filter.
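To illustrate why this matters: Scrapy allows only one `DUPEFILTER_CLASS`, so whichever setting is applied last silently replaces the other. The sketch below is illustrative only; the cluster dupefilter path shown is an assumption and may differ in your version.

```python
# Hypothetical sketch of the settings conflict (class paths are illustrative).
cluster_defaults = {
    # scrapy-cluster ships its own Redis-backed dupefilter (path assumed here)
    "DUPEFILTER_CLASS": "crawling.redis_dupefilter.RFPDupeFilter",
}

splash_overrides = {
    # the scrapy-splash docs tell you to set this one instead
    "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
}

# Merging the two, the later value wins -- the cluster-wide
# duplicate filtering is silently replaced.
merged = {**cluster_defaults, **splash_overrides}
print(merged["DUPEFILTER_CLASS"])  # scrapy_splash.SplashAwareDupeFilter
```

The point is not that either class is wrong, only that the two projects both claim the same setting, so you cannot have both behaviors at once.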
Also, looking at your settings, I hope this is only a subset of your settings file, given that the Scrapy Cluster settings file is much larger than what you have displayed here and requires a lot more configuration.
I would point you to the offline testing documentation (to see if everything is functioning) here. Run those tests and be sure everything passes. If they all pass, you can move on to your integration tests here, which test each component's functionality given a working Scrapy Cluster setup. If those all pass, your cluster appears to be working as expected.
Lastly, given that this is a custom configuration issue I point you to the documentation here that asks that you refrain from raising issues about custom configurations, since we cannot support all different kinds of setups or custom enhancements/changes you wish to make to the project. The goal of this project is to be generic so you may fork/customize it to your whim.
You can reach out for support on Gitter for feedback on custom setups, but a general rule is that we are not modifying anything within Scrapy except the scheduler and all normal Scrapy functions should work as expected, including your middlewares.
I would like to close this and move the conversation elsewhere if that is acceptable.
I also note that we run a number of custom middlewares in production setups, including ones that reach out to other services, and the only problems we have had ended up being of our own doing. I will close this ticket tomorrow if I don't hear anything back.
@madisonb yes, you are right. The settings I gave are just part of the whole settings file, and I think only these are relevant.
I tried your suggestion about the `DUPEFILTER_CLASS` issue, and it made no difference.
I really have no idea what may cause the issue. What I want to achieve is to be able to replace `scrapy.http.Request` with `scrapy_splash.SplashRequest` wherever I need it. The problem is that the `SplashRequest` is not able to yield a request, and no error is prompted. By the way, debugging in the Docker environment is really painful, and it may not show where the error/exception occurs.
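One way to narrow this down: a `SplashRequest` ultimately results in an HTTP call to Splash's `render.html` endpoint. As a debugging sketch (the `SPLASH_URL` value below is an assumption; use whatever your Splash instance actually listens on), you can build that URL by hand and try it in a browser or with `curl`; if it renders but the spider still yields nothing, the problem is in the middleware chain rather than in Splash itself:

```python
from urllib.parse import urlencode

SPLASH_URL = "http://localhost:8050"  # assumption: Splash on its default port

def splash_render_url(target_url, wait=0.5):
    """Build the render.html URL that a SplashRequest effectively hits.

    Handy for isolating failures: if this URL works when accessed
    directly, Splash is fine and the issue lies in the Scrapy side.
    """
    qs = urlencode({"url": target_url, "wait": wait})
    return "{}/render.html?{}".format(SPLASH_URL, qs)

print(splash_render_url("http://example.com"))
# http://localhost:8050/render.html?url=http%3A%2F%2Fexample.com&wait=0.5
```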
I have run the offline tests; the results are below.
test_critical_log (test_log_factory.TestLogFactory) ... ok
test_debug_log (test_log_factory.TestLogFactory) ... ok
test_error_log (test_log_factory.TestLogFactory) ... ok
test_info_log (test_log_factory.TestLogFactory) ... ok
test_name_as_property (test_log_factory.TestLogFactory) ... FAIL
test_warn_log (test_log_factory.TestLogFactory) ... ok
test_log_file_json (test_log_factory.TestLogJSONFile) ... ok
test_preserve_extra (test_log_factory.TestLogJSONFile) ... ok
test_over (test_method_timer.TestMethodTimer) ... ok
test_params (test_method_timer.TestMethodTimer) ... ok
test_under (test_method_timer.TestMethodTimer) ... ok
test_clear (test_redis_queue.TestBase) ... ok
test_decode (test_redis_queue.TestBase) ... ERROR
test_encode (test_redis_queue.TestBase) ... ERROR
test_init (test_redis_queue.TestBase) ... ERROR
test_len (test_redis_queue.TestBase) ... ok
test_pop (test_redis_queue.TestBase) ... ok
test_push (test_redis_queue.TestBase) ... ok
test_len (test_redis_queue.TestRedisPriorityQueue) ... ERROR
test_pop (test_redis_queue.TestRedisPriorityQueue) ... ERROR
test_push (test_redis_queue.TestRedisPriorityQueue) ... ERROR
test_len (test_redis_queue.TestRedisQueue) ... ERROR
test_pop (test_redis_queue.TestRedisQueue) ... ERROR
test_push (test_redis_queue.TestRedisQueue) ... ERROR
test_len (test_redis_queue.TestRedisStack) ... ERROR
test_pop (test_redis_queue.TestRedisStack) ... ERROR
test_push (test_redis_queue.TestRedisStack) ... ERROR
test_moderated (test_redis_throttled_queue.TestModeratedRedisThrottledQueue) ... ok
test_unmoderated (test_redis_throttled_queue.TestUnmoderatedRedisThrottledQueue) ... ok
test_load_default (test_settings_wrapper.TestSettingsWrapper) ... ok
test_load_string (test_settings_wrapper.TestSettingsWrapper) ... ok
test_no_defaults (test_settings_wrapper.TestSettingsWrapper) ... ok
test_no_override (test_settings_wrapper.TestSettingsWrapper) ... ok
test_override_default (test_settings_wrapper.TestSettingsWrapper) ... ok
test_default_key (test_stats_collector.TestStatsAbstract) ... ok
test_not_implemented (test_stats_collector.TestStatsAbstract) ... ok
test_overloaded_key (test_stats_collector.TestStatsAbstract) ... ok
======================================================================
ERROR: test_decode (test_redis_queue.TestBase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 38, in test_decode
q = Base(MagicMock(), 'key', pickle)
TypeError: __init__() takes exactly 3 arguments (4 given)
======================================================================
ERROR: test_encode (test_redis_queue.TestBase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 30, in test_encode
q = Base(MagicMock(), 'key', pickle)
TypeError: __init__() takes exactly 3 arguments (4 given)
======================================================================
ERROR: test_init (test_redis_queue.TestBase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 21, in test_init
q1 = Base(MagicMock(), 'key', pickle)
TypeError: __init__() takes exactly 3 arguments (4 given)
======================================================================
ERROR: test_len (test_redis_queue.TestRedisPriorityQueue)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 64, in setUp
self.queue = self.queue_class(MagicMock(), 'key', ujson)
TypeError: __init__() takes exactly 3 arguments (4 given)
======================================================================
ERROR: test_pop (test_redis_queue.TestRedisPriorityQueue)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 64, in setUp
self.queue = self.queue_class(MagicMock(), 'key', ujson)
TypeError: __init__() takes exactly 3 arguments (4 given)
======================================================================
ERROR: test_push (test_redis_queue.TestRedisPriorityQueue)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 64, in setUp
self.queue = self.queue_class(MagicMock(), 'key', ujson)
TypeError: __init__() takes exactly 3 arguments (4 given)
======================================================================
ERROR: test_len (test_redis_queue.TestRedisQueue)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 64, in setUp
self.queue = self.queue_class(MagicMock(), 'key', ujson)
TypeError: __init__() takes exactly 3 arguments (4 given)
======================================================================
ERROR: test_pop (test_redis_queue.TestRedisQueue)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 64, in setUp
self.queue = self.queue_class(MagicMock(), 'key', ujson)
TypeError: __init__() takes exactly 3 arguments (4 given)
======================================================================
ERROR: test_push (test_redis_queue.TestRedisQueue)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 64, in setUp
self.queue = self.queue_class(MagicMock(), 'key', ujson)
TypeError: __init__() takes exactly 3 arguments (4 given)
======================================================================
ERROR: test_len (test_redis_queue.TestRedisStack)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 64, in setUp
self.queue = self.queue_class(MagicMock(), 'key', ujson)
TypeError: __init__() takes exactly 3 arguments (4 given)
======================================================================
ERROR: test_pop (test_redis_queue.TestRedisStack)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 64, in setUp
self.queue = self.queue_class(MagicMock(), 'key', ujson)
TypeError: __init__() takes exactly 3 arguments (4 given)
======================================================================
ERROR: test_push (test_redis_queue.TestRedisStack)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_redis_queue.py", line 64, in setUp
self.queue = self.queue_class(MagicMock(), 'key', ujson)
TypeError: __init__() takes exactly 3 arguments (4 given)
======================================================================
FAIL: test_name_as_property (test_log_factory.TestLogFactory)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/shawnzhang/Google Drive/src/scrapy-cluster/utils/tests/test_log_factory.py", line 24, in test_name_as_property
self.assertEqual('test', self.logger.name)
AssertionError: 'test' != <bound method LogObject.name of <scutils.log_factory.LogObject object at 0x1015c4c10>>
Name Stmts Miss Cover Missing
----------------------------------------------------------------
scutils.py 0 0 100%
scutils/log_factory.py 104 4 96% 83, 103-104, 211
scutils/method_timer.py 22 1 95% 59
scutils/redis_queue.py 61 32 48% 3-4, 27, 33, 69, 76, 82-89, 99, 107-109, 117-122, 134, 140, 146-154
scutils/redis_throttled_queue.py 79 18 77% 44, 53, 59, 65-67, 73, 84-87, 139, 149-159
scutils/settings_wrapper.py 61 8 87% 21, 31-34, 48-49, 53-54
scutils/stats_collector.py 199 108 46% 62-66, 83-86, 109-113, 135-139, 162-166, 189-193, 209-218, 242, 248, 275-297, 307-310, 316-319, 325-326, 332-339, 345, 351-357, 363-365, 371-376, 379, 394, 399-401, 404, 407-409, 423-426, 429-430, 433, 436-437, 449-454, 457-460, 463, 466, 476, 486, 489, 492, 503, 513, 516, 519, 531, 539, 542, 545
utils/scutils.py 8 4 50% 5-8
----------------------------------------------------------------
TOTAL 534 175 67%
----------------------------------------------------------------------
Ran 37 tests in 3.101s
FAILED (errors=12, failures=1)
utils tests failed
Or is there any other way for scrapy-cluster to handle JavaScript-powered dynamic pages?
From your unit tests above it looks like you are on the `dev` branch, but do not have the `dev` version of the `scutils` package installed. Check out the readme, uninstall 1.1 and install 1.2dev, and that should fix your problem with the unit tests.
The Docker containers also have the tests built into them; you can run `./run_docker_tests.sh` inside the container to check both the offline and online unit tests. You can still use the Vagrant machine to run/debug your code as well; you don't have to use Docker.
Lastly, given that Scrapy Cluster is a wrapper around Scrapy with a modified scheduler, it is up to Scrapy and its associated middlewares to crawl JavaScript/dynamic pages. Perhaps you should look at the difference between a `SplashRequest` and a normal Scrapy `Request`, and use that information to modify your setup accordingly.
Thanks for your suggestion. The `scutils` package was missing when I tried to apply logs with `LogFactory`; I installed `scutils` from pip and it works. Since I do not know how to uninstall the current version (say 1.1), I just used the above way to fix the missing `scutils` issue. May I know the differences between 1.1 and 1.2dev? That will help me decide whether to uninstall the current version, and if I really have to, could you please give me a guide on how to uninstall and reinstall the cluster?
I also found that the request for a new URL comes from the `next_request()` function in the scheduler, while in `scrapy` there is a `start_requests` function which defines how the spider's initial requests are made. Is that the only way for `scrapy_cluster` to handle the initial request?
You won't be able to run `dev` branch code without the `dev` version of scutils; your unit tests will fail and the cluster will not work. I would uninstall scutils 1.1 and install the dev version.
Scrapy Cluster uses the Kafka Monitor to feed crawl requests into the cluster, given that the spiders are always either idle or processing crawl requests. If you used the `start_requests()` function, I can't help support you, given that it goes against the paradigm the project is focused around.
I refer you to the overall documentation here, or the Kafka Monitor docs here, for more information.
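For the Kafka Monitor workflow, crawl requests are submitted as JSON rather than created in the spider. A minimal sketch of such a payload is below; the field names follow the shape described in the Scrapy Cluster docs, but the exact schema for your version is an assumption you should verify against the Kafka Monitor documentation:

```python
import json

# Illustrative crawl request in the shape the Kafka Monitor expects.
# "url", "appid", and "crawlid" are the commonly documented fields;
# your version's API spec may require or allow others.
crawl_request = {
    "url": "http://example.com",   # page to crawl
    "appid": "testapp",            # identifier for your application
    "crawlid": "abc123",           # identifier for this crawl job
}

payload = json.dumps(crawl_request)
print(payload)
```

This payload would then be fed to the Kafka Monitor (for example via its `feed` command), which validates it and injects it into the cluster, replacing the role `start_requests()` plays in a single-node project.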
In a single-node Scrapy project, the settings below, as the scrapy-splash documentation indicates, work well.
But if I integrate with scrapy-cluster using the settings below, a `SplashRequest` may not successfully send a request to `splash`, so `splash` will not respond. Actually, `splash` itself works fine when I directly access it with a URL constructed from the `render.html` endpoint. Does anyone know what's going wrong with the settings?
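The settings referred to above were not captured in this thread. For context, the standard configuration from the scrapy-splash README looks roughly like the sketch below; the `SPLASH_URL` value is an assumption, and the middleware priorities are the ones the scrapy-splash docs recommend:

```python
# Standard scrapy-splash settings per its README (SPLASH_URL is an example).
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# These two lines are the ones most likely to clash with Scrapy Cluster,
# which supplies its own dupefilter and scheduler-related machinery.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```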