tianhuil closed this pull request 4 years ago.
Passes tests in all environments except for jessie because https://github.com/scrapy/queuelib/pull/21 has not been deployed to pypi: https://travis-ci.org/scrapy/scrapy/builds/352031724
Merging #3160 into master will decrease coverage by 0.04%. The diff coverage is 57.14%.
@@ Coverage Diff @@
## master #3160 +/- ##
==========================================
- Coverage 84.49% 84.44% -0.05%
==========================================
Files 167 167
Lines 9376 9401 +25
Branches 1392 1394 +2
==========================================
+ Hits 7922 7939 +17
- Misses 1199 1206 +7
- Partials 255 256 +1
| Impacted Files | Coverage Δ |
|---|---|
| scrapy/core/scheduler.py | 63.28% <57.14%> (+1.14%) :arrow_up: |
Thanks for deploying queuelib to pypi @dangra. It's now passing tests.
Thanks @tianhuil for pushing it forward!
But I'd like to give it more thought before merging; round-robin is an improvement over what we have now, but it is not the best solution (see e.g. https://github.com/scrapy/scrapy/issues/2474#issuecomment-288222104). Round-robin assumes requests from all domains are handled at roughly the same speed, which is not the case in the real world, especially if you're not doing a broad crawl but crawling a specific list of domains.
@tianhuil Are you willing to work on the proposed changes?
I'm happy to make the smaller changes proposed by @kmike but I thought this was held up by considerations of https://github.com/scrapy/scrapy/pull/3160#issuecomment-372788831
Let me know if I should proceed with everything except for https://github.com/scrapy/scrapy/pull/3160#issuecomment-372788831
The considerations in https://github.com/scrapy/scrapy/pull/3160#issuecomment-372788831 are reasonable. At the same time, this PR is a prerequisite to addressing them. After this PR lands in trunk we can continue working on a smarter way of scheduling requests.
OK -- I'll respond to everything except https://github.com/scrapy/scrapy/pull/3160#issuecomment-372788831 to help move things forward. Happy to do it. Give me a little bit.
Sorry for the delay. Responded to both of @kmike's helpful comments. Let me know if you would like anything else from me.
Is there any news on this? I've been eagerly awaiting this feature because I'm also seeing scrapy get stuck on a single domain in a targeted crawl.
@kmike: let me know what your thoughts are here. Would appreciate a fast response as there are merge conflicts popping up.
@kmike: let me know what your thoughts are: would love to get this merged before more conflicts crop up.
hey @tianhuil, I've left some more comments. Besides them, there are the following issues:
@kmike: I've documented that priorities are not supported. Looks like it's passing.
@kmike: let me know if you want to merge this. Would like to do it before conflicts re-emerge.
@kmike: let me know if you have any other questions or comments. Would like to do it before conflicts re-emerge.
Hey @tianhuil! Sorry for going back and forth.
My main concern is the lack of support for request priorities. There are implementations which do support them - in fact, all other implementations I'm aware of (mentioned in previous tickets) support request priorities - though they are incomplete in different ways.
It is not only about the user-facing feature of letting users set custom priorities for different Requests, like this:
scrapy.Request(url, self.parse, priority=5) # won't work
Scrapy uses request priorities itself, internally: they are used for retries and redirects, and they are also the basis of DepthMiddleware.
Not having priority support may cause strange issues. For example, redirects won't necessarily be at the top of the queue, so any redirect can potentially go to the end of the crawl and, depending on the spider, may never be crawled - this appears to be the case when the user uses FIFO queues, as recommended in the FAQ section on how to configure a BFO crawl. Actually, BFO won't work at all, as DEPTH_PRIORITY won't work.
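For context, the FAQ's recipe for a BFO crawl relies on DEPTH_PRIORITY together with FIFO queues, so a priority queue that ignores priorities silently breaks it. The settings look like this:

```python
# Settings from the Scrapy FAQ for crawling in breadth-first (BFO) order;
# a scheduler queue without priority support would ignore DEPTH_PRIORITY.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"
```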
Third-party packages may also rely on working request priorities; for example, https://github.com/scrapy-plugins/scrapy-splash will work badly with queues which don't support priorities.
Documentation should be updated in several more places, maybe even regardless of the implementation:
@kmike: can you provide a clear set of specifications you want this PR to meet? I'm happy to work on this but I don't have time to chase a moving target.
@tianhuil I'm fine with almost anything that makes things better and keeps the existing features working. I hadn't realized at first that not supporting priorities is not just a problem with the priority argument; it is also a problem for retries, redirects, BFO, scrapy-splash, etc. - sorry about that. So I think RoundRobin with support for priorities, adequately tested and documented, is good. There are several ways to support priorities in Round-Robin, such as:
(1) makes more sense to me as a default. We just need to make sure it is clearly documented.
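One way to sketch the "round-robin within each priority level" idea is a queue that cycles domains fairly but never serves a lower-priority request while a higher-priority one is queued anywhere. This is hypothetical code, not the PR's implementation; the push/pop signatures are assumptions:

```python
import heapq
from collections import deque

class RoundRobinPriorityQueue:
    """Sketch only: round-robin across domains within each priority level.
    Higher numbers mean higher priority, as in Scrapy."""

    def __init__(self):
        self.queues = {}     # (priority, domain) -> deque of requests
        self.rotations = {}  # priority -> deque of domains in round-robin order
        self.heap = []       # min-heap of negated active priorities

    def push(self, request, domain, priority=0):
        key = (priority, domain)
        if key not in self.queues:
            self.queues[key] = deque()
            if priority not in self.rotations:
                self.rotations[priority] = deque()
                heapq.heappush(self.heap, -priority)
            self.rotations[priority].append(domain)
        self.queues[key].append(request)

    def pop(self):
        if not self.heap:
            return None
        priority = -self.heap[0]            # best active priority
        rotation = self.rotations[priority]
        domain = rotation.popleft()         # next domain in the cycle
        queue = self.queues[(priority, domain)]
        request = queue.popleft()
        if queue:
            rotation.append(domain)         # domain stays in the cycle
        else:
            del self.queues[(priority, domain)]
            if not rotation:                # priority level exhausted
                del self.rotations[priority]
                heapq.heappop(self.heap)
        return request
```

With this shape, a priority-5 retry or redirect jumps ahead of all priority-0 requests, while requests of equal priority are still balanced across domains.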
Eventually we should be aiming for an even more efficient scheduler. From my point of view, we're fixing a bug here: Scrapy may under-utilize the downloader by filling it with requests it can't download because of its own concurrency limits.
Round-robin is a solution, but a partial one: it assumes all domains can be downloaded at the same speed. That is not always the case. For example, say we have domains A and B, both with many requests in the queue, and requests finish twice as fast for domain A. Global concurrency is set to 4, and per-domain concurrency is limited to 2. With round-robin, the downloader queue will be filled like this:
____ -> ABAB : requests are added in round-robin fashion; total size is 4
_BAB -> ABAB : A is downloaded; the next request is for A, we add it to the queue
AB_B -> ABBB : A is downloaded (it is twice as fast); B is added according to round-robin rules
A_BB -> AABB : B is downloaded; we add A
_ABB -> BABB : A is downloaded; we add B
B_BB -> BABB : A is downloaded; we add A
BAB_ -> BABB : B is downloaded; we add B
...
Note how we almost always have fewer A requests in the downloader queue than the optimal two - usually just one. It means that in this case there are no parallel requests to domain A, even though the user asked for 2. At the same time, there are more B requests (usually 3) than can be downloaded in parallel (the limit is 2); they're just sitting in the downloader queue, taking slots which could have been used for other domains.
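The walkthrough above can be checked with a small simulation. This is a sketch under simplifying assumptions (discrete time ticks, A finishing in 1 tick and B in 2, queue capacity 4, per-domain concurrency 2); the function name and model are mine, not Scrapy code:

```python
from collections import Counter

def avg_active_a(ticks=1000, capacity=4, per_domain=2):
    """Simulate a round-robin-filled downloader queue where domain A
    downloads twice as fast as domain B; return the average number of
    A requests actually downloading in parallel."""
    speed = {"A": 1, "B": 2}   # ticks needed per request
    order = ["A", "B"]         # round-robin supply of requests
    rr = 0
    queue = []                 # items: [domain, remaining_ticks]

    def refill():
        nonlocal rr
        while len(queue) < capacity:
            d = order[rr % 2]
            rr += 1
            queue.append([d, speed[d]])

    refill()
    total_active_a = 0
    for _ in range(ticks):
        # Only `per_domain` requests per domain download at once;
        # the rest just occupy downloader queue slots.
        counts = Counter()
        for item in queue:
            if counts[item[0]] < per_domain:
                counts[item[0]] += 1
                item[1] -= 1
        total_active_a += counts["A"]
        queue[:] = [i for i in queue if i[1] > 0]
        refill()
    return total_active_a / ticks
```

In this toy model the long-run number of active A requests settles well below the per-domain limit of 2, matching the walkthrough's claim that A is underutilized.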
In broad crawls it is rather common to get some domains which time out on all requests - maybe due to DNS issues. The default Scrapy timeout is 180s; with 2 retries it is 180*3 = 540s to process a request in the worst case. If the "average" domain takes 1s per request, and we download 100 domains in parallel, we may easily get the downloader queue filled 4/5 (or so; I haven't estimated it properly) with requests from this slow domain, leaving less opportunity to parallelize other domains, and so crawling slower than possible for no good reason.
To solve this issue the scheduler should avoid handing a request to the downloader if there is no free slot for it. This is how it works in frontera. Round-Robin can be modified to do the same - and that's the reason I'm fine with merging the round-robin implementation: it can be extended to support this in the future. For example, it could keep track of the current downloader state using the request_reached_downloader signal (see https://github.com/scrapy/scrapy/pull/3393) and other signals, and skip requests from a domain if adding one would push the Downloader over the concurrency limit for that domain. Instead of signals we may also query Downloader status directly.
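As a sketch of that idea (hypothetical code; in real Scrapy the in-flight counts would be maintained from the request_reached_downloader and response signals, or by querying the Downloader's slots directly):

```python
from collections import deque

class DownloaderAwareQueue:
    """Sketch only: round-robin queue that skips domains whose
    downloader slot is already saturated."""

    def __init__(self, per_domain_limit=2):
        self.limit = per_domain_limit
        self.domains = deque()  # round-robin rotation of active domains
        self.queues = {}        # domain -> deque of pending requests
        self.in_flight = {}     # domain -> current download count

    def push(self, request, domain):
        if domain not in self.queues:
            self.queues[domain] = deque()
            self.domains.append(domain)
        self.queues[domain].append(request)

    def pop(self):
        # Scan at most one full rotation, skipping saturated domains.
        for _ in range(len(self.domains)):
            domain = self.domains[0]
            self.domains.rotate(-1)
            if self.in_flight.get(domain, 0) >= self.limit:
                continue  # no free slot: don't waste a downloader slot
            queue = self.queues[domain]
            request = queue.popleft()
            self.in_flight[domain] = self.in_flight.get(domain, 0) + 1
            if not queue:
                del self.queues[domain]
                self.domains.remove(domain)
            return request
        return None  # every queued domain is saturated

    def request_finished(self, domain):
        # In a real implementation this would be driven by signals.
        self.in_flight[domain] -= 1
```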
Hello, is there any progress on this? We have an application where we have a targeted crawl over a small number of domains. Performance is currently very poor, as the queue quickly gets filled with a single slow domain.
We don't need priorities or anything fancy. A limited, 90% solution today would be infinitely preferable to a perfect 100% solution in a year (this review has already taken 7 months).
hey @akirmse! There are third-party solutions which you can use without changing Scrapy. For example, you can use RoundRobinPriorityQueue from https://github.com/TeamHG-Memex/linkdepth/blob/master/queues.py:
SCHEDULER_PRIORITY_QUEUE = 'queues.RoundRobinPriorityQueue'
SCHEDULER_DISK_QUEUE = 'queues.DiskQueue'
It actually supports priorities (so all Scrapy features work properly), but has other limitations - crawl resume is not supported. Priorities are not something fancy; they are needed e.g. for retries, redirects, BFO crawling, and scrapy-splash to work properly.
There are also other workarounds, e.g. using frontera. It is not like this is impossible to do with Scrapy, and it is not even a lot of code to get a partial solution which doesn't need any changes in Scrapy core. That's why I'd prefer to have something proper, integrated well with all Scrapy features, if we're going to make a change in Scrapy itself.
@kmike Can this be merged in? @tianhuil's change makes scrapy usable for crawls over more than one domain; the current version is unusable due to the scheduler. As this isn't the default scheduler, declaring it beta and getting usage would solve a lot of issues that have existed for years.
@amonroe https://github.com/scrapy/scrapy/pull/3520 is much closer to being merged at the moment; it solves the same problem, but covers more edge cases, has more tests, and has better behavior - looking at downloader activity instead of using round-robin can speed up crawling significantly in many practical cases.
Thanks @kmike. I'll try that PR as well. So far #3160 has started to show some (not much) balancing across domains after having been blackholed on one for a day. Prior backlog is probably impacting results. No issues as of yet.
Closing it, as https://github.com/scrapy/scrapy/pull/3520 is merged. Thanks @tianhuil for getting the ball rolling!
Merges the DomainScheduler implemented in https://github.com/tianhuil/domain_scheduler into scrapy. It scrapes in a domain-smart way, by round-robin cycling through the domains. This has two benefits:

1. ... `CONCURRENT_REQUESTS_PER_IP` restrictions. Empirical testing has shown this to be quite effective.
2. It implements the solution proposed in https://github.com/scrapy/scrapy/issues/1802#issue-135260562, which found similar performance improvements.

The original proposal was first posted in https://github.com/scrapy/scrapy/issues/1802 (code no longer available), then in https://github.com/scrapy/scrapy/issues/2474 and https://github.com/scrapy/scrapy/issues/3140.
Note: it requires more than just setting SCHEDULER_PRIORITY_QUEUE, as it needs an API change to the queue (passing a non-integer key, i.e. the domain, to a new round-robin queue).

Would love to get feedback @pablohoffman, @dangra, and @kmike!
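The API change in question amounts to the scheduler passing a string key derived from the request instead of an integer priority. A hypothetical helper (name and signature are mine, not the PR's code) for deriving such a key might look like:

```python
from urllib.parse import urlparse

def request_key(url):
    # Hypothetical helper: the round-robin queue is keyed by domain
    # (a string) rather than by an integer priority.
    return urlparse(url).netloc
```

With a helper like this, the scheduler would enqueue with something like `queue.push(request, request_key(request.url))` rather than `queue.push(request, priority)`.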