scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Adding domain #3160

Closed tianhuil closed 4 years ago

tianhuil commented 6 years ago

Merges DomainScheduler implemented in https://github.com/tianhuil/domain_scheduler into scrapy. It scrapes in a domain-smart way: by round-robin cycling through the domains. This has two benefits:

  1. Spreading out load on the target servers instead of hitting the server with many requests at once
  2. Reducing delays caused by server throttling or Scrapy's own CONCURRENT_REQUESTS_PER_IP restrictions (see the settings sketch below). Empirical testing has shown this to be quite effective.
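
For context, these are the (real) Scrapy throttling settings whose interaction the domain-aware ordering has to play nicely with; the values below are purely illustrative:

# settings.py -- setting names are real Scrapy settings; values are illustrative only
CONCURRENT_REQUESTS = 32             # global cap on requests in the downloader
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # per-domain cap
CONCURRENT_REQUESTS_PER_IP = 2       # per-IP cap; if non-zero, it is used instead of the per-domain cap
DOWNLOAD_DELAY = 0.25                # politeness delay between requests to the same slot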

It implements the solution proposed in https://github.com/scrapy/scrapy/issues/1802#issue-135260562, which reported similar performance improvements. The original proposal was first posted in https://github.com/scrapy/scrapy/issues/1802 (code no longer available) and discussed further in https://github.com/scrapy/scrapy/issues/2474 and https://github.com/scrapy/scrapy/issues/3140.

Note: this requires more than just setting SCHEDULER_PRIORITY_QUEUE, as it needs an API change to the queue (passing a non-integer key, i.e. the domain, to a new round-robin queue).
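
To illustrate the shape of that API change, here is a minimal sketch of a domain-keyed round-robin queue (class and method names are hypothetical, not the actual DomainScheduler code):

from collections import defaultdict, deque

class RoundRobinQueue:
    """Minimal sketch: requests are stored per domain and popped by cycling
    through the domains. Illustrative only."""

    def __init__(self):
        self._queues = defaultdict(deque)  # domain -> FIFO of requests
        self._domains = deque()            # rotation of domains with pending work

    def push(self, request, key):
        # 'key' is the non-integer key mentioned above, e.g. the request's domain
        if key not in self._queues:
            self._domains.append(key)
        self._queues[key].append(request)

    def pop(self):
        if not self._domains:
            return None
        domain = self._domains.popleft()
        request = self._queues[domain].popleft()
        if self._queues[domain]:
            self._domains.append(domain)  # domain still has work: move to the back
        else:
            del self._queues[domain]
        return request

    def __len__(self):
        return sum(len(q) for q in self._queues.values())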

Would love to get feedback @pablohoffman, @dangra, and @kmike!

tianhuil commented 6 years ago

Passes tests in all environments except jessie, because https://github.com/scrapy/queuelib/pull/21 has not been released to PyPI: https://travis-ci.org/scrapy/scrapy/builds/352031724

codecov[bot] commented 6 years ago

Codecov Report

Merging #3160 into master will decrease coverage by 0.04%. The diff coverage is 57.14%.

@@            Coverage Diff             @@
##           master    #3160      +/-   ##
==========================================
- Coverage   84.49%   84.44%   -0.05%     
==========================================
  Files         167      167              
  Lines        9376     9401      +25     
  Branches     1392     1394       +2     
==========================================
+ Hits         7922     7939      +17     
- Misses       1199     1206       +7     
- Partials      255      256       +1
Impacted Files             Coverage Δ
scrapy/core/scheduler.py   63.28% <57.14%> (+1.14%) ↑

tianhuil commented 6 years ago

Thanks for deploying queuelib to PyPI @dangra. It's now passing tests.

kmike commented 6 years ago

Thanks @tianhuil for pushing it forward!

But I'd like to give it more thought before merging; round-robin is an improvement over what we have now, but it is not the best solution (see e.g. https://github.com/scrapy/scrapy/issues/2474#issuecomment-288222104). Round-robin assumes requests from all domains are handled at roughly the same speed, which is not the case in the real world, especially if you're not doing a broad crawl but crawling a specific list of domains.

whalebot-helmsman commented 5 years ago

@tianhuil Are you willing to work on the proposed changes?

tianhuil commented 5 years ago

I'm happy to make the smaller changes proposed by @kmike, but I thought this was held up by the considerations in https://github.com/scrapy/scrapy/pull/3160#issuecomment-372788831

Let me know if I should proceed with everything except for https://github.com/scrapy/scrapy/pull/3160#issuecomment-372788831

whalebot-helmsman commented 5 years ago

The considerations in https://github.com/scrapy/scrapy/pull/3160#issuecomment-372788831 are reasonable. At the same time, this PR is a prerequisite to addressing them. After this PR lands in trunk, we can continue work on a smarter way of scheduling requests.

tianhuil commented 5 years ago

OK -- I"ll respond to everything except for https://github.com/scrapy/scrapy/pull/3160#issuecomment-372788831 to help move things forward. Happy to do it. Give me a little bit.

tianhuil commented 5 years ago

Sorry for the delay. Responded to both of @kmike's helpful comments. Let me know if you would like anything else from me.

akirmse commented 5 years ago

Is there any news on this? I've been eagerly awaiting this feature because I'm also seeing scrapy get stuck on a single domain in a targeted crawl.

tianhuil commented 5 years ago

@kmike: let me know what your thoughts are here. Would appreciate a fast response as there are merge conflicts popping up.

tianhuil commented 5 years ago

@kmike: let me know what your thoughts are: would love to get this merged before more conflicts crop up.

kmike commented 5 years ago

hey @tianhuil, I've left some more comments. Besides them, there are the following issues:

  1. Merge conflicts should be resolved.
  2. Tests should pass.
  3. It is still not documented that request priorities don't work. I'd prefer them to work, but if they don't, it should be clearly documented, both in the scheduler docs and in the Request docs.

tianhuil commented 5 years ago

@kmike: I've documented that priorities are not used. Looks like it's passing.

tianhuil commented 5 years ago

@kmike: let me know if you want to merge this. Would like to do it before conflicts re-emerge.

tianhuil commented 5 years ago

@kmike: let me know if you have any other questions or comments. Would like to do it before conflicts re-emerge.

kmike commented 5 years ago

Hey @tianhuil! Sorry for going back and forth.

My main concern is the lack of support for request priorities. There are implementations which support request priorities - in fact, all other implementations I'm aware of (mentioned in previous tickets) support them - though they are incomplete in different ways.

It is not only about the user-facing feature of allowing users to set custom priorities for different Requests, like this:

scrapy.Request(url, self.parse, priority=5)  # won't work

Scrapy uses request priorities itself, internally: they are used for retries and redirects, and they are the basis of DepthMiddleware.
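
To make that internal reliance concrete, here is a small illustrative snippet (RETRY_PRIORITY_ADJUST, REDIRECT_PRIORITY_ADJUST and DEPTH_PRIORITY are real settings, shown with their default or typical values; the code only mimics what the middlewares do and is not their actual implementation):

import scrapy

RETRY_PRIORITY_ADJUST = -1    # default: retried requests are demoted slightly
REDIRECT_PRIORITY_ADJUST = 2  # default: redirect targets are promoted
DEPTH_PRIORITY = 1            # default is 0; a non-zero value biases ordering by depth

original = scrapy.Request('https://example.com/page', priority=0)

# A retried copy is demoted so it does not block fresh requests:
retry = original.replace(priority=original.priority + RETRY_PRIORITY_ADJUST)

# A redirect target is promoted so it is followed promptly:
redirect = original.replace(priority=original.priority + REDIRECT_PRIORITY_ADJUST)

# DepthMiddleware shifts priority by crawl depth, which is what BFO/DFO tuning relies on:
depth = 3
deeper = original.replace(priority=original.priority - depth * DEPTH_PRIORITY)

print(retry.priority, redirect.priority, deeper.priority)  # -1 2 -3

A queue that ignores priority silently discards all of these adjustments.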

Not having priority support may cause strange issues: for example, redirects won't necessarily be at the top of the queue, so any redirect can potentially go to the end of the crawl and, depending on the spider, may never be crawled - this seems to be the case if the user uses FIFO queues, as recommended in the FAQ section on how to configure a BFO crawl. Actually, BFO won't work at all, as DEPTH_PRIORITY won't work.

Third-party packages may also rely on working request priorities; for example, https://github.com/scrapy-plugins/scrapy-splash will work badly with queues which don't support priorities.

Documentation should be updated in several more places, maybe even regardless of the implementation:

tianhuil commented 5 years ago

@kmike: can you provide a clear set of specifications you want this PR to meet? I'm happy to work on this but I don't have time to chase a moving target.

kmike commented 5 years ago

@tianhuil I'm fine with almost everything which makes things better and keeps the existing features working. I didn't realize at first that not supporting priorities is not just a problem with the priority argument; it is also a problem for retries, redirects, BFO, scrapy-splash, etc. - sorry about that. So I think round-robin with support for priorities, adequately tested and documented, is good. There are several ways to support priorities in round-robin, such as:

  1. Priorities are respected for each domain, but domain is chosen in a round-robin fashion.
  2. Only requests of the same priority are chosen in a round-robin fashion; priorities are respected globally.

(1) makes more sense to me as a default. We just need to make sure it is clearly documented.
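
A minimal sketch of option (1) - priorities respected within each domain, domains rotated round-robin - could look roughly like this (hypothetical names, using heapq for the per-domain queues):

import heapq
from collections import deque
from itertools import count

class DomainRoundRobinPriorityQueue:
    """Sketch only: per-domain priority queues, domains chosen round-robin."""

    def __init__(self):
        self._heaps = {}         # domain -> heap of (-priority, tiebreak, request)
        self._domains = deque()  # rotation of domains with pending work
        self._counter = count()  # FIFO tie-break for equal priorities

    def push(self, request, domain, priority=0):
        if domain not in self._heaps:
            self._heaps[domain] = []
            self._domains.append(domain)
        # negate: heapq is a min-heap, but a higher Request priority should come out first
        heapq.heappush(self._heaps[domain], (-priority, next(self._counter), request))

    def pop(self):
        if not self._domains:
            return None
        domain = self._domains.popleft()
        _, _, request = heapq.heappop(self._heaps[domain])
        if self._heaps[domain]:
            self._domains.append(domain)  # domain still has work: back of the rotation
        else:
            del self._heaps[domain]
        return request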


Eventually we should be aiming for an even more efficient scheduler. From my point of view, we're fixing a bug here: Scrapy may under-utilize the downloader by filling it with requests it can't download because of its own concurrency limits.

Round-robin is a solution, but a partial one: it assumes all domains can be downloaded at the same speed, which is not always the case. For example, let's say we have domains A and B, both with many requests in the queue, and a request finishes twice as fast for domain A. Global concurrency is set to 4, and per-domain concurrency is limited to 2. With round-robin, the downloader queue will be filled like this:

  1. ____ -> ABAB: requests are added in a round-robin fashion, total size is 4.
  2. _BAB -> ABAB: A is downloaded, next request is for A, we add it to the queue
  3. AB_B -> ABBB: A is downloaded (it is twice as fast), B is added according to round-robin rules
  4. A_BB -> AABB: B is downloaded, we add A
  5. _ABB -> BABB: A is downloaded, we add B
  6. B_BB -> BABB: A is downloaded, we add A
  7. BAB_ -> BABB: B is downloaded, we add B ...

Note how we almost always have fewer A requests in the downloader queue - one instead of the optimal two; this means there are no parallel requests to domain A, even though the user asked for 2 parallel requests. At the same time, there are more B requests (usually 3) than can be downloaded in parallel (as the limit is 2); they're just sitting in the downloader queue, taking slots which could have been used for other domains.

In broad crawls it is rather common to get some domains which time out on all requests, perhaps because of DNS issues. The default Scrapy timeout is 180s; with 2 retries that is 180*3 = 540s to process a request in the worst case. If the "average" domain takes 1s per request, and we download 100 domains in parallel, we may easily get the downloader queue filled 4/5 (or so - I haven't estimated it properly) with requests from this slow domain, leaving less opportunity for parallelization of other domains, and so crawling slower than possible for no good reason.

To solve this issue, the scheduler should avoid adding a request to the downloader if there is no free slot for it. This is how it works in Frontera. Round-robin can be modified to do the same - and that's the reason I'm fine with merging the round-robin implementation: it can be extended to support this in the future. For example, it can keep track of the current downloader state using the request_reached_downloader signal (see https://github.com/scrapy/scrapy/pull/3393) and other signals, and skip adding a request from a domain if doing so would push the Downloader over the concurrency limit for that domain. Instead of signals, we may also query the Downloader status directly.
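
A rough sketch of that signal-based bookkeeping (hypothetical class; it assumes the request_reached_downloader signal from the PR above and uses response_downloaded as the decrement side, ignoring error/failure paths for brevity):

from collections import defaultdict
from urllib.parse import urlparse

from scrapy import signals

class DownloaderLoadTracker:
    """Sketch only: count per-domain requests currently in the downloader, so
    a round-robin pop() can skip domains that are already at their limit."""

    def __init__(self, crawler, per_domain_limit):
        self.per_domain_limit = per_domain_limit
        self.active = defaultdict(int)  # domain -> requests currently downloading
        crawler.signals.connect(self._reached, signal=signals.request_reached_downloader)
        crawler.signals.connect(self._downloaded, signal=signals.response_downloaded)

    def _domain(self, request):
        return urlparse(request.url).netloc

    def _reached(self, request, spider):
        self.active[self._domain(request)] += 1

    def _downloaded(self, response, request, spider):
        self.active[self._domain(request)] -= 1

    def has_free_slot(self, domain):
        # pop() would consult this before handing out a request for `domain`
        return self.active[domain] < self.per_domain_limit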

akirmse commented 5 years ago

Hello, is there any progress on this? We have an application where we have a targeted crawl over a small number of domains. Performance is currently very poor, as the queue quickly gets filled with requests from a single slow domain.

We don't need priorities or anything fancy. A limited, 90% solution today would be infinitely preferable to a perfect 100% solution in a year (this review has already taken 7 months).

kmike commented 5 years ago

hey @akirmse! There are third-party solutions which you can use without changing Scrapy. For example, you can use RoundRobinPriorityQueue from https://github.com/TeamHG-Memex/linkdepth/blob/master/queues.py:

SCHEDULER_PRIORITY_QUEUE = 'queues.RoundRobinPriorityQueue'
SCHEDULER_DISK_QUEUE = 'queues.DiskQueue'

It actually supports priorities (so all Scrapy features work properly), but it has other limitations - crawl resume is not supported. Priorities are not something fancy; they are needed e.g. for retries, redirects, BFO crawling, and scrapy-splash to work properly.

There are also other workarounds, e.g. using Frontera. It's not that this is impossible to do with Scrapy, and it is not even a lot of code to get a partial solution which doesn't need any changes in Scrapy core. That's why I'd prefer to have something proper, well integrated with all Scrapy features, if we're going to make a change in Scrapy itself.

amonroe commented 5 years ago

@kmike Can this be merged in? @tianhuil's change makes Scrapy usable for more than one domain; the current version is unusable because of the scheduler. As this isn't the default scheduler, declaring it beta and getting usage would solve a lot of issues that have existed for years.

kmike commented 5 years ago

@amonroe https://github.com/scrapy/scrapy/pull/3520 is much closer to being merged at the moment; it solves the same problem, but covers more edge cases, has more tests, and has better behavior - looking at downloader activity instead of using round-robin can speed up crawling significantly in many practical cases.

amonroe commented 5 years ago

Thanks @kmike. I'll try that PR as well. So far #3160 has started to show some (not much) balancing across domains after having been blackholed on one for a day. Prior backlog is probably impacting results. No issues as of yet.

kmike commented 4 years ago

Closing it, as https://github.com/scrapy/scrapy/pull/3520 is merged. Thanks @tianhuil for getting the ball rolling!