sjdirect / abot

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Apache License 2.0

Question: how do HttpServicePointConnectionLimit and MaxConcurrentThreads interact? #178

Closed: benempson closed this issue 6 years ago

benempson commented 6 years ago

Hi there, I'm just trying to figure out these 2 properties. My interpretation is that if HttpServicePointConnectionLimit is less than MaxConcurrentThreads, then HttpServicePointConnectionLimit is going to be the limiting factor in the equation.

For example, if HttpServicePointConnectionLimit = 2 and MaxConcurrentThreads = 10 then only 2 concurrent requests are ever going to be made.

Conversely, if HttpServicePointConnectionLimit = 10 and MaxConcurrentThreads = 2 then again only 2 concurrent requests are ever going to be made, albeit for a different reason.

Is this correct? Is there any guidance about which setting to choose to rate limit a crawl?

sjdirect commented 6 years ago

My understanding is similar to yours. HttpServicePointConnectionLimit defines how many outbound internet connections are allowed, so it should be greater than MaxConcurrentThreads. One caveat to keep in mind is that page processing (parsing of links, running any rules/checks, analytics, etc.) also runs on the threads created for MaxConcurrentThreads, so the relationship between HttpServicePointConnectionLimit and MaxConcurrentThreads is a little blurred (i.e., the theoretical math won't always work).

Hope that helps,
Steven

sjdirect commented 6 years ago

Also, questions like these should go to the forum, since it's a better format for question/answer.

benempson commented 6 years ago

Thanks for the response, Steven. Sure, I'll go to the forum in future, sorry about that. Just to finish up here: from what you are saying, I think it's best to set both properties to the same value, i.e., if I only want a maximum of 2 connections to be made, then set both to 2. Do you agree with that?

sjdirect commented 6 years ago

I would say that HttpServicePointConnectionLimit should be GREATER than MaxConcurrentThreads. Otherwise, any requests beyond the actual crawl (for example the robots.txt check, or additional crawls if you are using AbotX ParallelCrawler) would be queued. I would rather configure it high and then let my Abot/AbotX config limit it further.
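
A minimal sketch of that advice, in case it helps. It assumes Abot 1.x, where `CrawlConfiguration` lives in the `Abot.Poco` namespace (newer versions may use a different namespace). The class name `CrawlConfigSketch` and both numbers are made up for illustration and are not from this thread.

```csharp
using Abot.Poco; // assumption: Abot 1.x namespace for CrawlConfiguration

// Hypothetical helper, for illustration only.
public static class CrawlConfigSketch
{
    public static CrawlConfiguration Build()
    {
        return new CrawlConfiguration
        {
            // The setting used to throttle the crawl itself:
            // at most 2 crawl threads (illustrative value).
            MaxConcurrentThreads = 2,

            // Kept well above MaxConcurrentThreads (illustrative value) so that
            // requests outside the page fetches, such as the robots.txt check,
            // are less likely to be queued behind them.
            HttpServicePointConnectionLimit = 200
        };
    }
}
```

With a setup like this, MaxConcurrentThreads is the value to lower or raise when tuning crawl concurrency, while the connection limit stays high enough not to become the bottleneck.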
