opensangja / abot

Automatically exported from code.google.com/p/abot
Apache License 2.0

Very slow #121

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I'm having trouble using Abot, simply because it is so slow. I wrote a quick 
crawler in about an hour, and although it wasn't efficient, it was able to get 
50-100 pages/second, versus Abot's 1 or 2. I've played around with the config, 
but I'm unable to figure out why it's so slow!

Original issue reported on code.google.com by P...@stephendownward.ca on 18 Dec 2013 at 12:31

GoogleCodeExporter commented 9 years ago
Hi,

Are you running the Abot.Demo application? If not, please send me your
config file. If you are, this is most likely your problem...

The demo project has a few config values set that greatly limit Abot's
speed. This is to make sure you don't get banned by your ISP or get
blocked by the sites you are crawling. These settings are...

<abot>
    <politeness
      ...(excluded)
      minCrawlDelayPerDomainMilliSeconds="1000"
      ...(excluded)
      />
</abot>

Change it to...

<abot>
    <politeness
      ...(excluded)
      minCrawlDelayPerDomainMilliSeconds="0"
      ...(excluded)
      />
</abot>

This tells Abot not to wait between crawl requests to the same domain.
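
As a rough illustration of why the demo value caps throughput, here is a minimal sketch of the arithmetic. It assumes all requests go to a single domain and are spaced at least the configured delay apart, and it ignores download time. The class and method names are hypothetical and are not part of Abot.

```csharp
using System;

public static class PolitenessMath
{
    // Upper bound on pages per second for ONE domain when consecutive
    // requests must be at least delayMs apart (download time ignored).
    // Returns positive infinity for a zero delay, meaning the delay is
    // no longer the limiting factor.
    public static double MaxPagesPerSecondPerDomain(int delayMs)
    {
        if (delayMs < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(delayMs));
        }
        return delayMs == 0 ? double.PositiveInfinity : 1000.0 / delayMs;
    }

    public static void Main()
    {
        // Demo setting: at most about one page per second per domain.
        Console.WriteLine(MaxPagesPerSecondPerDomain(1000));
        // A 500 ms delay would allow at most about two pages per second.
        Console.WriteLine(MaxPagesPerSecondPerDomain(500));
    }
}
```

Under these assumptions a 1000 ms delay allows at most about one page per second from a single site, which is in line with the 1 or 2 pages/second reported above.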

Original comment by sjdir...@gmail.com on 18 Dec 2013 at 1:44

GoogleCodeExporter commented 9 years ago
Here is what I have.

            CrawlConfiguration crawlConfig = new CrawlConfiguration();
            crawlConfig.CrawlTimeoutSeconds = 100;
            crawlConfig.MaxConcurrentThreads = 10;
            crawlConfig.MaxPagesToCrawl = 1000;
            crawlConfig.UserAgentString = "Test";
            crawlConfig.MinCrawlDelayPerDomainMilliSeconds = 0;

Original comment by P...@stephendownward.ca on 18 Dec 2013 at 9:45

GoogleCodeExporter commented 9 years ago
On v1.1.1 I updated the Abot.Demo.Program.cs file's GetDefaultWebCrawler() to 
match what you have above; however, I don't see any slowness. It's crawling 
50-100 pages per second. See attached log file. 

        private static IWebCrawler GetDefaultWebCrawler()
        {
            CrawlConfiguration crawlConfig = new CrawlConfiguration();
            crawlConfig.CrawlTimeoutSeconds = 100;
            crawlConfig.MaxConcurrentThreads = 10;
            crawlConfig.MaxPagesToCrawl = 1000;
            crawlConfig.UserAgentString = "Test";
            crawlConfig.MinCrawlDelayPerDomainMilliSeconds = 0;
            return new PoliteWebCrawler(crawlConfig, null, null, null, null, null, null, null, null);
        }

Can you do a fresh checkout of v1.1.1, overwrite the Abot.Demo.Program.cs file 
with the one attached, and then give it a run?

Original comment by sjdir...@gmail.com on 18 Dec 2013 at 9:27

GoogleCodeExporter commented 9 years ago
Okay, the problem was that I was crawling a really slow website. However, when I 
crawl apple.com, it starts out fast but slows down a lot by the 800th page.

Original comment by P...@stephendownward.ca on 19 Dec 2013 at 10:37

GoogleCodeExporter commented 9 years ago

Original comment by sjdir...@gmail.com on 30 Dec 2013 at 3:13

GoogleCodeExporter commented 9 years ago
I've encountered something similar, where the crawl starts off quickly but then 
becomes about 10x slower at around 1K pages or so. I'm troubleshooting 
this now but saw this issue and it sounded similar. I was crawling 
www.seriouseats.com and ipython.org when I encountered this. Does anyone have 
any additional info on why this may be happening? I'm not sure at this point 
whether it is the crawler itself or some rate limiting that is being applied by 
the target. 

Original comment by b...@luceomedia.com on 31 Jul 2014 at 3:05

GoogleCodeExporter commented 9 years ago
Hi, 

It's very likely that the site is throttling you or is being overwhelmed.

A few things to try:

1: Run Fiddler and monitor the time it takes for that site to respond to 
individual requests. 
2: Open a browser on the same machine while the crawl is running slowly and 
request some of the urls that are taking a long time. If the browser is taking 
forever to pull up the page, then it's the server, not Abot.
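
If Fiddler is not at hand, the same check can be approximated with a small standalone program. The following is a minimal sketch, not part of Abot: it uses only the standard `HttpClient` and `Stopwatch` classes, and the class and method names are hypothetical. It fetches each URL given on the command line and prints how long the full response took.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

public static class RequestTimer
{
    private static readonly HttpClient Client = new HttpClient();

    // Fetches one absolute URL and returns the wall-clock time in milliseconds.
    // GetAsync buffers the whole response body by default, so the download
    // is included in the measurement.
    public static async Task<long> TimeRequestAsync(string url)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        using (HttpResponseMessage response = await Client.GetAsync(url))
        {
            return stopwatch.ElapsedMilliseconds;
        }
    }

    // Median of the collected timings; less sensitive to one-off spikes than the mean.
    public static double Median(IEnumerable<long> timingsMs)
    {
        long[] sorted = timingsMs.OrderBy(t => t).ToArray();
        if (sorted.Length == 0)
        {
            throw new ArgumentException("No timings supplied.", nameof(timingsMs));
        }
        int mid = sorted.Length / 2;
        return sorted.Length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Usage: pass the slow URLs as command-line arguments.
    public static async Task Main(string[] args)
    {
        var timings = new List<long>();
        foreach (string url in args)
        {
            try
            {
                long elapsedMs = await TimeRequestAsync(url);
                timings.Add(elapsedMs);
                Console.WriteLine($"{elapsedMs,6} ms  {url}");
            }
            catch (Exception ex) when (ex is HttpRequestException || ex is TaskCanceledException)
            {
                // Network failures and timeouts are reported, not fatal.
                Console.WriteLine($"failed     {url} ({ex.Message})");
            }
        }
        if (timings.Count > 0)
        {
            Console.WriteLine($"median: {Median(timings)} ms");
        }
    }
}
```

Running it once near the start of a crawl and again once the crawl has slowed down gives two medians to compare. If the second is much higher, the server is the more likely cause.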

Hope that helps...
Steven

Original comment by sjdir...@gmail.com on 31 Jul 2014 at 4:39