Closed sbonello closed 4 years ago
Hi, can you share what version of AbotX you are using and post some initialization/configuration code and/or some log files (set to debug level)?
Hi,
I am using version 2.1.6. The crawler stops after the first page:
private static CrawlConfigurationX GetSafeConfig()
{
    /* The following settings will help not get your ip banned
       by the sites you are trying to crawl. The idea is to crawl
       only 5 pages and wait 2 seconds between http requests
    */
    return new CrawlConfigurationX
    {
        MaxPagesToCrawl = 100,
        MinCrawlDelayPerDomainMilliSeconds = 2000
    };
}

private static async Task DemoParallelCrawlerEngine()
{
    var siteToCrawlProvider = new SiteToCrawlProvider();
    var config = GetSafeConfig();
    var crawlEngine = new ParallelCrawlerEngine(
        config);
    crawlEngine.Impls.SiteToCrawlProvider.AddSitesToCrawl(new List<SiteToCrawl>
    {
        new SiteToCrawl{ Uri = new Uri("https://chubbydeveloper.com"), Id = Guid.NewGuid() },
    });
    await crawlEngine.StartAsync();
}
Try initializing like the following and see if that helps...
var crawlEngine = new ParallelCrawlerEngine( config, new ParallelImplementationOverride(config) { SiteToCrawlProvider = siteToCrawlProvider; });
var crawlEngine = new ParallelCrawlerEngine(
    config,
    new ParallelImplementationOverride(config,
        new ParallelImplementationContainer()
        {
            SiteToCrawlProvider = siteToCrawlProvider,
            WebCrawlerFactory = new WebCrawlerFactory(config) // Same config will be used for every crawler
        })
    );
That made it worse: it didn't crawl even one site.
Output:
[2020-04-27 23:27:56:484 +02:00] [1] [INF] - Started [1] thread for monitoring
[2020-04-27 23:27:56:486 +02:00] [7] [INF] - Engine is still running...
[2020-04-27 23:27:56:518 +02:00] [1] [INF] - Started [1] thread running ISiteToCrawlProducer of type [AbotX2.Parallel.SiteToCrawlProducer]
[2020-04-27 23:27:56:525 +02:00] [8] [INF] - Retrieving up to [5] sites to crawl
[2020-04-27 23:27:56:529 +02:00] [1] [INF] - Started [2] threads running ISiteToCrawlConsumer of type [AbotX2.Parallel.SiteToCrawlConsumer]
[2020-04-27 23:27:56:536 +02:00] [8] [INF] - Retrieved [0] sites to crawl
[2020-04-27 23:27:56:540 +02:00] [8] [INF] - ISiteToCrawlProvider [AbotX2.Parallel.SiteToCrawlProvider] is reporting that it is complete. Will not make anymore requests for sites to crawl.
[2020-04-27 23:27:57:674 +02:00] [13] [INF] - All ISiteToCrawlConsumer threads have completed
[2020-04-27 23:27:57:680 +02:00] [11] [INF] - All ISiteToCrawlProducer threads have completed
[2020-04-27 23:27:57:684 +02:00] [13] [INF] - All crawls have completed
So there are 2 issues here, neither of which is your fault.
1) The docs and my example suggestion above used incorrect syntax. I updated the docs and that comment above. Notice that ParallelImplementationContainer is passed as a parameter to ParallelImplementationOverride.
2) There was a bug in the code that should now be fixed in version 2.1.7.
Let me know if that fixes your issues.
Thanks
Will test as soon as the new version is out
Simon
It's out there whenever you are ready.
Closing issue. Please reopen issue if problem persists.
I am trying to test the parallel engine, but it is not working: it returns after crawling the first page. I am testing with a license. I would consider upgrading if it works.