sjdirect / abotx

Cross Platform C# Web crawler framework, headless browser, parallel crawler. Please star this project! +1.
https://abotx.org

Parallel engine not working #22

Closed sbonello closed 4 years ago

sbonello commented 4 years ago

I am trying to test the parallel engine, but it is not working: it returns after the first page crawl. I am testing with a license. I would consider upgrading if it works.

sjdirect commented 4 years ago

Hi, can you share which version of AbotX you are using and post some initialization/configuration code and/or some log files (set to level debug)?
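A minimal sketch of turning on debug-level logging, assuming the application logs through Serilog (as the Abot 2.x samples do) and references the `Serilog` and `Serilog.Sinks.Console` packages. The class and method names here are hypothetical and not part of AbotX.

```csharp
using Serilog;

public static class LoggingSetup
{
    // Hypothetical helper: call once at startup, before creating the crawl engine.
    public static void ConfigureDebugLogging()
    {
        // Assumption: Serilog is the logging backend in use.
        Log.Logger = new LoggerConfiguration()
            .MinimumLevel.Debug()   // emit Debug-level events and above
            .WriteTo.Console()      // requires the Serilog.Sinks.Console package
            .CreateLogger();
    }
}
```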

sbonello commented 4 years ago

Hi,

I am using version 2.1.6. The crawler stops after the first page:

    private static CrawlConfigurationX GetSafeConfig()
    {
        /* The following settings will help not get your IP banned
           by the sites you are trying to crawl. The idea is to crawl
           at most 100 pages and wait 2 seconds between HTTP requests.
         */
        return new CrawlConfigurationX
        {
            MaxPagesToCrawl = 100,
            MinCrawlDelayPerDomainMilliSeconds = 2000
        };
    }

    private static async Task DemoParallelCrawlerEngine()
    {
        var siteToCrawlProvider = new SiteToCrawlProvider();

        var config = GetSafeConfig();

        var crawlEngine = new ParallelCrawlerEngine(
            config);

        crawlEngine.Impls.SiteToCrawlProvider.AddSitesToCrawl(new List<SiteToCrawl>
        {
            new SiteToCrawl { Uri = new Uri("https://chubbydeveloper.com"), Id = Guid.NewGuid() },
        });

        await crawlEngine.StartAsync();
    }

sjdirect commented 4 years ago

Try initializing like the following and see if that helps...

var crawlEngine = new ParallelCrawlerEngine( config, new ParallelImplementationOverride(config) { SiteToCrawlProvider = siteToCrawlProvider; });

    var crawlEngine = new ParallelCrawlerEngine(
        config,
        new ParallelImplementationOverride(
            config,
            new ParallelImplementationContainer
            {
                SiteToCrawlProvider = siteToCrawlProvider,
                WebCrawlerFactory = new WebCrawlerFactory(config) // Same config will be used for every crawler
            }));

sbonello commented 4 years ago

Worse, it didn't even crawl one site.

Output:

    [2020-04-27 23:27:56:484 +02:00] [1] [INF] - Started [1] thread for monitoring
    [2020-04-27 23:27:56:486 +02:00] [7] [INF] - Engine is still running...
    [2020-04-27 23:27:56:518 +02:00] [1] [INF] - Started [1] thread running ISiteToCrawlProducer of type [AbotX2.Parallel.SiteToCrawlProducer]
    [2020-04-27 23:27:56:525 +02:00] [8] [INF] - Retrieving up to [5] sites to crawl
    [2020-04-27 23:27:56:529 +02:00] [1] [INF] - Started [2] threads running ISiteToCrawlConsumer of type [AbotX2.Parallel.SiteToCrawlConsumer]
    [2020-04-27 23:27:56:536 +02:00] [8] [INF] - Retrieved [0] sites to crawl
    [2020-04-27 23:27:56:540 +02:00] [8] [INF] - ISiteToCrawlProvider [AbotX2.Parallel.SiteToCrawlProvider] is reporting that it is complete. Will not make anymore requests for sites to crawl.
    [2020-04-27 23:27:57:674 +02:00] [13] [INF] - All ISiteToCrawlConsumer threads have completed
    [2020-04-27 23:27:57:680 +02:00] [11] [INF] - All ISiteToCrawlProducer threads have completed
    [2020-04-27 23:27:57:684 +02:00] [13] [INF] - All crawls have completed

sjdirect commented 4 years ago

So there are 2 issues here, and neither is your fault.

1) The docs and my example suggestion above used incorrect syntax. I updated the docs and that comment above. Notice that ParallelImplementationContainer is passed as a parameter to ParallelImplementationOverride.

2) There was a bug in the code that should now be fixed in version 2.1.7.

Let me know if that fixes your issues.
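For reference, a minimal end-to-end sketch that combines the configuration posted earlier in this thread with the corrected initialization. It assumes AbotX 2.1.7 or later. The `using` directives for the AbotX namespaces are assumptions (only `AbotX2.Parallel` is visible in the log output above), the `ParallelCrawlDemo` class name is hypothetical, and queuing the site on the provider before passing it to the engine is a choice made here for illustration rather than something this thread states is required.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using AbotX2.Parallel; // assumption: namespace inferred from the type names in the log output
using AbotX2.Poco;     // assumption: namespace for CrawlConfigurationX and SiteToCrawl

public static class ParallelCrawlDemo // hypothetical class name
{
    public static async Task Main()
    {
        // Same settings as GetSafeConfig() earlier in the thread.
        var config = new CrawlConfigurationX
        {
            MaxPagesToCrawl = 100,
            MinCrawlDelayPerDomainMilliSeconds = 2000
        };

        // Queue the site on the same provider instance that the engine will use.
        var siteToCrawlProvider = new SiteToCrawlProvider();
        siteToCrawlProvider.AddSitesToCrawl(new List<SiteToCrawl>
        {
            new SiteToCrawl { Uri = new Uri("https://chubbydeveloper.com"), Id = Guid.NewGuid() }
        });

        // The container is passed as the second argument to ParallelImplementationOverride.
        var crawlEngine = new ParallelCrawlerEngine(
            config,
            new ParallelImplementationOverride(
                config,
                new ParallelImplementationContainer
                {
                    SiteToCrawlProvider = siteToCrawlProvider,
                    WebCrawlerFactory = new WebCrawlerFactory(config) // same config for every crawler
                }));

        await crawlEngine.StartAsync();
    }
}
```

If the log still reports `Retrieved [0] sites to crawl`, a first thing to check is whether the sites were added to the same provider instance that was placed in the container.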

sbonello commented 4 years ago

Thanks

Will test as soon as the new version is out

Simon

sjdirect commented 4 years ago

It's out there whenever you are ready.

sjdirect commented 4 years ago

Closing issue. Please reopen issue if problem persists.