sjdirect / abot

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Apache License 2.0

Cancellation is not working properly and integration test is wrong #240

Open kpiekara opened 12 months ago

kpiekara commented 12 months ago

I believe this is the same issue as https://github.com/sjdirect/abot/issues/206, which was closed on the grounds that the "integration unittest is passing".

This unit test passes:

[Test]
public async Task Crawl_Synchronous_CancellationTokenCancelled_StopsCrawl()
{
    var cancellationTokenSource = new CancellationTokenSource();
    var timer = new System.Timers.Timer(800);
    timer.Elapsed += (o, e) =>
    {
        cancellationTokenSource.Cancel();
        timer.Stop();
        timer.Dispose();
    };
    timer.Start();

    var crawler = new PoliteWebCrawler();
    var result = await crawler.CrawlAsync(new Uri("https://github.com/"), cancellationTokenSource);

    Assert.IsTrue(result.ErrorOccurred);
    Assert.IsTrue(result.ErrorException is OperationCanceledException);
}

But if we increase the timer from 800 ms to 3 s, so that the crawler has actually started working when the cancellation fires:

[Test]
public async Task Crawl_Synchronous_CancellationTokenCancelled_StopsCrawl()
{
    var cancellationTokenSource = new CancellationTokenSource();
    var timer = new System.Timers.Timer(3000);
    timer.Elapsed += (o, e) =>
    {
        cancellationTokenSource.Cancel();
        timer.Stop();
        timer.Dispose();
    };
    timer.Start();

    var crawler = new PoliteWebCrawler();
    var result = await crawler.CrawlAsync(new Uri("https://github.com/"), cancellationTokenSource);

    Assert.IsTrue(result.ErrorOccurred);
    Assert.IsTrue(result.ErrorException is OperationCanceledException);
}

The test fails, and the exception goes unhandled, which crashes the application:

Exit code is -532462766 (Output is too long. Showing the last 100 lines:

   at System.Threading.CancellationToken.ThrowIfCancellationRequested()
   at Abot2.Crawler.WebCrawler.ThrowIfCancellationRequested()
   at Abot2.Crawler.WebCrawler.ProcessPage(PageToCrawl pageToCrawl)
   at Abot2.Crawler.WebCrawler.<CrawlSite>b__64_0()
   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__128_1(Object state)
   at System.Threading.QueueUserWorkItemCallback.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()

Issue: once the crawler has started working, there is no way to cancel it without an unhandled exception.
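
A note on the trace, offered as a guess rather than a confirmed diagnosis: the `Task.ThrowAsync` frame is consistent with an exception escaping an `async void` lambda (for example, an `async` lambda passed where an `Action` is expected). The runtime rethrows such an exception on a thread pool thread, where no caller can catch it. The sketch below uses only the .NET base class library to show the contrasting pattern, where the work returns a `Task` and the `OperationCanceledException` is caught by the code that awaits it. `CancellationSketch`, `ProcessPageAsync`, and `RunAsync` are hypothetical names for illustration and are not part of Abot.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CancellationSketch
{
    // Hypothetical stand-in for per-page work; this is NOT Abot's ProcessPage.
    private static async Task ProcessPageAsync(int page, CancellationToken token)
    {
        // Task.Delay throws TaskCanceledException (a subclass of
        // OperationCanceledException) once the token is cancelled.
        await Task.Delay(50, token);
        token.ThrowIfCancellationRequested();
    }

    // Because the work returns a Task, the exception is stored on that Task
    // and surfaces at the await below, where it can be handled.
    // By contrast, `Action work = async () => { ... }` compiles to an
    // async void lambda: an exception escaping it is rethrown outside the
    // caller's call stack and cannot be caught by a surrounding try/catch.
    public static async Task<bool> RunAsync(TimeSpan cancelAfter)
    {
        using var cts = new CancellationTokenSource(cancelAfter);
        try
        {
            for (var page = 0; page < 100; page++)
            {
                await ProcessPageAsync(page, cts.Token);
            }
            return false; // all pages processed, no cancellation observed
        }
        catch (OperationCanceledException)
        {
            return true; // cancellation observed and handled by the caller
        }
    }

    public static async Task Main()
    {
        var cancelled = await RunAsync(TimeSpan.FromMilliseconds(200));
        Console.WriteLine($"Cancelled cleanly: {cancelled}");
    }
}
```

Whether the crawler's internal work items can be restructured this way depends on how they are scheduled, which this sketch does not model.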

ynnob commented 10 months ago

Same here. The entire website crashes when the crawler is canceled.