sjdirect / abot

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Apache License 2.0

Are 304 responses properly handled? #235

Closed: replaysMike closed this issue 2 years ago

replaysMike commented 2 years ago

I've tried the following code as an example, but I always get a timeout from this particular URL (other URLs work fine). The requested URL returns a 304 Not Modified. I checked with curl, which retrieves the body content without problems, and there don't appear to be any redirects. The PageCrawlDisallowed event is triggered with the reason "Page has no content", but the PageCrawlCompleted event shows a failure/timeout while trying to read the body content.

e.CrawledPage.HttpRequestException: exception occurred with the originating request: 'Request timeout occurred The request was canceled due to the configured HttpClient.Timeout of 60 seconds elapsing. The operation was canceled

// .NET 6 console app (top-level statements)
using Abot2.Crawler;
using Abot2.Poco;

var config = new CrawlConfiguration
{
  HttpRequestTimeoutInSeconds = 60,
  CrawlTimeoutSeconds = 60,
  MaxConcurrentThreads = 10,
};

using var crawler = new PoliteWebCrawler(config);
crawler.PageCrawlCompleted += PageCrawlCompleted;
crawler.PageCrawlDisallowed += PageCrawlDisallowed;
var uri = new Uri("https://www.ti.com/amplifier-circuit/current-sense/analog-output/overview.html");
var crawlResult = await crawler.CrawlAsync(uri);

// Fires once per page, whether or not the request succeeded
void PageCrawlCompleted(object? sender, PageCrawlCompletedArgs e)
{
  var httpStatus = e.CrawledPage.HttpResponseMessage?.StatusCode;
  var rawPageText = e.CrawledPage.Content?.Text;
  // e.CrawledPage.HttpResponseMessage is null, the HttpRequestException is set
}

// Fires when the crawler decides not to crawl a page
void PageCrawlDisallowed(object? sender, PageCrawlDisallowedArgs e)
{
  var reason = e.DisallowedReason; // reported as: Page has no content
}

Is this a bug with processing certain URLs or am I missing something silly?

replaysMike commented 2 years ago

Problem solved; this is not an issue with the library. The website must apply some special blocking to the UserAgent I was using, which was Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html). It simply black-holes the connection.
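For anyone hitting the same timeout, here is a minimal sketch of the workaround: set a different user agent on the crawl configuration. It assumes the `UserAgentString` property on `CrawlConfiguration`. The user-agent value itself is a hypothetical placeholder, not something taken from this thread.

```csharp
using Abot2.Crawler;
using Abot2.Poco;

var config = new CrawlConfiguration
{
    HttpRequestTimeoutInSeconds = 60,
    CrawlTimeoutSeconds = 60,
    MaxConcurrentThreads = 10,
    // Hypothetical placeholder: identify your own crawler here rather than
    // impersonating Googlebot, which this site appeared to silently drop.
    UserAgentString = "MyCrawler/1.0 (+https://example.com/crawler-info)",
};

// Same crawl as in the original report, now sent with the new user agent
using var crawler = new PoliteWebCrawler(config);
var crawlResult = await crawler.CrawlAsync(
    new Uri("https://www.ti.com/amplifier-circuit/current-sense/analog-output/overview.html"));
```

If the request still times out with a different user agent, the block is probably keyed on something else (IP range, TLS fingerprint, missing headers), so this only addresses the cause identified above.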

sjdirect commented 2 years ago

Thanks for the follow up.