replaysMike closed this issue 2 years ago
Problem solved; this is not an issue with the library. The website must have some special blocking for the User-Agent I was using, Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html): it just black-holes the connection.
Thanks for the follow up.
I've tried the following code as an example, but I always get a timeout from this particular URL (other URLs I've tried work fine). The URL being requested returns a 304 Not Modified; I checked with curl, which gets the body content with no problem, and there don't appear to be any redirects. The PageCrawlDisallowed event is triggered with the reason "Page has no content", but the PageCrawlCompleted event shows a failure/timeout trying to read the body content. e.CrawledPage.HttpRequestException:
exception occurred with the originating request: 'Request timeout occurred The request was canceled due to the configured HttpClient.Timeout of 60 seconds elapsing. The operation was canceled
Is this a bug with processing certain URLs or am I missing something silly?
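Since the cause turned out to be User-Agent blocking, one way to check for this kind of behavior outside the crawler is to request the same URL with several User-Agent strings and compare the outcomes. The following is a minimal sketch using only the Python standard library, not the crawler library's own API; the function name probe_user_agents and its parameters are hypothetical, introduced here for illustration.

```python
import socket
import urllib.error
import urllib.request


def probe_user_agents(url, user_agents, timeout=10.0):
    """Request `url` once per User-Agent string and record the outcome.

    Hypothetical helper: returns a dict mapping each User-Agent to
    "HTTP <status>", "timeout", or "error: <reason>".
    """
    results = {}
    for ua in user_agents:
        request = urllib.request.Request(url, headers={"User-Agent": ua})
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                results[ua] = f"HTTP {response.status}"
        except urllib.error.HTTPError as exc:
            # The server answered, but with an error status (4xx/5xx).
            results[ua] = f"HTTP {exc.code}"
        except (TimeoutError, socket.timeout):
            # No response arrived in time while waiting for the reply.
            results[ua] = "timeout"
        except urllib.error.URLError as exc:
            # Timeouts while connecting or sending are wrapped in URLError.
            if isinstance(exc.reason, (TimeoutError, socket.timeout)):
                results[ua] = "timeout"
            else:
                results[ua] = f"error: {exc.reason}"
        except OSError as exc:
            # For example, a connection reset while reading the response.
            results[ua] = f"error: {exc}"
    return results
```

Calling it with the Googlebot string from this thread plus a browser-like string, against the URL that times out, would show whether only one of them hangs. If one User-Agent reports "timeout" while another gets a normal status, the server is probably treating that User-Agent differently and the crawler itself is unlikely to be at fault.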