sjdirect / abot

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Apache License 2.0

AccessViolationException thrown while crawling #188

Closed. SchwarzChristian closed this issue 6 years ago

SchwarzChristian commented 6 years ago

Hi,

while crawling https://www.ikk-classic.de/ the following exception is thrown and the process crashes immediately:

An unhandled exception of type 'System.AccessViolationException' occurred in System.dll
   at System.Net.UnsafeNclNativeMethods.OSSOCK.recv(IntPtr socketHandle, Byte* pinnedBuffer, Int32 len, SocketFlags socketFlags)
   at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags, SocketError& errorCode)
   at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   at System.Net.FixedSizeReader.ReadPacket(Byte[] buffer, Int32 offset, Int32 count)
   at System.Net.Security._SslStream.StartFrameBody(Int32 readBytes, Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security._SslStream.StartFrameHeader(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security._SslStream.StartReading(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security._SslStream.ProcessRead(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.TlsStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   at System.Net.PooledStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   at System.Net.Connection.SyncRead(HttpWebRequest request, Boolean userRetrievedStream, Boolean probeRead)
   at System.Net.ConnectStream.ProcessWriteCallDone(ConnectionReturnResult returnResult)
   at System.Net.ConnectStream.CallDone(ConnectionReturnResult returnResult)
   at System.Net.ConnectStream.CloseInternal(Boolean internalCall, Boolean aborting)
   at System.Net.ConnectStream.System.Net.ICloseEx.CloseEx(CloseExState closeState)
   at System.Net.HttpWebRequest.EndWriteHeaders_Part2()
   at System.Net.HttpWebRequest.EndWriteHeaders(Boolean async)
   at System.Net.HttpWebRequest.WriteHeadersCallback(WebExceptionStatus errorStatus, ConnectStream stream, Boolean async)
   at System.Net.ConnectStream.WriteHeaders(Boolean async)
   at System.Net.HttpWebRequest.EndSubmitRequest()
   at System.Net.Connection.CompleteConnection(Boolean async, HttpWebRequest request)
   at System.Net.Connection.CompleteStartConnection(Boolean async, HttpWebRequest httpWebRequest)
   at System.Net.Connection.CompleteStartRequest(Boolean onSubmitThread, HttpWebRequest request, TriState needReConnect)
   at System.Net.Connection.SubmitRequest(HttpWebRequest request, Boolean forcedsubmit)
   at System.Net.ServicePoint.SubmitRequest(HttpWebRequest request, String connName)
   at System.Net.HttpWebRequest.SubmitRequest(ServicePoint servicePoint)
   at System.Net.HttpWebRequest.GetResponse()
   at Abot.Core.PageRequester.MakeRequest(Uri uri, Func`2 shouldDownloadContent) in D:\Documents\Projects\abot\Abot\Core\PageRequester.cs:line 87
   at Abot.Crawler.WebCrawler.CrawlThePage(PageToCrawl pageToCrawl) in D:\Documents\Projects\abot\Abot\Crawler\WebCrawler.cs:line 884
   at AbotX.Crawler.CrawlerX.CrawlThePage(PageToCrawl pageToCrawl) in D:\Documents\Projects\abotx\AbotX\Crawler\CrawlerX.cs:line 206
   at Abot.Crawler.WebCrawler.ProcessPage(PageToCrawl pageToCrawl) in D:\Documents\Projects\abot\Abot\Crawler\WebCrawler.cs:line 671
   at Abot.Crawler.WebCrawler.<CrawlSite>b__68_0() in D:\Documents\Projects\abot\Abot\Crawler\WebCrawler.cs:line 539
   at Abot.Util.ThreadManager.RunAction(Action action, Boolean decrementRunningThreadCountOnCompletion) in D:\Documents\Projects\abot\Abot\Util\ThreadManager.cs:line 113
   at Abot.Util.TaskThreadManager.<>c__DisplayClass5_0.<RunActionOnDedicatedThread>b__0() in D:\Documents\Projects\abot\Abot\Util\TaskThreadManager.cs:line 43
   at System.Threading.Tasks.Task.Execute()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot)
   at System.Threading.Tasks.Task.ExecuteEntry(Boolean bPreventDoubleExecution)
   at System.Threading.ThreadPoolWorkQueue.Dispatch()

The problem occurs after some time, depending on the politeness settings. Based on the stack trace and a Wireshark capture, the exception seems to be triggered by invalid packets sent by the crawled web server. Would it be possible to handle this kind of exception?

sjdirect commented 6 years ago

Google is your friend on this one... "handling System.AccessViolationException" should give you some options.

SchwarzChristian commented 6 years ago

I did, but none of them work in this case, because the exception is thrown in a thread started by Abot/AbotX. I don't want to catch corrupted state exceptions globally. To catch them locally, I need to annotate the catching method. Since none of my methods is in the stack trace, I see no way to do so.
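For reference, annotating the catching method could look roughly like the sketch below. It assumes .NET Framework 4.x, where HandleProcessCorruptedStateExceptionsAttribute is honored (.NET Core ignores it). The names CorruptedStateGuard and TryRun are hypothetical and not part of Abot.

```csharp
using System;
using System.Runtime.ExceptionServices;
using System.Security;

// Hypothetical helper, not part of Abot.
public static class CorruptedStateGuard
{
    // Runs the action and reports whether it completed.
    // On .NET Framework 4.x the attribute below lets this catch block see an
    // AccessViolationException raised by native code; without it the CLR
    // skips managed handlers and the process terminates.
    [HandleProcessCorruptedStateExceptions]
    [SecurityCritical]
    public static bool TryRun(Action action, Action<Exception> onFailure)
    {
        try
        {
            action();
            return true;
        }
        catch (AccessViolationException ex)
        {
            // Hand the exception to the caller instead of crashing.
            onFailure?.Invoke(ex);
            return false;
        }
    }
}
```

The catch only helps if TryRun is on the call stack of the thread that raises the exception, which is exactly the difficulty described above. Keep in mind that process state may really be corrupted after such an exception, so logging and skipping the page is probably the most that is safe to do.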

sjdirect commented 6 years ago

You can take a look at the Abot.Util.TaskThreadManager.cs implementation. There is some basic exception handling happening there that you might be able to alter to do what you are trying to do. If you extend/override it and plug in your own implementation of IThreadManager, you can hook into it.
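A minimal sketch of that idea, under stated assumptions: IWorkScheduler below is a local stand-in with a single DoWork(Action) member, not Abot's real IThreadManager, whose full member list should be checked against the Abot source. GuardedScheduler and InlineScheduler are hypothetical names. The point is only that wrapping the action before it is handed to the worker thread puts an annotated method on that thread's call stack.

```csharp
using System;
using System.Runtime.ExceptionServices;
using System.Security;

// Stand-in for the one member this sketch needs; NOT Abot's real interface.
public interface IWorkScheduler
{
    void DoWork(Action action);
}

// Runs work inline; present only to keep the sketch self-contained.
public sealed class InlineScheduler : IWorkScheduler
{
    public void DoWork(Action action) => action();
}

// Decorator that guards every action passed to the inner scheduler.
public sealed class GuardedScheduler : IWorkScheduler
{
    private readonly IWorkScheduler _inner;
    private readonly Action<Exception> _onCrash;

    public GuardedScheduler(IWorkScheduler inner, Action<Exception> onCrash)
    {
        _inner = inner ?? throw new ArgumentNullException(nameof(inner));
        _onCrash = onCrash ?? throw new ArgumentNullException(nameof(onCrash));
    }

    // The inner scheduler receives a wrapped action, so RunGuarded executes
    // on whichever thread the inner scheduler chooses.
    public void DoWork(Action action) => _inner.DoWork(() => RunGuarded(action));

    // Honored on .NET Framework 4.x only; .NET Core ignores the attribute.
    [HandleProcessCorruptedStateExceptions]
    [SecurityCritical]
    private void RunGuarded(Action action)
    {
        try
        {
            action();
        }
        catch (AccessViolationException ex)
        {
            _onCrash(ex);
        }
    }
}
```

With Abot itself, the equivalent would be a class that implements or derives from the real thread manager types and is passed to the crawler in place of the default one; how exactly that is wired up depends on the Abot version in use.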

SchwarzChristian commented 6 years ago

I'll take a look at this next week. Thank you for your help.