Unspecified exception in System.Collections.Concurrent.ConcurrentDictionary

laurentprudhon commented 5 years ago

An exception is sometimes thrown at:

at System.Collections.Concurrent.ConcurrentDictionary`2.System.Collections.Generic.IDictionary<TKey,TValue>.Add(TKey key, TValue value)
at nlptextdoc.extract.html.WebsiteTextExtractor.WebCrawler_ShouldCrawlPageLinks(CrawledPage crawledPage, CrawlContext crawlContext)

Example websites that reproduce the bug:

https://www.probtp.com/ - after 91 pages
https://www.arkea.com/ - after 502 pages, 526 pages
https://www.lesechos.fr/finance-marches/ - after 7874 pages, 9804 pages, 17664 pages ...

laurentprudhon commented 5 years ago

Failed to reproduce. Leaving this open: the new and improved exception log files may give more clues.

laurentprudhon commented 5 years ago

"http://bourse.latribune.fr/"

Time | Pages | Errors | Unique | Download | Disk | Parsing | Convert
2:15:00 | 11096 | 498 | 9 % | 4073,0 Mb | 147,8 Mb | 12:38:00 | 2:34:51

Extraction stopped because the % of new textblocks fell below 10%

The key already existed in the dictionary.
at System.Collections.Concurrent.ConcurrentDictionary`2.System.Collections.Generic.IDictionary<TKey,TValue>.Add(TKey key, TValue value)
at nlptextdoc.extract.html.WebsiteTextExtractor.WebCrawler_ShouldCrawlPageLinks(CrawledPage crawledPage, CrawlContext crawlContext) in C:\Users\laure\OneDrive\Dev\C#\nlptextdoc\nlptextdoc.extract\html\WebsiteTextExtractor.cs:line 195

laurentprudhon commented 5 years ago

Analysis :

A page is opened with an http://... URL.
This page contains absolute links in https:// and relative links (resolved to http://) that point to the same page.
Abot tries to crawl the same URL twice: once with https and once with http.
Both pages are crawled at almost the same time.
Because there is no lock around the dictionary access, both crawls can try to insert the response into the cache, and the second Add throws "The key already existed in the dictionary."
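The last step of this analysis can be reproduced without any threads. The following is a minimal sketch, not the nlptextdoc code: the key "page" and the string value type are placeholders, since the real cache types are not shown in this thread. It only relies on the fact that the explicit IDictionary<TKey,TValue>.Add implementation on ConcurrentDictionary throws an ArgumentException when the key is already present.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class DuplicateAddDemo
{
    // Returns true if adding the same key a second time throws ArgumentException.
    public static bool SecondAddThrows()
    {
        // Placeholder cache: the key and value types used by nlptextdoc are assumptions here.
        // Add is only reachable through the IDictionary interface, as in the stack trace.
        IDictionary<string, string> cache = new ConcurrentDictionary<string, string>();

        cache.Add("page", "response from the first crawl");
        try
        {
            // Stands in for the second crawl of the same page reaching the cache.
            cache.Add("page", "response from the second crawl");
            return false;
        }
        catch (ArgumentException)
        {
            // "The key already existed in the dictionary."
            return true;
        }
    }

    public static void Main()
    {
        Console.WriteLine(SecondAddThrows());
    }
}
```

In the crawler the two Add calls come from two threads, so which crawl "wins" is not deterministic, which would be consistent with the failure appearing after a different number of pages on each run.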

This exception is in fact harmless: the net result is that the page is not analyzed twice, which is the desired behavior.

Solution: ignore this exception in the exception handler.
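A possible shape for this, as a hedged sketch: the class, method names, and string-based cache below are hypothetical and do not come from WebsiteTextExtractor.cs. The first method keeps the Add call and swallows only the duplicate-key exception; the second shows TryAdd, a standard ConcurrentDictionary method that reports a duplicate through its return value instead of throwing, as one alternative to consider.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class ResponseCacheSketch
{
    // Hypothetical cache keyed by URL; the real types are not shown in this thread.
    private static readonly ConcurrentDictionary<string, string> Cache =
        new ConcurrentDictionary<string, string>();

    // Option 1: keep Add and ignore the duplicate-key exception.
    // Returns true if this call stored the response.
    // Note: ArgumentNullException derives from ArgumentException,
    // so a null url would be swallowed here as well.
    public static bool AddIgnoringDuplicate(string url, string response)
    {
        try
        {
            ((IDictionary<string, string>)Cache).Add(url, response);
            return true;
        }
        catch (ArgumentException)
        {
            // Another crawl already cached this URL; nothing else to do.
            return false;
        }
    }

    // Option 2: TryAdd returns false instead of throwing when the key exists.
    public static bool AddIfAbsent(string url, string response)
    {
        return Cache.TryAdd(url, response);
    }

    public static void Main()
    {
        Console.WriteLine(AddIgnoringDuplicate("page-a", "first"));
        Console.WriteLine(AddIgnoringDuplicate("page-a", "second"));
        Console.WriteLine(AddIfAbsent("page-b", "first"));
        Console.WriteLine(AddIfAbsent("page-b", "second"));
    }
}
```

Either way the first response stays in the cache and the later one is dropped, which matches the outcome described above: the page is not analyzed twice.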

laurentprudhon / nlptextdoc

Unspecified exception in System.Collections.Concurrent.ConcurrentDictionary #13