laurentprudhon / nlptextdoc

Suite of tools to extract and annotate language resources for NLP applications

Unspecified exception in System.Collections.Concurrent.ConcurrentDictionary #13

Closed laurentprudhon closed 5 years ago

laurentprudhon commented 5 years ago

An exception is sometimes thrown in:

at System.Collections.Concurrent.ConcurrentDictionary`2.System.Collections.Generic.IDictionary<TKey,TValue>.Add(TKey key, TValue value)
at nlptextdoc.extract.html.WebsiteTextExtractor.WebCrawler_ShouldCrawlPageLinks(CrawledPage crawledPage, CrawlContext crawlContext)

Example websites that reproduce the bug:

https://www.probtp.com/ - after 91 pages
https://www.arkea.com/ - after 502 pages, 526 pages
https://www.lesechos.fr/finance-marches/ - after 7874 pages, 9804 pages, 17664 pages ...

laurentprudhon commented 5 years ago

Failed to reproduce. Leaving this open: the new and improved exception log files may give more clues.

laurentprudhon commented 5 years ago

"http://bourse.latribune.fr/"

Time | Pages | Errors | Unique | Download | Disk | Parsing | Convert
2:15:00 | 11096 | 498 | 9 % | 4073,0 Mb | 147,8 Mb | 12:38:00 | 2:34:51

Extraction stopped because the % of new textblocks fell below 10%

Error while parsing the page https://www.latribune.fr/opinions/tribunes/due-diligence-les-enseignements-de-l-affaire-sonepar-824781.html
Error while parsing the page https://marseille.latribune.fr/entreprises-finance/2019-08-26/comment-portsynergy-projects-s-est-impose-a-fos-sur-le-conteneur-826416.html
Error while parsing the page https://www.latribune.fr/technos-medias/publicite/banque-populaire-30-ans-de-sponsoring-dans-la-voile-825445.html
Error while parsing the page https://www.latribune.fr/technos-medias/publicite/charal-un-partenariat-jeune-ou-tout-est-a-construire-825442.html
Error while parsing the page https://www.latribune.fr/technos-medias/publicite/sodebo-en-course-pour-la-preference-de-marque-825441.html
Error while parsing the page https://www.latribune.fr/economie/france/jo-2024-paris-decroche-enfin-l-or-750078.html
Error while parsing the page https://www.latribune.fr/mon-compte/inscription
Error while parsing the page https://www.latribune.fr/mon-compte/inscription
Error while parsing the page https://acteursdeleconomie.latribune.fr/mon-compte/inscription
Error while parsing the page https://www.latribune.fr/finance-patrimoine-investir.html

The key already existed in the dictionary.
at System.Collections.Concurrent.ConcurrentDictionary`2.System.Collections.Generic.IDictionary<TKey,TValue>.Add(TKey key, TValue value)
at nlptextdoc.extract.html.WebsiteTextExtractor.WebCrawler_ShouldCrawlPageLinks(CrawledPage crawledPage, CrawlContext crawlContext) in C:\Users\laure\OneDrive\Dev\C#\nlptextdoc\nlptextdoc.extract\html\WebsiteTextExtractor.cs:line 195

laurentprudhon commented 5 years ago

Analysis:

This exception is in fact harmless: it only signals that the key was already present in the dictionary, so the net result is that the page is not analyzed a second time, which is the desired behavior.

Solution: just ignore this exception in the exception handler.
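
For illustration only, here is a minimal sketch of what ignoring the duplicate-key case could look like. It does not reproduce the project's code: the `seenPages` dictionary, the `DuplicateKeyDemo` class and both method names are hypothetical stand-ins. The first method catches the `ArgumentException` raised by the explicit `IDictionary<TKey,TValue>.Add` implementation seen in the stack trace. The second shows a different technique, `ConcurrentDictionary.TryAdd`, which returns `false` on a duplicate key instead of throwing.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class DuplicateKeyDemo
{
    // Hypothetical stand-in for the crawler's registry of pages; not the project's actual field.
    private static readonly ConcurrentDictionary<string, bool> seenPages =
        new ConcurrentDictionary<string, bool>();

    // Uses the same call as in the stack trace: the explicit IDictionary.Add
    // throws ArgumentException when the key already exists.
    public static bool AddIgnoringDuplicate(string url)
    {
        try
        {
            ((IDictionary<string, bool>)seenPages).Add(url, true);
            return true;
        }
        catch (ArgumentException)
        {
            // Duplicate key: the page is already registered, so there is nothing to do.
            return false;
        }
    }

    // Alternative technique: TryAdd reports a duplicate through its return value.
    public static bool AddWithTryAdd(string url)
    {
        return seenPages.TryAdd(url, true);
    }

    public static void Main()
    {
        Console.WriteLine(AddIgnoringDuplicate("https://www.probtp.com/")); // True
        Console.WriteLine(AddIgnoringDuplicate("https://www.probtp.com/")); // False
        Console.WriteLine(AddWithTryAdd("https://www.arkea.com/"));         // True
        Console.WriteLine(AddWithTryAdd("https://www.arkea.com/"));         // False
    }
}
```

Either variant leaves the dictionary unchanged on a duplicate. The `TryAdd` form has the side benefit of keeping the expected duplicate case out of the exception log files.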