laurentprudhon / nlptextdoc

Suite of tools to extract and annotate language resources for NLP applications

Invalid file path generated when URL contains percent-encoded characters or multiple dots #26

Closed: laurentprudhon closed this issue 5 years ago

laurentprudhon commented 5 years ago

The filename, directory name, or volume label syntax is incorrect: 'C:\Users\laure\Desktop\nlptextdoc-data-201908\fr.wikipedia.org\wiki\Cat%C3%A9gorie:Portail:Finance'
   at System.IO.FileSystem.CreateDirectory(String fullPath)
   at nlptextdoc.extract.html.WebsiteTextExtractor.WebCrawler_PageCrawlCompletedAsync(Object sender, PageCrawlCompletedArgs e) in C:\Users\laure\OneDrive\Dev\C#\nlptextdoc\nlptextdoc.extract\html\WebsiteTextExtractor.cs:line 445

laurentprudhon commented 5 years ago

Error while processing the page: https://fr.wikipedia.org/wiki/Cat%C3%A9gorie:Portail:Entreprises/Articles_li%C3%A9s

The filename, directory name, or volume label syntax is incorrect: 'C:\Users\laure\Desktop\nlptextdoc-data-201908\fr.wikipedia.org\wiki\Cat%C3%A9gorie:Portail:Entreprises'
   at System.IO.FileSystem.CreateDirectory(String fullPath)
   at nlptextdoc.extract.html.WebsiteTextExtractor.WebCrawler_PageCrawlCompletedAsync(Object sender, PageCrawlCompletedArgs e) in C:\Users\laure\OneDrive\Dev\C#\nlptextdoc\nlptextdoc.extract\html\WebsiteTextExtractor.cs:line 445
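
Both failing paths above contain a colon inside a directory name, which Windows does not allow. A minimal sketch of one possible fix follows, written in Python for illustration only (the project itself is C#, and this is not its actual code). The helper names `safe_segment` and `url_to_segments` are hypothetical. The idea is to percent-decode each URL path segment first, then replace every character Windows rejects in a file or directory name.

```python
import re
from urllib.parse import unquote, urlsplit

# Characters Windows rejects in file and directory names,
# plus ASCII control characters.
_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_segment(segment, replacement="_"):
    """Percent-decode one URL path segment, then replace invalid characters.

    Decoding happens first so that an encoded separator such as %2F
    is also replaced instead of surviving as a literal '/'.
    """
    decoded = unquote(segment)
    return _INVALID.sub(replacement, decoded)


def url_to_segments(url):
    """Split a URL into host + path segments that are safe as directory names."""
    parts = urlsplit(url)
    raw_segments = [s for s in parts.path.split("/") if s]
    # The host is sanitized too, because it may carry a ':port' suffix.
    return [safe_segment(parts.netloc)] + [safe_segment(s) for s in raw_segments]


if __name__ == "__main__":
    url = ("https://fr.wikipedia.org/wiki/"
           "Cat%C3%A9gorie:Portail:Entreprises/Articles_li%C3%A9s")
    print(url_to_segments(url))
```

Replacing with an underscore is an arbitrary choice. Two distinct URLs can map to the same path this way, so a real implementation might append a short hash of the original URL to keep names unique.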

laurentprudhon commented 5 years ago

Error while processing the page: https://fr.wikipedia.org/wiki/Voir_Venise.../...Et_mourir

Could not find file 'C:\Users\laure\Desktop\nlptextdoc-data-201908\fr.wikipedia.org\wiki\Voir_Venise......Et_mourir.nlp.txt'.
   at System.IO.FileInfo.get_Length()
   at nlptextdoc.extract.html.WebsiteTextExtractor.WebCrawler_PageCrawlCompletedAsync(Object sender, PageCrawlCompletedArgs e) in C:\Users\laure\OneDrive\Dev\C#\nlptextdoc\nlptextdoc.extract\html\WebsiteTextExtractor.cs:line 445
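
The trace does not show how this path was built, so the exact cause is not confirmed here. One known Windows behavior that can produce this kind of mismatch is that the Win32 API silently strips trailing dots and spaces from file and directory names. A segment such as `Voir_Venise...` is then created under a different name than the one the program later looks up. A minimal Python sketch of a guard against that follows. It is illustrative only and not the project's C# code, and the helper name `fix_trailing_dots` is hypothetical.

```python
import re

# A run of dots and/or spaces at the very end of a name.
_TRAILING = re.compile(r"[. ]+$")


def fix_trailing_dots(segment, replacement="_"):
    """Replace trailing dots/spaces so the name survives on Windows unchanged.

    Each trailing character is replaced one-for-one, which keeps segments
    of different lengths distinct. Segments that consist only of dots
    ('.', '..', '...') are fully replaced, which also avoids accidental
    references to the current or parent directory.
    """
    fixed = _TRAILING.sub(lambda m: replacement * len(m.group()), segment)
    # An empty segment would not be a valid name either.
    return fixed or replacement


if __name__ == "__main__":
    for seg in ["Voir_Venise...", "...Et_mourir", ".."]:
        print(repr(seg), "->", repr(fix_trailing_dots(seg)))
```

Leading dots are left alone because Windows accepts them. Only the trailing run is rewritten.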