laurentprudhon/nlptextdoc
Suite of tools to extract and annotate language resources for NLP applications
Other
1
stars
2
forks
issues
Bump Newtonsoft.Json from 12.0.3 to 13.0.2 in /nlptextdoc.image2
#45
dependabot[bot]
opened
1 year ago
0
Bump Newtonsoft.Json from 12.0.3 to 13.0.2 in /nlptextdoc.image
#44
dependabot[bot]
opened
1 year ago
0
Bump Newtonsoft.Json from 12.0.3 to 13.0.1 in /nlptextdoc.image2
#43
dependabot[bot]
closed
1 year ago
1
Bump Newtonsoft.Json from 12.0.3 to 13.0.1 in /nlptextdoc.image
#42
dependabot[bot]
closed
1 year ago
1
Impossible to pass a root Url with query parameters on the command line
#41
laurentprudhon
opened
5 years ago
0
Relative links resolution fails after redirect
#40
laurentprudhon
opened
5 years ago
0
When the continue command is used, don't reset the counters
#39
laurentprudhon
opened
5 years ago
0
For very long extractions, store a backup of the checkpoint every half hour
#38
laurentprudhon
opened
5 years ago
0
excludeUrls doesn't work as expected on continue
#37
laurentprudhon
opened
5 years ago
0
Register a trace of each run in the logs directory
#36
laurentprudhon
closed
5 years ago
0
List of Urls to exclude from the crawl
#35
laurentprudhon
closed
5 years ago
0
Move all diagnostic files to a specific subdirectory + trace launch params
#34
laurentprudhon
closed
5 years ago
0
Implement checkpoint and restart capability
#33
laurentprudhon
closed
5 years ago
0
Fix time measurements to exclude Hibernation and restrict to CPU time (vs elapsed)
#32
laurentprudhon
closed
5 years ago
3
Add three new stopping criteria: duration, size on disk, percent unique
#31
laurentprudhon
closed
5 years ago
0
Status codes like 302 Redirect or 404 NotFound should not be counted as errors in the status display
#30
laurentprudhon
closed
5 years ago
0
Calculate % of unique text blocks based on the number of chars
#29
laurentprudhon
closed
5 years ago
1
AngleSharp bug: fails to parse self-closing iframe tag
#28
laurentprudhon
opened
5 years ago
1
Ignore pages with 0% unique text blocks
#27
laurentprudhon
closed
5 years ago
1
Invalid file path generated when URL contains % encoded char codes or multiple dots
#26
laurentprudhon
closed
5 years ago
2
Implement a fine grained "% of unique text blocks" stopping scheme
#25
laurentprudhon
closed
5 years ago
1
Enable choosing the scope of the extraction: domain, subdomain, url
#24
laurentprudhon
closed
5 years ago
1
Encoding name not supported in AngleSharp.Network.VirtualResponse.Content
#23
laurentprudhon
closed
5 years ago
0
UriFormatException in Abot.Core.PageRequester.MakeRequest
#22
laurentprudhon
closed
5 years ago
0
Headers nested inside list items trigger exceptions while writing the text document
#21
laurentprudhon
opened
5 years ago
0
For some pages, no text is extracted
#20
laurentprudhon
closed
5 years ago
1
Exclude pages based on the language
#19
laurentprudhon
closed
5 years ago
1
Website crawling out of original domain
#18
laurentprudhon
closed
5 years ago
1
Page characters decoding problems
#17
laurentprudhon
closed
5 years ago
3
Limit file path size to 255 chars
#16
laurentprudhon
closed
5 years ago
2
Encode TextBlocks beginning with ## while writing an nlp.txt file
#15
laurentprudhon
closed
5 years ago
1
ParseInt32 exception at HtmlDocumentConverter.VisitTableHeaderOrCell
#14
laurentprudhon
closed
5 years ago
1
Unspecified exception in System.Collections.Concurrent.ConcurrentDictionary
#13
laurentprudhon
closed
5 years ago
3
Unspecified exception in System.IO.FileStream.ValidateFileHandle
#12
laurentprudhon
closed
5 years ago
1
Unspecified exception in WebCrawler_PageCrawlCompletedAsync
#11
laurentprudhon
closed
5 years ago
1
Incorrect display of the % of unique text blocks
#10
laurentprudhon
closed
5 years ago
1
Some websites are not extracted at all
#9
laurentprudhon
closed
5 years ago
6
Log all details about exceptions printed in the console
#8
laurentprudhon
closed
5 years ago
1
Robots.txt directives aren't followed as they should
#7
laurentprudhon
closed
5 years ago
1
Http logs are mingled when using multiple threads
#6
laurentprudhon
closed
5 years ago
1
Make sure that #anchors in Urls don't trigger the crawl of a new page
#5
laurentprudhon
closed
5 years ago
1
Http return codes 301 "Moved Permanently" and 303 "See Other" not handled properly
#4
laurentprudhon
closed
5 years ago
2
Url not encoded properly while crawling French website
#3
laurentprudhon
closed
5 years ago
1
Fatal exception while crawling: Index was outside the bounds of the array
#2
laurentprudhon
closed
5 years ago
1
Fatal exception while crawling: An item with the same key has already been added
#1
laurentprudhon
closed
5 years ago
1