laurentprudhon / nlptextdoc

Suite of tools to extract and annotate language resources for NLP applications

excludeUrls doesn't work as expected on continue #37

Open laurentprudhon opened 5 years ago

laurentprudhon commented 5 years ago

When continuing a crawl, excludeUrls directives are not applied to the URLs already added to the scheduler state: this renders the feature almost useless.
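
One possible fix, sketched in Python rather than in the project's own code: after the scheduler state is restored, re-apply the current exclude patterns to the pending URLs before crawling resumes. The function name `filter_pending_urls`, the list-based queue, and the prefix-matching rule are all assumptions made for illustration; they are not nlptextdoc's actual API or matching semantics.

```python
def filter_pending_urls(pending, exclude_prefixes):
    """Drop restored URLs that match any of the current exclude prefixes."""
    # Assumption: each excludeUrls directive is a plain URL prefix.
    return [
        url for url in pending
        if not any(url.startswith(prefix) for prefix in exclude_prefixes)
    ]


# Hypothetical scheduler state restored from a previous crawl.
restored = [
    "https://example.com/blog/post-1",
    "https://example.com/forum/thread-9",
    "https://example.com/about",
]

# Exclude directive added to the config before continuing the crawl.
kept = filter_pending_urls(restored, ["https://example.com/forum/"])
```

Here `kept` retains only the blog and about pages, so a directive added between two runs would also take effect on URLs queued earlier.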

Also: each time we rewrite the config file when continuing the crawl process, the number of excludeUrls lines in the file doubles!
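
Whatever the root cause of the duplication, one guard on the write path would be to deduplicate the directives before saving. The sketch below is Python, not the project's code, and it assumes a hypothetical `key=value` line format such as `excludeUrls=/forum/`; the real config syntax may differ.

```python
def dedupe_exclude_lines(lines, key="excludeUrls"):
    """Remove repeated exclude directives, keeping first occurrences in order."""
    seen = set()
    result = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(key):
            if stripped in seen:
                continue  # skip a directive that was already written
            seen.add(stripped)
        result.append(line)  # all other lines pass through unchanged
    return result


# Hypothetical config content after one "continue" has doubled the directives.
config_lines = [
    "rootUrl=https://example.com/",
    "excludeUrls=/forum/",
    "excludeUrls=/login",
    "excludeUrls=/forum/",
    "excludeUrls=/login",
]
cleaned = dedupe_exclude_lines(config_lines)
```

Applying this before each save keeps the file stable across repeated continues, although it only hides the duplication rather than removing its source.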