laurentprudhon / nlptextdoc

Suite of tools to extract and annotate language resources for NLP applications

List of Urls to exclude from the crawl #35

Closed laurentprudhon closed 5 years ago

laurentprudhon commented 5 years ago

Take advantage of the checkpoint/restart capability and of the new params trace file to add the following important feature:

Read a list of URLs to exclude from the crawl.

(We should reuse the robots exclusion engine to match these URLs.)
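
A minimal sketch of the idea, written in Python for illustration only and not taken from the nlptextdoc code base: each excluded URL is turned into a robots.txt-style `Disallow` rule and handed to a standard robots exclusion parser, so the same matching logic serves both robots.txt and the user-supplied list. The function names `load_exclusions` and `is_excluded`, and the assumed file format (one URL per line, `#` for comments), are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def load_exclusions(lines):
    """Build one robots-style exclusion engine per host from a list of URLs.

    Assumed input format: one URL per line; blank lines and lines
    starting with '#' are ignored.
    """
    paths_by_host = defaultdict(list)
    for line in lines:
        url = line.strip()
        if not url or url.startswith("#"):
            continue
        parts = urlsplit(url)
        # An excluded URL without a path excludes the whole site.
        paths_by_host[parts.netloc].append(parts.path or "/")

    engines = {}
    for host, paths in paths_by_host.items():
        parser = RobotFileParser()
        # Reuse the robots.txt rule syntax: each path becomes a Disallow rule.
        parser.parse(["User-agent: *"] + [f"Disallow: {p}" for p in paths])
        engines[host] = parser
    return engines


def is_excluded(engines, url):
    """Return True if the URL matches an exclusion rule for its host."""
    parser = engines.get(urlsplit(url).netloc)
    return parser is not None and not parser.can_fetch("*", url)
```

Because robots rules match by path prefix, listing `https://example.com/private/` in this sketch excludes every page under that directory, while hosts that do not appear in the list are never excluded.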