Currently there is only one way to provide the linguistic resources (e.g. non-breaker files) to the tokenizer: packaging the files with the tokenizer jar.
I currently work on the Fortis project [1], a social data analysis platform for the United Nations. For this project, we'd like to be able to update the linguistic resources without building and redeploying the jar file, so we need a way to provide the resources to the tokenizer from outside the jar. This pull request implements such a mechanism by adding a new property called `resourcesDirectory`. If this property is set to an existing directory, the tokenizer will try to load the linguistic resources from that directory instead of from the jar file.
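To illustrate the intended lookup order, here is a minimal sketch. It is not the code in this pull request: the class name `ResourceLoaderSketch`, the method `openResource`, and the fallback to the classpath when a file is missing from the directory are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; names and fallback behaviour are illustrative assumptions.
public class ResourceLoaderSketch {

    /**
     * Opens a linguistic resource, preferring an external directory over the jar.
     *
     * @param resourcesDirectory value of the resourcesDirectory property, may be null
     * @param resourceName       file name of the resource, e.g. a non-breaker file
     */
    static InputStream openResource(String resourcesDirectory, String resourceName)
            throws IOException {
        if (resourcesDirectory != null) {
            Path dir = Paths.get(resourcesDirectory);
            // Only use the directory if it actually exists.
            if (Files.isDirectory(dir)) {
                Path candidate = dir.resolve(resourceName);
                if (Files.isRegularFile(candidate)) {
                    return Files.newInputStream(candidate);
                }
            }
        }
        // Assumed fallback: the copy packaged on the classpath (inside the jar).
        InputStream bundled = ResourceLoaderSketch.class
                .getClassLoader()
                .getResourceAsStream(resourceName);
        if (bundled == null) {
            throw new IOException("Resource not found: " + resourceName);
        }
        return bundled;
    }
}
```

With a layout like this, updating a non-breaker file would only require replacing the file in the configured directory; the jar itself stays untouched.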
Another nice property of this change is that it makes it easier for users to comply with the terms of the license of the linguistic resources, the LGPL-LR [2]. We no longer need to bundle the LGPL-LR resources together with the tokenizer code, which means that the application will count as a "work that uses the Linguistic Resource" and as such fall outside the scope of the license.
[1] https://fortis-web.azurewebsites.net/#/site/ocha/
[2] http://infolingu.univ-mlv.fr/DonneesLinguistiques/Lexiques-Grammaires/lgpllr.html