Open CalebCourier opened 3 months ago
Using the existing tokenizer as a default and allowing overrides means adding nltk as a dependency and downloading punkt in the post-install. It also means adding the sentence splitting method to the RedundantSentences class so the user can override it.
Currently this validator uses NLTK's Punkt tokenizer to split sentences in the value passed to the `validate` function. This works well for English and possibly other natural languages, but falls short for other blocks of text we might want to dedupe, such as code. One way to enable this is to keep Punkt as the default, but allow the user to override the sentence splitter with whatever tokenizer works best for their use case.
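A rough sketch of what the override could look like, assuming the splitting logic moves into a method on the validator class. The names `RedundantSentencesSketch`, `split_sentences`, `dedupe`, and `RedundantLines` are hypothetical stand-ins for illustration, not the validator's actual API:

```python
from typing import List


class RedundantSentencesSketch:
    """Hypothetical stand-in for the validator; only the splitting hook is shown."""

    def split_sentences(self, value: str) -> List[str]:
        # Default behaviour: NLTK's Punkt tokenizer.
        # Imported lazily so subclasses that override this method
        # do not need nltk or the punkt data installed.
        import nltk

        return nltk.sent_tokenize(value)

    def dedupe(self, value: str) -> List[str]:
        # Keep the first occurrence of each unit, preserving order.
        seen = set()
        unique = []
        for unit in self.split_sentences(value):
            key = unit.strip()
            if key not in seen:
                seen.add(key)
                unique.append(unit)
        return unique


class RedundantLines(RedundantSentencesSketch):
    """Example override: treat each non-empty line as one unit, e.g. for code."""

    def split_sentences(self, value: str) -> List[str]:
        return [line for line in value.splitlines() if line.strip()]
```

With this shape, a user deduping code would subclass and override `split_sentences`, while everyone else keeps the Punkt default. Passing a callable to the constructor would be an alternative design with the same effect.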