Allow User to Override Tokenizer

guardrails-ai / redundant_sentences

Guardrails AI: Redundant sentences - Removes redundant sentences from a string

Apache License 2.0


Allow User to Override Tokenizer #3

Open CalebCourier opened 3 months ago

CalebCourier commented 3 months ago

Currently this validator uses NLTK's Punkt tokenizer to split the value passed to the validate function into sentences. This works well for English and possibly other natural languages, but it falls short for other blocks of text we might want to dedupe, such as code.

One way to enable this is to keep Punkt as the default, but allow the user to override the sentence splitter with whatever tokenizer works best for their use case.

CalebCourier commented 3 months ago

Using the existing tokenizer as a default and allowing overrides means adding nltk as a dependency and downloading Punkt in the post-install step. It also means adding the sentence-splitting method to the RedundantSentences class so the user can override it.
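A rough sketch of what the override could look like, written as a standalone class rather than against the real validator. Everything named here (`RedundantSentencesSketch`, `split_sentences`, `join_sentences`, `dedupe`, `LineDeduper`) is hypothetical, and exact-match deduplication is only a stand-in for whatever comparison the validator actually performs. The default path assumes nltk is installed and the Punkt data has already been downloaded.

```python
from typing import List


class RedundantSentencesSketch:
    """Hypothetical stand-in for the validator, not the real Guardrails class."""

    def split_sentences(self, value: str) -> List[str]:
        # Default splitter: NLTK's Punkt-based sentence tokenizer.
        # Imported lazily so subclasses that override this method
        # do not need nltk installed at all.
        from nltk.tokenize import sent_tokenize

        return sent_tokenize(value)

    def join_sentences(self, sentences: List[str]) -> str:
        # Sentences are rejoined with a single space by default.
        return " ".join(sentences)

    def dedupe(self, value: str) -> str:
        # Simplification: drop exact repeats and blank units, keep first occurrences.
        seen = set()
        kept = []
        for sentence in self.split_sentences(value):
            if not sentence.strip() or sentence in seen:
                continue
            seen.add(sentence)
            kept.append(sentence)
        return self.join_sentences(kept)


class LineDeduper(RedundantSentencesSketch):
    """Example override: treat each line as a unit, e.g. for code-like text."""

    def split_sentences(self, value: str) -> List[str]:
        return value.splitlines()

    def join_sentences(self, sentences: List[str]) -> str:
        return "\n".join(sentences)


if __name__ == "__main__":
    snippet = "import os\nimport sys\nimport os\nprint(sys.argv)"
    print(LineDeduper().dedupe(snippet))
```

With this shape, users who are happy with Punkt get it without any extra configuration, while anyone deduping code or another format only has to subclass and replace the splitter. Accepting a callable in the constructor would be another option; the subclass approach is shown here because it matches the "add the method to the class" idea above.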