Hi Cathal,
Thanks so much for getting in touch, and great points. As coincidence would have it, I was talking to a friend about exactly this last night, following his experiences using NLTK for NLP, where substitutional tokenisation strategies are available. I had already started work on some of the changes you suggested, so following your email I went ahead and made some more. Really great suggestions. I have committed the changes, so please feel free to take a look and let me know your thoughts. Thanks again for getting in touch, and if you have any other suggestions or want to make Pull Requests please do let me know.
Thanks again,
James
Hi James,
Would you be open to a PR that made some changes to the `Tokeniser` type, and to dependent types, to allow for custom Tokenisers? This would make `nlp` more general for different languages, or for handling different tokenisation strategies.

What I'm imagining is this (note, the workflow is designed to avoid breaking API changes; a rough sketch follows below):

- Change `Tokeniser` to an interface, providing `ForEachIn` and `Tokenise` methods.
- Change `NewTokeniser` to a method that returns a default implementation, which would be identical to the current implementation.
- Add `NewCustomTokeniser(tokenPattern string, stopWordList []string) *Tokeniser`, which would enable easy creation of a custom tokeniser.
- Update `CountVectoriser` and `HashingVectoriser` to allow use of a custom `Tokeniser`, OR (your preference) make their `vec.tokeniser` field into a public field `vec.Tokeniser`, allowing overrides or manual construction of either.

I could probably make the required changes quickly enough, if you're interested. :)