Hi Cathal,
Thanks so much for getting in touch, and great points. As coincidence would have it, I was talking to a friend about exactly this last night, following his experiences using NLTK for NLP, where substitutional tokenisation strategies are available. I had already started work on some of the changes you suggested, so following your email I went ahead and made some more. Really great suggestions. I have committed the changes, so please feel free to take a look and let me know your thoughts. Thanks again for getting in touch, and if you have any other suggestions or want to make Pull Requests please do let me know.
Thanks again,
James
Hi James,
Would you be open to a PR that made some changes to the `Tokeniser` type, and to dependent types, to allow for custom Tokenisers? This would make `nlp` more general for different languages, or for handling different tokenisation strategies.

What I'm imagining is this (note, the workflow is designed to avoid breaking API changes; a rough sketch follows below):

- Change `Tokeniser` to an interface, providing `ForEachIn` and `Tokenise` methods.
- Change `NewTokeniser` to a method that returns a default implementation, which would be identical to the current implementation.
- Add `NewCustomTokeniser(tokenPattern string, stopWordList []string) *Tokeniser`, which would enable easy creation of a custom tokeniser.
- Update `CountVectoriser` and `HashingVectoriser` to allow use of a custom `Tokeniser`, OR (your preference) make their `vec.tokeniser` field into a public field `vec.Tokeniser`, allowing overrides or manual construction of either.

I could probably make the required changes quickly enough, if you're interested. :)