james-bowman / nlp

Selected Machine Learning algorithms for natural language processing and semantic analysis in Golang
MIT License
449 stars 45 forks

Interface for Tokeniser, Allow Custom Tokenisers? #3

Closed: ghost closed this issue 7 years ago

ghost commented 7 years ago

Hi James,

Would you be open to a PR that made some changes to the Tokeniser type, and to dependent types, to allow for custom Tokenisers? This would make nlp more general for different languages, or for handling different tokenisation strategies.

What I'm imagining is this (note, the workflow is designed to avoid breaking API changes):

I could probably make the required changes quickly enough, if you're interested. :)
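
For illustration only, a pluggable tokeniser of the kind described above might be sketched in Go as follows. The names `Tokeniser`, `RegexpTokeniser`, `WhitespaceTokeniser`, and `countTerms` are hypothetical, chosen for this sketch, and are not taken from the nlp package's actual API.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Tokeniser is a hypothetical interface: any type that can split
// text into tokens satisfies it.
type Tokeniser interface {
	Tokenise(text string) []string
}

// RegexpTokeniser extracts tokens matching a regular expression.
type RegexpTokeniser struct {
	re *regexp.Regexp
}

// NewRegexpTokeniser compiles pattern and panics if it is invalid.
func NewRegexpTokeniser(pattern string) *RegexpTokeniser {
	return &RegexpTokeniser{re: regexp.MustCompile(pattern)}
}

// Tokenise lower-cases the text and returns every match of the pattern.
func (t *RegexpTokeniser) Tokenise(text string) []string {
	return t.re.FindAllString(strings.ToLower(text), -1)
}

// WhitespaceTokeniser is an alternative strategy that splits on
// whitespace only, leaving case and punctuation untouched.
type WhitespaceTokeniser struct{}

// Tokenise splits the text on runs of whitespace.
func (WhitespaceTokeniser) Tokenise(text string) []string {
	return strings.Fields(text)
}

// countTerms depends only on the interface, so callers can swap in
// any tokenisation strategy without changing this function.
func countTerms(t Tokeniser, doc string) map[string]int {
	counts := make(map[string]int)
	for _, tok := range t.Tokenise(doc) {
		counts[tok]++
	}
	return counts
}

func main() {
	doc := "The quick brown fox. The lazy dog!"
	fmt.Println(countTerms(NewRegexpTokeniser(`\p{L}+`), doc))
	fmt.Println(countTerms(WhitespaceTokeniser{}, doc))
}
```

Because `countTerms` accepts the interface rather than a concrete type, existing callers could keep a default tokeniser while new callers supply their own, which is one way to avoid a breaking API change.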

james-bowman commented 7 years ago

Hi Cathal,

Thanks so much for getting in touch, and for the great points. As coincidence would have it, I was talking to a friend about this last night, following his experiences using NLTK for NLP, where interchangeable tokenisation strategies are available. I had already started work on some of the changes you suggested, so following your email I went ahead and made some more. I have committed the changes, so please feel free to take a look and let me know your thoughts. If you have any other suggestions or want to make Pull Requests, please do let me know.

Thanks again,

James