james-bowman / nlp

Selected Machine Learning algorithms for natural language processing and semantic analysis in Golang
MIT License

Vectorisers.go only tokenises a-Z languages. #5

Closed: ThomasK81 closed this issue 6 years ago

ThomasK81 commented 6 years ago

Hi,

Thanks for your hard work; I really like your code base. I have noticed that the package as it stands only works for languages written in the a-Z alphabet. In addition, the hardcoded stop words make it difficult to use even for historic or fringe English corpora. I have a fix for both, but I did not want to open a PR without first creating an issue to ask whether you want to open up the project to non-English, historic English, and non-a-Z languages.

Thanks again!

Best,

Thomas

james-bowman commented 6 years ago

Hi Thomas, thanks. Expanding support for other languages and for historic or specialist English corpora would be fantastic. PRs would be very welcome indeed.

Regards

James

ThomasK81 commented 6 years ago

Great. I have created a PR. Let me know what you think.

Regards,

Thomas

james-bowman commented 6 years ago

PR #6 merged, thanks.