james-bowman / nlp

Selected Machine Learning algorithms for natural language processing and semantic analysis in Golang
MIT License
446 stars 45 forks

Replace default regexp in tokenizer with a more universal one #11

Closed (recoilme closed 3 years ago)

recoilme commented 4 years ago

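The old regexp, "[\p{L}]+", matches only runs of letters, so it splits documents like "os24120z R2D2" into ["os","z","R","D"] and drops the digits. I replaced it with \S (non-whitespace, equivalent to [^\t\n\f\r ]), so tokens are split only on whitespace.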