veer66 / wordcut

Thai word breaker for Node.js
GNU Lesser General Public License v3.0

Issue with punctuation marks #23

Open artt opened 3 years ago

artt commented 3 years ago

It seems like the library has issues dealing with punctuation marks. For example,

wordcut.cut("ป่วยหรืออ่อนแอ?")
// ป่วย|หรือ|อ่อน|แอ?

But...

wordcut.cut("ป่วยหรืออ่อนแอ")
// ป่วย|หรือ|อ่อนแอ

pepa65 commented 3 years ago

It shouldn't be too hard to exclude punctuation from the analysis, should it?
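One way to read this suggestion is sketched below: keep punctuation out of the segmenter's input, but emit it as its own token so nothing is lost from the text. `cutKeepingPunctuation` and its `segment` parameter are hypothetical names for illustration, not part of wordcut; `segment` is assumed to be any function that takes a string and returns an array of tokens.

```javascript
// Runs of punctuation, symbols, or whitespace (Unicode-aware).
const SEPARATORS = /([\p{P}\p{S}\s]+)/u;

// Hypothetical helper, not part of wordcut.
// `segment` is assumed to map a string to an array of word tokens.
function cutKeepingPunctuation(text, segment) {
  const tokens = [];
  // The capturing group makes split() keep the separator runs,
  // which land at the odd indexes of the result.
  text.split(SEPARATORS).forEach((run, index) => {
    if (run === "") return;                 // skip empty edge pieces
    if (index % 2 === 1) tokens.push(run);  // separator run: keep as-is
    else tokens.push(...segment(run));      // text run: segment it
  });
  return tokens;
}
```

With the real library, `segment` could be something like `(s) => wordcut.cut(s).split("|")`, assuming `cut` returns the `|`-delimited string shown in the outputs above. Because `|` is itself treated as a separator here, it never reaches the segmenter, so splitting on it stays unambiguous.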

artt commented 3 years ago

Ah, I'm using this for a search engine that highlights matched words but doesn't allow infix search, so I'm segmenting the query along with the indexed data. Removing punctuation marks would return an altered version of the matched text.

For example, searching for อ่อนแอ should ideally return

ป่วยหรือ<mark>อ่อนแอ</mark>?

Doing what you suggested would return

ป่วยหรือ<mark>อ่อนแอ</mark>

I realize that this is a very specific use case, but I just want to note the discrepancy.
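To make the use case concrete, here is a minimal sketch of token-level highlighting. `highlightTokens` is a hypothetical helper, not part of wordcut or of any search engine; it assumes the indexed text and the query were segmented by the same segmenter and that joining the tokens reproduces the original text.

```javascript
// Hypothetical helper: wraps each occurrence of the query token sequence
// in <mark>...</mark>. Both arguments are arrays of string tokens.
// Note: no HTML escaping is done here; this is only a sketch.
function highlightTokens(tokens, queryTokens) {
  if (queryTokens.length === 0) return tokens.join("");
  let out = "";
  let i = 0;
  while (i < tokens.length) {
    // A match requires every query token to line up on token boundaries.
    const isMatch = queryTokens.every((q, j) => tokens[i + j] === q);
    if (isMatch) {
      out += "<mark>" + queryTokens.join("") + "</mark>";
      i += queryTokens.length;
    } else {
      out += tokens[i];
      i += 1;
    }
  }
  return out;
}
```

With tokens `ป่วย`, `หรือ`, `อ่อนแอ`, `?` and the query token `อ่อนแอ`, this returns `ป่วยหรือ<mark>อ่อนแอ</mark>?`. With the segmentation reported above (`อ่อน`, `แอ?`), no token equals `อ่อนแอ`, so nothing is highlighted, which is why the discrepancy matters for this kind of search.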

pepa65 commented 3 years ago

It is obviously a bug that the results differ. I was just thinking it might not be too hard for @veer66 to fix!