veer66 / wordcut

Thai word breaker for Node.js
GNU Lesser General Public License v3.0

Weird situation with a sentence #21

Open JobiJoba opened 4 years ago

JobiJoba commented 4 years ago

Hey,

I'm trying to cut the following sentence:

ในขณะนั้นดูราวกับว่า อย่างน้อยก็สําหรับพวกเขา

and this is how it gets cut:

ใน|ขณะ|นั้น|ดู|ราว|กับ|ว่า| |อย่าง|น้อ|ยก|็สํา|หรับ|พวก|เขา

The part |อย่าง|น้อ|ยก|็สํา|หรับ|พวก|เขา is obviously wrong ;) I'm trying to understand why it applies the wrong rule in this case...

I've added อย่างน้อย to the custom dictionary, but it still cuts the sentence incorrectly. Is there a way to give the custom dictionary priority?

Thanks

EDIT: OK... I think I found the issue: it comes from สำหรับ, which was encoded weirdly (สำ is in fact written as one char... I don't know how they do that ^^). I hit this issue multiple times in my file... I'll try to find a way to detect these cases.

pepa65 commented 4 years ago

In terms of UTF-8 it's supposed to be 0xe0b8aa 0xe0b8b3 but I've often seen 0xe0b8aa 0xe0b98d 0xe0b8b2 (mostly in pdf documents). Was an encoding different from UTF-8 used? What was the encoding for สำ (you can do a hexdump on it).
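For readers who want to check those byte sequences without an external hex dump tool, here is a minimal Node.js sketch. The helper name `describe` is made up for illustration; it only relies on the built-in `Buffer` and string methods, and it writes the two forms with escapes so an editor cannot silently convert one into the other.

```javascript
// Hypothetical helper: list the code points and UTF-8 bytes of a string.
function describe(text) {
  const codePoints = Array.from(text, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
  const utf8Hex = Buffer.from(text, 'utf8').toString('hex');
  return { codePoints, utf8Hex };
}

// SO SUA followed by SARA AM (the expected form).
const composed = '\u0E2A\u0E33';
// SO SUA followed by NIKHAHIT and SARA AA (the form often seen in PDFs).
const decomposed = '\u0E2A\u0E4D\u0E32';

console.log(describe(composed));
console.log(describe(decomposed));
```

The two strings render almost identically but differ in length (2 versus 3 UTF-16 units), which is why a dictionary lookup on one form does not find the other.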

JobiJoba commented 4 years ago

It's UTF-8 with BOM (Visual Studio Code). In the end I managed to search and replace all the affected characters. Pretty annoying, but it works ;)

pepa65 commented 4 years ago

It would have been interesting to know what the unrecognized encoding was. Swath splits both of the encodings of สำ that I mentioned correctly. For programs like wordcut, it would be good to know when the current code fails on weird cases "in the wild".

JobiJoba commented 4 years ago

I still have the original file. How do you do a hex dump? Here is the sentence that causes the issue: สํา อย่างน้อยก็สําหรับพวกเขา

pepa65 commented 4 years ago

I would think VS Code must have a way to display the underlying byte sequences? Or try this: https://www.di-mgt.com.au/hexdump-for-windows.html

The fragment you posted here is the regular ส-เสือ ำ-สระอัม, but something might have been lost in transit somewhere. If you make (a fragment of) the original file available, I would love to take a look..!

JobiJoba commented 4 years ago

Sorry I don't have Windows :< But here is the file (https://wetransfer.com/downloads/0fc3695f682166669ea7eddce8d191b520200725032821/8d2412039769703fcd8298e92399cc2520200725032834/01ddb2)

What I'm currently doing is taking subtitles in Thai and English, merging them, and cutting each Thai sentence word by word.

pepa65 commented 4 years ago

Sorry, didn't mean to offend you, of course Code is available on other platforms..! On Linux I use Hexdump, but there are many tools like it, some installed by default.

Thanks for the file. Indeed, the 0xe0b8aa 0xe0b98d 0xe0b8b2 representation is used. In Swath, this works OK. It could be interesting for the author of Wordcut to take this irregular representation into account, i.e. 0xe0b98d 0xe0b8b2 for 0xe0b8b3.
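To locate such sequences in a file without a hex editor, a small scan along these lines might help. This is only a sketch: `findDecomposedSaraAm` and `stripBom` are illustrative names, and it assumes the file has already been read into a string (for instance with `fs.readFileSync(path, 'utf8')`, which leaves a leading BOM in place).

```javascript
// U+0E4D (NIKHAHIT) followed by U+0E32 (SARA AA): the irregular pair.
const NIKHAHIT_SARA_AA = '\u0E4D\u0E32';

// Drop a leading byte order mark, as found in "UTF-8 with BOM" files.
function stripBom(text) {
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}

// Return the UTF-16 index of every irregular pair in `text`.
function findDecomposedSaraAm(text) {
  const positions = [];
  let at = text.indexOf(NIKHAHIT_SARA_AA);
  while (at !== -1) {
    positions.push(at);
    at = text.indexOf(NIKHAHIT_SARA_AA, at + NIKHAHIT_SARA_AA.length);
  }
  return positions;
}

// The word from the issue, spelled with the irregular pair and a BOM in front.
const sample = '\uFEFF\u0E2A\u0E4D\u0E32\u0E2B\u0E23\u0E31\u0E1A';
console.log(findDecomposedSaraAm(stripBom(sample)));
```

An empty result means the text contains no adjacent NIKHAHIT + SARA AA pair; it says nothing about other irregular spellings.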

JobiJoba commented 4 years ago

No offense at all ;)

Swath is currently only available in C++, right? I would love to test it ^^

pepa65 commented 4 years ago

It's easy to build, you need to install the packages libtool and libdatrie-dev first, and then ./autogen.sh; ./configure; make

JobiJoba commented 4 years ago

Unfortunately I can't find libdatrie-dev for macOS. Do you have a link?

pepa65 commented 4 years ago

In that case you have to build that one too: https://github.com/tlwg/libdatrie Or, you can do ./configure --disable-dict instead of a plain ./configure..!

veer66 commented 4 years ago

Will it work if I steal th_wnormalize from libthai?

veer66 commented 4 years ago

@JobiJoba Do you have to use JS?

veer66 commented 4 years ago

> Will it work if I steal th_wnormalize from libthai?

It looks a bit scary, though: https://github.com/tlwg/libthai/blob/d43ca014594d4aebd71e2171a5c8eafe10fe5ed4/src/thstr/thstr.c#L43

pepa65 commented 4 years ago

Worth trying. You could also do one pass and replace all 0xe0b98d 0xe0b8b2 by 0xe0b8b3 and then it would work as is.
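A minimal sketch of that one-pass replacement in JavaScript follows. The function name `normalizeSaraAm` is an assumption for illustration and is not part of wordcut; it works on code points (U+0E4D U+0E32 to U+0E33), which correspond to the byte sequences quoted above once the text has been decoded from UTF-8.

```javascript
// Replace every NIKHAHIT + SARA AA pair with a single SARA AM.
// Only the exact adjacent pair is handled; other irregular orderings
// (for example a tone mark between the two) are left untouched.
function normalizeSaraAm(text) {
  return text.replace(/\u0E4D\u0E32/g, '\u0E33');
}

const irregular = '\u0E2A\u0E4D\u0E32\u0E2B\u0E23\u0E31\u0E1A';
const regular = normalizeSaraAm(irregular);
console.log(regular.length, irregular.length);
```

Running the input through such a function before handing it to the word breaker would leave already-regular text unchanged, so it should be safe for files that mix both spellings.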

JobiJoba commented 4 years ago

I could rewrite my algorithm in Python or Dart, but it needs to be worth it ;) Right now I use wordcut as a fallback when my algo cannot resolve a sentence.

I must say I don't understand your algorithm well, but it cuts words too much.

veer66 commented 4 years ago

> Worth trying. You could also do one pass and replace all 0xe0b98d 0xe0b8b2 by 0xe0b8b3 and then it would work as is.

Are you interested in creating a pull request?

pepa65 commented 4 years ago

Not instantly, I'm just off for a holiday today..! But it shouldn't be too hard, it's mostly finding where to put it in the code. It would be easy for you. ;-)

veer66 commented 4 years ago

> I could rewrite my algorithm in python or dart but it needs to be worth ;)

I have a wordcut in Rust (https://github.com/veer66/chamkho) and in Python (https://github.com/veer66/wordcutpy) too. If JS is not obligatory, the Rust version runs much faster than this one.

> Right now I use wordcut as a fallback when my algo cannot resolve a sentence.
>
> I must say I don't understand well your algorithm but it cut words too much

It groups substrings together by matching them against a word in the word list (https://github.com/veer66/wordcut/tree/master/data) or against rules. In this case, I think the substring cannot be matched with any word in the word list.
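To illustrate why an irregular spelling defeats dictionary matching, here is a toy greedy longest-match segmenter. It is not wordcut's implementation (wordcut also applies rules and a different fallback); `segment` and the one-word dictionary are assumptions made up for this sketch.

```javascript
// Toy greedy longest-match segmenter; NOT wordcut's actual algorithm.
function segment(text, dictionary) {
  const maxLen = Math.max(...Array.from(dictionary, (w) => w.length));
  const tokens = [];
  let i = 0;
  while (i < text.length) {
    let token = text[i]; // fallback: one UTF-16 unit for unknown material
    for (let len = Math.min(maxLen, text.length - i); len > 1; len--) {
      const candidate = text.slice(i, i + len);
      if (dictionary.has(candidate)) {
        token = candidate;
        break;
      }
    }
    tokens.push(token);
    i += token.length;
  }
  return tokens;
}

// The dictionary holds the word spelled with SARA AM (U+0E33) only.
const dictionary = new Set(['\u0E2A\u0E33\u0E2B\u0E23\u0E31\u0E1A']);
const regularWord = '\u0E2A\u0E33\u0E2B\u0E23\u0E31\u0E1A';
const irregularWord = '\u0E2A\u0E4D\u0E32\u0E2B\u0E23\u0E31\u0E1A';

console.log(segment(regularWord, dictionary).length);
console.log(segment(irregularWord, dictionary).length);
```

The regular spelling comes back as one token, while the irregular spelling never matches the dictionary entry and falls apart into fragments, which is the same kind of over-cutting reported in this issue.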

So adding a function that replaces the string, as @pepa65 suggested, in https://github.com/veer66/wordcut/blob/f3f403c6e777da1d394a82dabc5d9e5f9099980c/lib/wordcut_core.js#L3 should be okay.

veer66 commented 4 years ago

> Not instantly, I'm just off for a holiday today..! But it shouldn't be too hard, it's mostly finding where to put it in the code. It would be easy for you. ;-)

You can put your code after these lines.

I will be glad if you refactor it too. 😅

JobiJoba commented 4 years ago

Mmmm interesting... I could rewrite the whole thing in Python to have multiple fallbacks ;) My algo, yours, and pythainlp :D

pepa65 commented 4 years ago

I found the approach of adding the faulty sequence to the dictionary easier. In the code you would have to try both representations, which means deciding where to cut, and I don't know how to do that cleanly so that it also works for texts that mix both cases. If a text uses only one or the other, it is easier...
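The dictionary-side approach could be sketched as follows: for every word containing SARA AM, also add the spelling with NIKHAHIT + SARA AA. `withDecomposedVariants` is a hypothetical helper, and how the resulting list is loaded into wordcut as a custom dictionary is left out here.

```javascript
// For each word containing SARA AM (U+0E33), also add the variant in
// which every SARA AM is spelled as NIKHAHIT + SARA AA (U+0E4D U+0E32).
// Words that mix both spellings internally are not generated.
function withDecomposedVariants(words) {
  const expanded = new Set(words);
  for (const word of words) {
    if (word.includes('\u0E33')) {
      expanded.add(word.replace(/\u0E33/g, '\u0E4D\u0E32'));
    }
  }
  return [...expanded];
}

const baseWords = ['\u0E2A\u0E33\u0E2B\u0E23\u0E31\u0E1A', '\u0E40\u0E02\u0E32'];
const expandedWords = withDecomposedVariants(baseWords);
console.log(expandedWords.length);
```

This keeps the input text untouched, at the cost of a larger word list; normalizing the input instead keeps the dictionary small but changes the text that comes back from the word breaker.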