Open JobiJoba opened 4 years ago
In UTF-8, สำ is supposed to be 0xe0b8aa 0xe0b8b3, but I've often seen 0xe0b8aa 0xe0b98d 0xe0b8b2 (mostly in PDF documents). Was an encoding other than UTF-8 used? How was สำ encoded in your file (you can run a hexdump on it)?
It's UTF-8 with BOM (Visual Studio Code). I managed to search and replace all the characters in the end. Pretty annoying, but it works ;)
It would have been interesting to know what the unrecognized encoding was. In swath, both of the encodings for สำ I mentioned are split correctly. For programs like wordcut, it would be good to know when the current code doesn't work for weird cases "in the wild".
I still have the original file. How do you do a hex dump? Here is the sentence that causes the issue: สํา อย่างน้อยก็สําหรับพวกเขา
I would think VS Code must have a way to display the underlying byte sequences? Or try this: https://www.di-mgt.com.au/hexdump-for-windows.html
The fragment you posted here is the regular ส-เสือ ำ-สระอัม, but something might have been lost in transit somewhere. If you make (a fragment of) the original file available, I would love to take a look..!
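For anyone without a hexdump tool at hand, here is a minimal Node.js sketch that prints the UTF-8 bytes of each code point, which is enough to tell the two spellings of สำ apart. The helper name `utf8Hex` is made up for this example, and the two test strings are built from the code points discussed above (U+0E33 versus U+0E4D followed by U+0E32).

```javascript
// Hypothetical helper: show the UTF-8 bytes of a string, one hex group per code point.
function utf8Hex(text) {
  const encoder = new TextEncoder();
  // Array.from iterates a string by code point, not by UTF-16 code unit.
  return Array.from(text, (ch) =>
    "0x" +
    Array.from(encoder.encode(ch), (b) => b.toString(16).padStart(2, "0")).join("")
  ).join(" ");
}

const composed = "\u0E2A\u0E33";         // SO SUA + SARA AM
const decomposed = "\u0E2A\u0E4D\u0E32"; // SO SUA + NIKHAHIT + SARA AA

console.log(utf8Hex(composed));   // 0xe0b8aa 0xe0b8b3
console.log(utf8Hex(decomposed)); // 0xe0b8aa 0xe0b98d 0xe0b8b2
```

Pasting a suspicious word from the file into `utf8Hex` shows which of the two byte sequences it actually contains.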
Sorry I don't have Windows :< But here is the file (https://wetransfer.com/downloads/0fc3695f682166669ea7eddce8d191b520200725032821/8d2412039769703fcd8298e92399cc2520200725032834/01ddb2)
What I'm currently doing is taking the subtitles in Thai and English, merging them, and cutting the Thai sentences word by word.
Sorry, didn't mean to offend you, of course Code is available on other platforms..! On Linux I use Hexdump, but there are many tools like it, some installed by default.
Thanks for the file. Indeed, the 0xe0b8aa 0xe0b98d 0xe0b8b2 representation is used. In Swath, this works OK. It could be interesting for the author of Wordcut to take this irregular representation into account, i.e. 0xe0b98d 0xe0b8b2 for 0xe0b8b3.
No offense at all ;)
Swath is currently only available in C++, right? I would love to test it ^^
It's easy to build: first install the packages libtool and libdatrie-dev, then run ./autogen.sh; ./configure; make
I can't find libdatrie-dev for macOS, unfortunately. Do you have a link?
In that case you have to build that one too: https://github.com/tlwg/libdatrie
Or you can run ./configure --disable-dict instead of a plain ./configure..!
@JobiJoba Do you have to use JS?
Will it work if I steal th_wnormalize from libthai?
It looks a bit scary, though. https://github.com/tlwg/libthai/blob/d43ca014594d4aebd71e2171a5c8eafe10fe5ed4/src/thstr/thstr.c#L43
Worth trying. You could also do one pass and replace all 0xe0b98d 0xe0b8b2 by 0xe0b8b3 and then it would work as is.
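The one-pass replacement could look roughly like this in JavaScript. This is only a sketch: the function name `normalizeSaraAm` is made up and is not part of wordcut, and it only handles NIKHAHIT (U+0E4D) immediately followed by SARA AA (U+0E32). A tone mark sitting between the two code points is not handled here.

```javascript
// Hypothetical pre-processing step, to be run on the text before word cutting.
const DECOMPOSED_SARA_AM = /\u0E4D\u0E32/g; // NIKHAHIT + SARA AA
const SARA_AM = "\u0E33";

function normalizeSaraAm(text) {
  // Replace every decomposed pair with the single SARA AM code point.
  return text.replace(DECOMPOSED_SARA_AM, SARA_AM);
}

// The word from this issue, spelled with the irregular sequence.
const irregular = "\u0E2A\u0E4D\u0E32\u0E2B\u0E23\u0E31\u0E1A";
console.log(normalizeSaraAm(irregular) === "\u0E2A\u0E33\u0E2B\u0E23\u0E31\u0E1A"); // true
```

Text that already uses U+0E33 passes through unchanged, so running this on mixed input should be harmless.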
I could rewrite my algorithm in Python or Dart, but it needs to be worth it ;) Right now I use wordcut as a fallback when my algo cannot resolve a sentence.
I must say I don't understand your algorithm well, but it cuts words too much.
> Worth trying. You could also do one pass and replace all 0xe0b98d 0xe0b8b2 by 0xe0b8b3 and then it would work as is.
Are you interested in creating a pull request?
Not instantly, I'm just off for a holiday today..! But it shouldn't be too hard, it's mostly finding where to put it in the code. It would be easy for you. ;-)
> I could rewrite my algorithm in Python or Dart, but it needs to be worth it ;)
I have a wordcut in Rust (https://github.com/veer66/chamkho), and in Python (https://github.com/veer66/wordcutpy) too. If JS is not obligatory, Rust version runs much faster than this one.
> Right now I use wordcut as a fallback when my algo cannot resolve a sentence.
> I must say I don't understand your algorithm well, but it cuts words too much.
It groups substrings together by matching them against a word in the word list (https://github.com/veer66/wordcut/tree/master/data) or against rules. In this case, I think the substring cannot be matched with any word in the word list.
So adding a function that replaces the string, as @pepa65 suggested, in https://github.com/veer66/wordcut/blob/f3f403c6e777da1d394a82dabc5d9e5f9099980c/lib/wordcut_core.js#L3 should be okay.
> Not instantly, I'm just off for a holiday today..! But it shouldn't be too hard, it's mostly finding where to put it in the code. It would be easy for you. ;-)
You can put your code after these lines.
I will be glad if you refactor it too. 😅
Mmmm interesting... I could rewrite the whole thing in Python to have multiple fallbacks ;) My algo, yours, and pythainlp :D
I found the approach of adding the faulty sequence to the dictionary easier. In the code, you have to try both, and for that you have to decide where to cut. I don't know how to do that cleanly so that it works for texts with mixed cases. If it is only one or the other, it is easier...
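The dictionary approach could be sketched like this: for every word containing SARA AM (U+0E33), also add a variant spelled with NIKHAHIT + SARA AA. The function name `withDecomposedVariants` is made up, and how the resulting list gets loaded into wordcut is left out because it is not shown in this thread. The sketch only prepares the extra entries.

```javascript
// Hypothetical helper: extend a word list with the irregular spelling of each word.
const SARA_AM = "\u0E33";
const NIKHAHIT_SARA_AA = "\u0E4D\u0E32";

function withDecomposedVariants(words) {
  const result = new Set(words);
  for (const word of words) {
    if (word.includes(SARA_AM)) {
      // Swap every SARA AM in the word for the two-code-point sequence.
      result.add(word.split(SARA_AM).join(NIKHAHIT_SARA_AA));
    }
  }
  return [...result];
}

const sample = ["\u0E2A\u0E33\u0E2B\u0E23\u0E31\u0E1A", "\u0E1E\u0E27\u0E01"];
console.log(withDecomposedVariants(sample).length); // 3
```

A word with several SARA AM characters only gets the all-decomposed variant here, so a word mixing both spellings would still not match.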
Hey,
I'm trying to cut the following sentence:
ในขณะนั้นดูราวกับว่า อย่างน้อยก็สําหรับพวกเขา and this is how it cuts it: ใน|ขณะ|นั้น|ดู|ราว|กับ|ว่า| |อย่าง|น้อ|ยก|็สํา|หรับ|พวก|เขา
This part |อย่าง|น้อ|ยก|็สํา|หรับ|พวก|เขา obviously does not work ;) I'm trying to understand why it applies a wrong rule in that case...
I've added อย่างน้อย to the custom dictionary, but it's still cutting wrongly. Is there a way to give the custom dictionary priority?
Thanks
EDIT: OK... I think I found the issue. It comes from สำหรับ, which was encoded weirdly (สำ is in fact written as one char... I don't know how they do that ^^). I got that issue multiple times in my file... I'll try to find a way to discover them.
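One way to discover the affected spots might be to scan the text for NIKHAHIT (U+0E4D) directly followed by SARA AA (U+0E32) and report where each one sits. This is a rough sketch: `findDecomposedSaraAm` and its `contextChars` parameter are invented for this example, and reading the subtitle file into a string is left out.

```javascript
// Hypothetical scanner: list every NIKHAHIT + SARA AA pair with a bit of context.
function findDecomposedSaraAm(text, contextChars = 3) {
  const hits = [];
  for (const match of text.matchAll(/\u0E4D\u0E32/g)) {
    const start = Math.max(0, match.index - contextChars);
    const end = match.index + match[0].length + contextChars;
    // index is a UTF-16 offset into the string, not a byte offset in the file.
    hits.push({ index: match.index, context: text.slice(start, end) });
  }
  return hits;
}

// One irregular and one regular spelling in the same string.
const line = "\u0E2A\u0E4D\u0E32\u0E2B\u0E23\u0E31\u0E1A \u0E2A\u0E33";
for (const hit of findDecomposedSaraAm(line)) {
  console.log(hit.index, hit.context);
}
```

An empty result would suggest the file only uses the regular U+0E33 spelling, at least for directly adjacent pairs.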