snipsco / snips-nlu

Snips Python library to extract meaning from text
https://snips-nlu.readthedocs.io
Apache License 2.0
3.89k stars · 513 forks

Do you have plans to support the Thai language? #836

Open guftgift opened 5 years ago

guftgift commented 5 years ago

Dear Sir, do you have Thai language support on your roadmap?

adrienball commented 5 years ago

Hi @guftgift , I'm afraid it is not planned for now.

guftgift commented 5 years ago

Any suggestions on how to use snips with the Thai language?


adrienball commented 5 years ago

@guftgift It seems that the Thai language has no separator between words (correct me if I'm wrong), and snips-nlu is not meant to be used directly on inputs that are not whitespace-separated. This means that in order to work with Thai, you will first have to find a way to tokenize your data before using it with snips-nlu. This is true both for training and inference. Apparently, a romanization of the Thai language exists; that would be something to investigate (Google Translate gives a transliterated form with whitespaces).
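To illustrate the pre-tokenization step, here is a minimal sketch of a greedy longest-match dictionary segmenter that turns unsegmented text into the whitespace-separated form snips-nlu expects. The tokenizer, the toy vocabulary, and the Latin-letter example are all illustrative stand-ins; in practice you would use a real Thai word segmenter instead.

```python
# Sketch of a pre-tokenization step for text without word separators.
# The longest-match strategy and tiny vocabulary are illustrative only;
# a dedicated Thai segmentation library should be used in practice.

def longest_match_tokenize(text, vocabulary):
    """Greedy longest-match segmentation against a known vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest possible word starting at position i
        for j in range(len(text), i, -1):
            if text[i:j] in vocabulary:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # fall back to a single character
        tokens.append(match)
        i += len(match)
    return tokens

def to_whitespace_separated(text, vocabulary):
    """Produce the whitespace-separated input snips-nlu expects."""
    return " ".join(longest_match_tokenize(text, vocabulary))

# Toy example (Latin letters stand in for Thai characters)
vocab = {"turn", "on", "the", "light"}
print(to_whitespace_separated("turnonthelight", vocab))  # turn on the light
```

The same wrapping would have to be applied consistently to every training utterance and to every input at inference time, so that the model sees the same token boundaries in both cases.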

Once you've managed to handle this, you can try to set the language of your dataset to 'en' (English) and use the English default config to train your NLU engine, after having made the following changes to the "feature_factory_configs" attribute of the config:

- remove the word_cluster feature factory
- set use_stemming to False everywhere
- replace top_10000_words_stemmed by None everywhere

I can't guarantee that this will work well, but it is probably worth a try. Note that builtin entities will only work for values valid in English.
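These changes can also be applied programmatically. The exact nesting of the snips-nlu config dict varies between versions, so the sketch below only shows the transformation on a list of feature factory configs; the sample entries in the test, and the assumption that each entry carries a "factory_name" and an "args" dict, are based on the shape of the English default config and should be checked against your installed version.

```python
# Sketch: applying the three suggested changes to a list of feature
# factory configs. The "factory_name"/"args" structure is assumed from
# the shape of snips-nlu's English default config.

def adapt_feature_factories(factory_configs):
    adapted = []
    for factory in factory_configs:
        # 1) drop the word_cluster feature factory entirely
        if "word_cluster" in factory.get("factory_name", ""):
            continue
        args = dict(factory.get("args", {}))
        # 2) set use_stemming to False wherever it appears
        if "use_stemming" in args:
            args["use_stemming"] = False
        # 3) replace top_10000_words_stemmed by None wherever it appears
        for key, value in args.items():
            if value == "top_10000_words_stemmed":
                args[key] = None
        adapted.append({**factory, "args": args})
    return adapted
```

You would load the English default config, run its feature factory list through a function like this, and pass the modified config when building the engine.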

I hope this helps.

guftgift commented 5 years ago

Dear Adrien, the Thai language doesn't have word separators, but we can tokenize it. There are some open-source projects for Thai tokenization, such as https://github.com/PyThaiNLP/pythainlp.

Thank you for the information; I will try to follow your instructions. If you ever plan to support Thai, I would appreciate it very much, and I will try to contribute as much as I can.
