Closed HCIS2020 closed 6 years ago
Here is a function:
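def isChinese(c):
    # Inclusive (start, end) code point ranges
    r = [
        # Standard CJK characters
        (0x3400, 0x4DB5), (0x4E00, 0x9FA5), (0x9FA6, 0x9FBB), (0xF900, 0xFA2D),
        (0xFA30, 0xFA6A), (0xFA70, 0xFAD9), (0x20000, 0x2A6D6), (0x2F800, 0x2FA1D),
        # Full-width ASCII, full-width Chinese/English punctuation,
        # half-width katakana, half-width hiragana, half-width Hangul letters
        (0xFF00, 0xFFEF),
        # CJK radicals supplement
        (0x2E80, 0x2EFF),
        # CJK punctuation
        (0x3000, 0x303F),
        # CJK strokes
        (0x31C0, 0x31EF)]
    return any(s <= ord(c) <= e for s, e in r)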
Hi there, unfortunately I don't speak Chinese, so I may need a little help getting this to work.
From the examples you provide, it is treating 你好 as a single two-character word (as in English), but from the examples I take it that XX你好 is really 3 words: XX 你 好.
I therefore need to look at how to split the input string into English words and Chinese characters.
Is every Chinese symbol a unique word, or are there words of 2, 3 or more symbols?
I guess I would use your function (thanks) to parse each character and, if it returns true, separate it with spaces for the parser to work.
K
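The splitting idea described above can be sketched roughly as follows. This is only an illustration, not the splitter that ended up in Program-Y: the names `is_chinese` and `split_mixed` are hypothetical, and the ranges are copied from the function quoted at the top of the thread. Every character in those ranges becomes its own token, and everything else is split on whitespace.

```python
def is_chinese(ch):
    """Return True if ch falls inside one of the ranges quoted in the thread."""
    ranges = [
        (0x3400, 0x4DB5), (0x4E00, 0x9FA5), (0x9FA6, 0x9FBB), (0xF900, 0xFA2D),
        (0xFA30, 0xFA6A), (0xFA70, 0xFAD9), (0x20000, 0x2A6D6), (0x2F800, 0x2FA1D),
        (0xFF00, 0xFFEF), (0x2E80, 0x2EFF), (0x3000, 0x303F), (0x31C0, 0x31EF),
    ]
    return any(start <= ord(ch) <= end for start, end in ranges)


def split_mixed(text):
    """Hypothetical helper: pad each CJK character with spaces, then split on whitespace."""
    spaced = "".join(f" {ch} " if is_chinese(ch) else ch for ch in text)
    return spaced.split()


print(split_mixed("XX你好"))  # ['XX', '你', '好']
```

Note that this treats every Chinese character as a separate word, which is the assumption made in the comment above; it does no real word segmentation.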
Here is a project that implements an AIML parser and supports Chinese very well.
https://github.com/Decalogue/aiml3
You can find the code in aiml3/aiml/LangSupport.py
Thanks, there is useful code in that project. I'd already started on the work and created my own splitter, which is working fine.
Right now I need to add the splitter to the pattern parser, and then it should all be working.
Maybe 2 or 3 more days, including testing, and I'll have it ready for release.
Thanks for the info and the guidance, really useful
K
While you are in splitter territory, perhaps also take a look at sentence splitting for English? Using just '.' causes issues with email addresses, IP addresses, host names etc. I look for '. ' (dot space) instead, but it may be something to parameterise (e.g. Mandarin uses Unicode U+3002 for end of sentence), perhaps as an array so it can handle ! ? . 。 etc.
FYI - the Chinese particle 吗 indicates a question like ? does, but otherwise the sentence wording is identical to a statement (see link). Is there a way to add "metadata" to a sentence to indicate intent? E.g. ? or 吗 means question, ! or ALL CAPS means EXCLAMATION, and perhaps in future NLP could detect and flag sarcasm (yeah right). Not sure this is really supported in AIML, but maybe something for a future revision of the standard?
A key customer of my client is based in China. Chinese support has been asked about, but I suspect this will be a long journey :-)
Here is another chatbot framework based on NLP; you could try this project.
https://github.com/crownpku/Rasa_NLU_Chi
It already supports Chinese, but it lacks Chinese training data.
Hi Sean, could you give me some examples of what is failing? The normaliser and denormaliser should take care of this sort of thing.
In terms of sentence splitting and ending, I am looking at adding a configurable option where you can specify the characters for each; at the moment it just uses the default ".:;?!"
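A configurable splitter along the lines discussed here might look like the sketch below. It is an assumption-laden illustration, not Program-Y's code: the function name `split_sentences` and both parameters are hypothetical. ASCII terminators only end a sentence when followed by whitespace or the end of the text (the "dot space" rule), while CJK terminators such as U+3002 end a sentence immediately.

```python
def split_sentences(text, ascii_terminators=".?!", cjk_terminators="。？！"):
    """Hypothetical sentence splitter with configurable terminator sets."""
    sentences, current = [], []
    for i, ch in enumerate(text):
        current.append(ch)
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        # ASCII terminators must be followed by whitespace or end of input,
        # so dots inside email addresses, IPs and host names do not split.
        ends_ascii = ch in ascii_terminators and (next_ch == "" or next_ch.isspace())
        # CJK terminators are not normally followed by a space.
        ends_cjk = ch in cjk_terminators
        if ends_ascii or ends_cjk:
            sentence = "".join(current).strip()
            if sentence:
                sentences.append(sentence)
            current = []
    tail = "".join(current).strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Mail me at a.b@example.com. My IP is 10.0.0.1!"))
print(split_sentences("你好。你好吗？"))
```

Each returned sentence keeps its terminator, so a later stage could still inspect it.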
Hi Benny, I have a version of Program-Y that now works with as much Chinese as I can understand from the emails we had and the links you've sent.
I'll upload to master shortly, along with documentation of the new configuration settings and the pre- and post-processor options needed.
K
Code going to master later this afternoon after the build and tests complete, but have added documentation already
Pushed
Let me come back to you re any issues after I sync to your latest version. Perhaps I could do a better job of normalise / denormalise so also want to check that.
Hi keiffster:
I followed your multi-language support document and tested my AIML file. I found that the pattern (STAR)你好(STAR) matches only 是你好的; it does not match 是你好 or 你好的. When I use the pattern #你好# instead, it matches all three cases: 是你好的, 是你好 and 你好的.
The most important issue is:
When I have the following AIML:
`<?xml version="1" encoding="UTF-8" ?>
The star and _ patterns mean one or more, so that is working as you discovered, whereas # and ^ mean zero or more.
I’ll use your comprehensive grammar and do some testing around the language splitter
Thanks for being patient
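The one-or-more versus zero-or-more behaviour can be illustrated with a small matcher over word lists. This is a simplified sketch, not Program-Y's pattern graph: the function `matches` is hypothetical, it ignores the priority ordering between wildcards, and it assumes the input has already been split into one token per Chinese character.

```python
ONE_OR_MORE = {"*", "_"}
ZERO_OR_MORE = {"#", "^"}


def matches(pattern, words):
    """Return True if the list of pattern tokens matches the list of words."""
    if not pattern:
        return not words
    head, rest = pattern[0], pattern[1:]
    if head in ONE_OR_MORE or head in ZERO_OR_MORE:
        minimum = 1 if head in ONE_OR_MORE else 0
        # Let the wildcard consume n words, then match the remainder.
        return any(matches(rest, words[n:]) for n in range(minimum, len(words) + 1))
    return bool(words) and words[0] == head and matches(rest, words[1:])


star = ["*", "你", "好", "*"]
hash_ = ["#", "你", "好", "#"]
for sentence in ("是你好的", "是你好", "你好的"):
    words = list(sentence)  # one token per character
    print(sentence, matches(star, words), matches(hash_, words))
```

With these definitions the star pattern only accepts 是你好的, while the hash pattern accepts all three inputs, which is the behaviour reported above.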
I think I can see the problem: I have not applied the Chinese word split to the THAT processing, so it is seeing the THAT as a single word of multiple Chinese characters, and not as separate words.
Leave it with me and I'll get a fix.
keiffster:
I hope you can get this done as soon as possible. If this problem cannot be solved, our customers will not be able to easily find the service entrance for the AI service robot.
Still working on it, getting closer. Just needed to finish the admin tool enough for it to be used by another user and that went to master last night
I’ll get back into your problem now as the priority
K
Hi Keiffster,
Thanks for your great work on program-y. I have tested the new Chinese language support feature and found that it does not work well with the '*' character. I had made my own modification to support Chinese, and I did some additional work to make it handle the '*' character better. In my implementation, others can implement their own tokenizer to support other languages. You can check my modification in the commits below; I hope it helps program-y support multiple languages better:
https://github.com/tomliau33/program-y/commit/21f5a71d465331764022b38f79e5673f48517879
https://github.com/tomliau33/program-y/commit/c866048c4664e4755b1e5b6cb30d6f9aa35481c1
https://github.com/tomliau33/program-y/commit/e0dbb62765e71125644847d2ee284f189a0163ae
To enable Chinese, you need to add the configuration below to the "bot" and "brain" sections.

    # Tokenizer
    tokenizer:
        classname: programy.parser.tokenizer.CjkTokenizer
Thank you!
Wow awesome contribution, I’ll merge to a branch in the next couple of days and push out to everyone for testing
Thank you so so much
Hi Keiffster,
I have a question about the logging code. The logging code in program-y is as below:

    if logging.getLogger().isEnabledFor(logging.WARNING):
        logging.warning("'tokenizer' section missing from bot config, using defaults")

In my project, I only use logging.warning("log text") to log messages. I'm just curious what the benefit is of the check "if logging.getLogger().isEnabledFor(logging.WARNING):".
As far as I know, logging.warning('message') also calls the same isEnabledFor(logging.WARNING) method at the beginning of its implementation, so I'm curious whether there is another purpose in checking this condition before calling logging.warning().
Thank you!
Best regards, Chiyi
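One general reason for such a guard, which may or may not be the motivation in program-y, is that the arguments to a logging call are evaluated before the logger gets a chance to check its level. For a constant string like the one quoted above the guard saves very little, but when building the message is expensive it avoids that work entirely. The sketch below uses hypothetical names (`expensive_summary`, the logger name `tokenizer_demo`) to show the difference.

```python
import logging

calls = []


def expensive_summary():
    """Stand-in for costly work done only to build a log message."""
    calls.append(1)
    return "summary"


logger = logging.getLogger("tokenizer_demo")
logger.setLevel(logging.ERROR)  # WARNING messages are disabled

# Unguarded: expensive_summary() still runs, although nothing is logged,
# because arguments are evaluated before warning() checks the level.
logger.warning("state: %s", expensive_summary())

# Guarded: the expensive call is skipped completely.
if logger.isEnabledFor(logging.WARNING):
    logger.warning("state: %s", expensive_summary())

print(len(calls))  # 1
```

Passing arguments separately with `%s`, rather than pre-formatting the string, already defers the formatting itself until the record is actually emitted.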
Hi tomliau33:
I have tested your code for Chinese support. I found one problem with text such as the following:
200万 (上海,杭州), where the brackets are English style.
The program does not process it correctly. You can reproduce this problem using my AIML file above.
The issue is not with tokenizing but with the processing in set/get nodes. I'm assuming from the above text that you see the issue in the flowbot script that you created, and that (上海,杭州) is a selection.
I've seen the same problem and am currently working out what the issue is.
I merged the tokenizer code into the mainline over the weekend and fixed up a couple of minor issues. I have pushed what I have to a branch called 'tokenizer'. This is basically master with tokenizer code merged in.
If this works, then I have a couple of additional mods I want to make to streamline the configuration, but it seems to run your insurance flow
Hi keiffster,
I have also fixed the 200万 problem; you can check https://github.com/tomliau33/program-y/commit/7abaadb900b3133c42a901e4b73c95529a188dce for my fix.
The prev_is_cjk variable can be removed. My original idea for prev_is_cjk was to make a Chinese sentence better formatted when a non-Chinese word appears in it, e.g. ['我', '有', '200', '萬'] would be merged to "我有 200 萬". I didn't consider correctness. Actually, "我有200萬" is more correct, so in my fix I removed the additional condition check.
Best regards, Chiyi
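The joining behaviour described above can be sketched as follows. This is not the code from the linked commit: `is_cjk` and `join_tokens` are hypothetical names, and the range check is deliberately reduced to a few of the ranges quoted at the top of the thread. A space is inserted only between two adjacent non-CJK tokens, so Chinese text and embedded numbers are written without gaps.

```python
def is_cjk(ch):
    """Simplified check (assumption): a subset of the ranges quoted in the thread."""
    return any(start <= ord(ch) <= end for start, end in
               [(0x3400, 0x4DB5), (0x4E00, 0x9FBB), (0x3000, 0x303F), (0xFF00, 0xFFEF)])


def join_tokens(tokens):
    """Join tokens, adding a space only between two non-CJK neighbours."""
    out = ""
    for tok in tokens:
        if not tok:
            continue  # ignore empty tokens
        if out and not is_cjk(out[-1]) and not is_cjk(tok[0]):
            out += " "
        out += tok
    return out


print(join_tokens(["我", "有", "200", "萬"]))  # 我有200萬
print(join_tokens(["hello", "world"]))        # hello world
```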
Hi keiffster,
I have tested the new release. It solved the THAT problem, but there is a new problem when a CONDITION value contains Chinese. My program's run log is as follows:
sessionid: 12345678
C2 Value = [上 海]
C2 Condition = [上海]
C2 Condition = [北京]
C2 Condition = [深圳]
C2 Condition = [杭州]
C2 Condition = [西安]
Is the problem the logging statements, or a problem with the condition equality?
I left the logging in until I saw the solution working. I can remove and push to tokenizer branch if that helps
K
It's a problem with condition equality. When I set the condition match value to "上 海", with a space between the characters, the program works well.
I have tested the tokenizer branch; it works with * in PATTERN, and with Chinese in THAT and CONDITION.
I think you could remove the pre-processing and post-processing for Chinese support and just merge the tokenizer branch.
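The mismatch in the log above (a stored value of "上 海" compared against a condition of "上海") suggests that both sides need the same normalisation before they are compared. One possible approach, shown purely as a sketch and not as the fix that went into Program-Y, is to strip whitespace that sits between two CJK characters. The helper names are hypothetical, and the character class covers only common BMP ideograph ranges.

```python
import re

# Assumption: only the main BMP CJK ideograph ranges are considered here.
_CJK = "\u3400-\u4db5\u4e00-\u9fbb"
_CJK_GAP = re.compile(rf"(?<=[{_CJK}])\s+(?=[{_CJK}])")


def collapse_cjk_spaces(text):
    """Remove whitespace that lies between two CJK characters."""
    return _CJK_GAP.sub("", text)


def condition_equals(value, expected):
    """Compare two strings after applying the same normalisation to both."""
    return collapse_cjk_spaces(value.strip()) == collapse_cjk_spaces(expected.strip())


print(condition_equals("上 海", "上海"))  # True
print(condition_equals("上 海", "北京"))  # False
```

Spaces between non-CJK words, as in "New York", are left untouched.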
OK, let me look at it and see what the issue with spaces is. Once that's concluded I'll merge with master, which will also remove the logging statements.
I found it much easier to debug when I ran your text through Google Translate, as it took me a few goes to understand that one of the questions was a cell number and therefore why it was an int validation!
K
1.9 has significantly improved the support for Chinese. I have even fixed the 200 万 space issue. Should be with you soon
Now released in 1.9
Dear Sir,
I define a pattern as follow: `
` when I test this pattern, the results show that:
`
This means that * in a pattern matches English but not Chinese.
Please give me some clue as to how to fix this problem.