keiffster / program-y

Python 3.x based AIML 2.0 Chatbot interpreter, framework, related programs and knowledge files
https://keiffster.github.io/program-y/

It seems STAR(*) does not work properly in PATTERN with Chinese. #123

Closed HCIS2020 closed 6 years ago

HCIS2020 commented 6 years ago

Dear Sir,

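I defined a pattern as follows:

    <category>
    <pattern>* 你好</pattern>
    <template>
    你也好, from <star/>
    </template>
    </category>

When I test this pattern, the results are (input -> response):

    你好 -> 对不起,我不知道答案,还需要不断学习!
    XX 你好 -> 你也好, from XX
    XX你好 -> 对不起,我不知道答案,还需要不断学习!
    问你好 -> 对不起,我不知道答案,还需要不断学习!

Here 你也好, from XX means "Hello to you too, from XX", and 对不起,我不知道答案,还需要不断学习! is the default response ("Sorry, I don't know the answer; I still need to keep learning!").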

This suggests that * in a pattern matches English words but cannot match Chinese.

Please give me some pointers on how to fix this problem.

HCIS2020 commented 6 years ago

Here is a function:

    def isChinese(c):
        # http://www.iteye.com/topic/558050
        r = [
            # Standard CJK characters
            (0x3400, 0x4DB5), (0x4E00, 0x9FA5), (0x9FA6, 0x9FBB), (0xF900, 0xFA2D),
            (0xFA30, 0xFA6A), (0xFA70, 0xFAD9), (0x20000, 0x2A6D6), (0x2F800, 0x2FA1D),
            # Full-width ASCII, full-width Chinese/English punctuation,
            # half-width katakana, half-width hiragana, half-width Hangul letters
            (0xFF00, 0xFFEF),
            # CJK Radicals Supplement
            (0x2E80, 0x2EFF),
            # CJK Symbols and Punctuation
            (0x3000, 0x303F),
            # CJK Strokes
            (0x31C0, 0x31EF)]
        # True if the code point of c falls inside any of the ranges above
        return any(s <= ord(c) <= e for s, e in r)

keiffster commented 6 years ago

Hi there, unfortunately I don't speak Chinese, so I may need a little help getting this to work.

From the examples you provide, it is treating 你好 as a single 2-character word (as it would in English), but I take it that XX你好 is really 3 words: XX 你 好.

I therefore need to look at how to split an input string into English words and Chinese characters.

Is every Chinese symbol a unique word, or are there words of 2, 3 or more symbols?

I guess I would use your function (thanks) to check each character and, if it returns true, separate it with spaces so the parser can work.

K
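A minimal sketch of the idea described above, assuming a whitespace-based parser: every character that the range check flags as Chinese is surrounded by spaces so it is seen as its own word. The names `is_chinese` and `split_cjk` are illustrative only and are not Program-Y's API, and the range table is just the ideograph subset of the ranges posted earlier in this thread.

```python
# Ideograph subset of the ranges from the isChinese function posted above.
CJK_RANGES = [
    (0x3400, 0x4DB5), (0x4E00, 0x9FA5), (0x9FA6, 0x9FBB), (0xF900, 0xFA2D),
    (0xFA30, 0xFA6A), (0xFA70, 0xFAD9), (0x20000, 0x2A6D6), (0x2F800, 0x2FA1D),
]


def is_chinese(c):
    """Return True if the single character c falls in one of CJK_RANGES."""
    return any(start <= ord(c) <= end for start, end in CJK_RANGES)


def split_cjk(text):
    """Put spaces around every CJK character, then collapse extra whitespace."""
    spaced = "".join(" %s " % ch if is_chinese(ch) else ch for ch in text)
    return " ".join(spaced.split())
```

With this, `split_cjk("XX你好")` gives `"XX 你 好"`, which a pattern such as `* 你 好` could then match word by word.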

HCIS2020 commented 6 years ago

Here is a project that implements an AIML parser with very good Chinese support:

https://github.com/Decalogue/aiml3

You can find the code in aiml3/aiml/LangSupport.py

keiffster commented 6 years ago

Thanks, there is useful code in that project. I'd already started on the work and created my own splitter, which is working fine.

Right now I need to add the splitter to the pattern parser, and then it should all be working.

Maybe 2 or 3 more days including testing and I'll have it ready for release.

Thanks for the info and the guidance, really useful.

K

seghcder commented 6 years ago

While you are in splitter territory, perhaps also take a look at sentence splitting for English? Using just '.' causes issues with email addresses, IP addresses, host names, etc. I look for '. ' (dot space) instead, but it may be something to parameterise (e.g. Mandarin uses Unicode U+3002 for end of sentence) and perhaps it should also be an array, to handle ! ? . 。 etc.

FYI - the Chinese particle 吗 indicates a question like ? does, but otherwise the sentence wording is identical to a statement (see link). Is there a way to add "metadata" to a sentence to indicate intent? E.g. ? or 吗 means question, ! or ALL CAPS means EXCLAMATION, and perhaps in future NLP could detect and flag sarcasm (yeah right). I'm not sure this is really supported in AIML, but maybe it is something for a future revision of the standard?

A key customer of my client is based in China. Chinese support has been asked about, but I suspect this will be a long journey :-)

HCIS2020 commented 6 years ago

Here is another chatbot framework based on NLP. You could try this project:

https://github.com/crownpku/Rasa_NLU_Chi

It already supports Chinese, but lacks Chinese training data.

keiffster commented 6 years ago

Hi Sean, could you give me some examples of what is failing? The normaliser and denormaliser should take care of this sort of thing.

In terms of sentence splitting and ending, I am looking at adding a configurable option where you can specify the characters for each. At the moment it just uses the default ".:;?!"
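A rough sketch of what such a configurable splitter could look like, combining the "dot space" suggestion with a user-supplied set of end characters. The function name, parameters and default character set are assumptions made for illustration; they are not Program-Y's configuration options, and the defaults deliberately differ from the ".:;?!" mentioned above.

```python
import re


def split_sentences(text, terminators="!?;。！？", dot_needs_space=True):
    """Split text into sentences on a configurable set of end characters.

    With dot_needs_space=True an ASCII '.' only ends a sentence when it is
    followed by whitespace or the end of the text, so email addresses and
    IP addresses stay intact.
    """
    parts = []
    if terminators:
        # These characters end a sentence wherever they occur.
        parts.append("[%s]" % re.escape(terminators))
    parts.append(r"\.(?=\s|$)" if dot_needs_space else r"\.")
    pattern = "|".join(parts)
    return [s.strip() for s in re.split(pattern, text) if s.strip()]
```

For example, `split_sentences("Mail bob@example.com. Ping 10.0.0.1 first! 你好吗？我很好。")` yields four sentences and leaves the address and the IP untouched. U+3002 (。) is handled simply by listing it in `terminators`.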

keiffster commented 6 years ago

Hi Benny, I have a version of Program-Y that now works with as much Chinese as I can understand from the emails we had and the links you've sent.

I'll upload it to master shortly, along with documentation of the new configuration settings and the pre and post processor options that are needed.

K

keiffster commented 6 years ago

The code is going to master later this afternoon once the build and tests complete, but I have already added the documentation:

https://github.com/keiffster/program-y/wiki/Multi-Language

keiffster commented 6 years ago

Pushed

seghcder commented 6 years ago

Let me come back to you re any issues after I sync to your latest version. Perhaps I could do a better job of normalise / denormalise so also want to check that.

HCIS2020 commented 6 years ago

Hi keiffster:

I followed your multi-language support document and tested my AIML file. I found that the PATTERN (STAR)你好(STAR) only matches 是你好的; it does NOT match 是你好 or 你好的. When I use the PATTERN #你好# instead, it matches all three cases: 是你好的, 是你好 and 你好的.

The most important issue is this:

When I have the following AIML:

`<?xml version="1" encoding="UTF-8" ?>

START INSURANCE
INSURANCE STEP CLIENT * 你的姓名是什么? 退出 你的姓名是什么?
INSURANCE STEP VIP * 欢迎你,我们的VIP贵宾,您的密码是 退出 欢迎你,我们的VIP贵宾,您的密码是
INSURANCE STEP CARNO * 你的车牌号码是什么? 退出 你的车牌号码是什么?
INSURANCE STEP CITY * 你经常开车的城市是哪里? * 退出 你经常开车的城市是哪里? *
INSURANCE STEP CLASS * 你希望的保额是多少? * 退出 你希望的保额是多少? *
INSURANCE STEP CELLPHONE * 你的手机号码是什么? 退出 你的手机号码是什么? `

When the robot says 你的姓名是什么? ("What is your name?") and I enter my name as 名字, this PATTERN ` * 你的姓名是什么? ` does NOT match. I always get the default response.

keiffster commented 6 years ago

The * and _ wildcards mean one or more words, so that is working as you discovered, whereas # and ^ mean zero or more.

I’ll use your comprehensive grammar and do some testing around the language splitter

Thanks for being patient
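The wildcard rule described above can be illustrated with a small word-level matcher. This is only a sketch of the one-or-more versus zero-or-more distinction: the `matches` function is hypothetical, it assumes the input has already been split into one word per Chinese character, and it ignores the priority differences between the wildcards.

```python
ONE_OR_MORE = {"*", "_"}
ZERO_OR_MORE = {"#", "^"}


def matches(pattern, words):
    """Return True if the list of pattern tokens matches the list of words."""
    if not pattern:
        return not words
    head, rest = pattern[0], pattern[1:]
    if head in ONE_OR_MORE or head in ZERO_OR_MORE:
        minimum = 1 if head in ONE_OR_MORE else 0
        # Try every number of words the wildcard is allowed to consume.
        return any(matches(rest, words[n:])
                   for n in range(minimum, len(words) + 1))
    # Literal token: must equal the next input word.
    return bool(words) and words[0] == head and matches(rest, words[1:])
```

Under these assumptions `matches(list("*你好*"), list("是你好的"))` is true, while 是你好 and 你好的 fail because one of the two `*` wildcards has nothing to consume. With `#你好#` all three inputs match.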

keiffster commented 6 years ago

I think I can see the problem. I have not applied the Chinese word split to the THAT processing, so it is seeing that text as a single word made of multiple Chinese characters, not as separate words.

Leave it with me and I'll get a fix

HCIS2020 commented 6 years ago

keiffster:

I hope you can get it done as soon as possible. If this problem cannot be solved, our customers will not be able to easily find the service entrance for the AI service robot.

keiffster commented 6 years ago

Still working on it, getting closer. Just needed to finish the admin tool enough for it to be used by another user and that went to master last night

I’ll get back into your problem now as the priority

K

tomliau33 commented 6 years ago

Hi Keiffster,

Thanks for your great work on program-y. I tested the new Chinese language support and found that it does not work well with the '*' character. I have my own modification to support Chinese, and I did some additional work so that it handles '*' better. In my implementation, others can implement their own tokenizer to support other languages. You can check my modification in the commits below; I hope it helps program-y support multiple languages better:

https://github.com/tomliau33/program-y/commit/21f5a71d465331764022b38f79e5673f48517879
https://github.com/tomliau33/program-y/commit/c866048c4664e4755b1e5b6cb30d6f9aa35481c1
https://github.com/tomliau33/program-y/commit/e0dbb62765e71125644847d2ee284f189a0163ae

To enable Chinese, you need to add the configuration below to the "bot" and "brain" sections:

    # Tokenizer
    tokenizer:
        classname: programy.parser.tokenizer.CjkTokenizer

Thank you!

keiffster commented 6 years ago

Wow awesome contribution, I’ll merge to a branch in the next couple of days and push out to everyone for testing

Thank you so so much

tomliau33 commented 6 years ago

Hi Keiffster,

I have a question about the logging code. The logging code in program-y looks like this:

    if logging.getLogger().isEnabledFor(logging.WARNING):
        logging.warning("'tokenizer' section missing from bot config, using defaults")

In my project, I only use logging.warning("log text") to log a message. I'm curious what the benefit is of the check "if logging.getLogger().isEnabledFor(logging.WARNING):".

As far as I know, logging.warning('message') also calls the same isEnabledFor(logging.WARNING) check at the beginning of its implementation, so I'm curious whether there is another purpose for checking this condition before calling logging.warning().

Thank you!

Best regards, Chiyi
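For reference, a small sketch of the one situation where an explicit `isEnabledFor` guard changes behaviour in Python's standard `logging` module: when computing the message arguments is itself expensive. For a constant string like the one quoted above, the guard is redundant, because `Logger.warning` performs the same level check internally. Whether this was the motivation in Program-Y is not stated in this thread; `expensive_summary` is a made-up stand-in.

```python
import logging

# Raise the root logger's threshold so that WARNING records are dropped.
logging.getLogger().setLevel(logging.ERROR)

calls = []


def expensive_summary():
    """Stand-in for a costly computation that is only needed for logging."""
    calls.append(1)
    return "summary"


# Unguarded: the argument is evaluated even though the record is discarded.
logging.warning("state: %s", expensive_summary())

# Guarded: the costly call is skipped entirely while WARNING is disabled.
if logging.getLogger().isEnabledFor(logging.WARNING):
    logging.warning("state: %s", expensive_summary())
```

After running this, `calls` contains a single entry: only the unguarded line paid for `expensive_summary()`.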

HCIS2020 commented 6 years ago

Hi tomliau33:

I have tested your code for Chinese support. I found one problem, which occurs with text like the following:

200万 (上海,杭州), where the brackets are English-style (half-width).

The program does NOT process this correctly. You can reproduce the problem using my AIML file above.

keiffster commented 6 years ago

The issue is not with tokenizing but with the processing in the set/get nodes. I'm assuming from the above text that you see the issue in the flowbot script that you created, and that (上海,杭州) is a selection.

I've seen the same problem and am currently working out what the issue is.

keiffster commented 6 years ago

I merged the tokenizer code into the mainline over the weekend and fixed up a couple of minor issues. I have pushed what I have to a branch called 'tokenizer'. This is basically master with tokenizer code merged in.

If this works, then I have a couple of additional mods I want to make to streamline the configuration, but it seems to run your insurance flow

tomliau33 commented 6 years ago

Hi keiffster,

I have also fixed the 200万 problem. You can check https://github.com/tomliau33/program-y/commit/7abaadb900b3133c42a901e4b73c95529a188dce for my fix.

The prev_is_cjk variable can be removed. My original idea behind prev_is_cjk was to make a Chinese sentence better formatted when it contains non-Chinese words, e.g. "['我','有', '200', '萬']" would be merged to "我有 200 萬". I didn't consider correctness. Actually, "我有200萬" is more correct, so in my fix I removed the additional condition check.

Best regards, Chiyi
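One way to picture the merge behaviour described above is the sketch below. It is not the code from the linked commit: `is_cjk` and `join_tokens` are hypothetical helpers, and the rule used here (insert a space only between two adjacent non-CJK tokens) is an assumption chosen because it reproduces the "我有200萬" result.

```python
def is_cjk(token):
    """Rough check: True if the token starts with a CJK ideograph (U+4E00-U+9FA5)."""
    return bool(token) and 0x4E00 <= ord(token[0]) <= 0x9FA5


def join_tokens(tokens):
    """Join tokens, adding a space only between two adjacent non-CJK tokens."""
    out = []
    for i, token in enumerate(tokens):
        if i > 0 and not is_cjk(token) and not is_cjk(tokens[i - 1]):
            out.append(" ")
        out.append(token)
    return "".join(out)
```

Here `join_tokens(['我', '有', '200', '萬'])` gives `"我有200萬"`, while purely English tokens such as `['hello', 'world']` are still joined with a space.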

HCIS2020 commented 6 years ago

Hi keiffster,

I have tested the newly released program. It solved the THAT problem, but there is a new problem when a CONDITION VALUE contains Chinese. My program's run log is as follows:

    sessionid: 12345678
    C2 Value = [上 海]
    C2 Condition = [上海]
    C2 Condition = [北京]
    C2 Condition = [深圳]
    C2 Condition = [杭州]
    C2 Condition = [西安]

keiffster commented 6 years ago

Is the problem the logging statements, or is it a problem with the condition equality?

I left the logging in until I saw the solution working. I can remove and push to tokenizer branch if that helps

K

HCIS2020 commented 6 years ago

It's a problem of condition equality. When I set the condition match value to "上 海", with a space between the characters, the program works well.

I have tested the tokenizer branch; it works with * in PATTERN, and with THAT and CONDITION containing Chinese.

I think you could remove the preprocess and postprocess steps for Chinese support and just merge the tokenizer branch.
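The mismatch in the log above ("上 海" versus "上海") comes down to one side of the comparison having been split into characters while the other has not. A minimal sketch of a comparison that tolerates this, assuming it is acceptable to drop whitespace between two CJK ideographs before comparing; `same_value` is a hypothetical helper, not how Program-Y resolves conditions.

```python
import re

# Whitespace that sits directly between two CJK ideographs (U+4E00-U+9FA5).
_CJK = "\u4e00-\u9fa5"
_GAP = re.compile(r"(?<=[%s])\s+(?=[%s])" % (_CJK, _CJK))


def same_value(a, b):
    """Compare two values, ignoring spaces inserted between CJK characters."""
    return _GAP.sub("", a.strip()) == _GAP.sub("", b.strip())
```

With this, `same_value("上 海", "上海")` is true, while spaces between English words remain significant, so "New York" and "NewYork" still differ.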

keiffster commented 6 years ago

OK, let me look at it and see what the issue with spaces is. Once that's concluded I'll merge with master, which will also remove the logging statements.

I found it much easier to debug when I ran your text through Google Translate, as it took me a few goes to understand that one of the questions was a cell number and therefore why it was an int validation !!!!

K

keiffster commented 6 years ago

1.9 has significantly improved the support for Chinese. I have even fixed the 200 万 space issue. Should be with you soon

keiffster commented 6 years ago

Now released in 1.9