micromax / simplenlg

Automatically exported from code.google.com/p/simplenlg

Words in complete Upper Case (4.2) #13

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Run the testPredModification test in NounPhraseTest.java
(provided with the 4.2 package).

What is the expected output? What do you see instead?
Expected: the salacious man
Actual: the salacious MAN

What version of the product are you using? On what operating system?
4.2 on Windows XP

Please provide any additional information below.
In 4.2 I am noticing a problem where words are output in upper case (see the
example above, "MAN"). I was testing with 4.1 before and this problem was not
there. Is there a solution for this?

Original issue reported on code.google.com by manish.s...@gmail.com on 6 May 2011 at 12:40

GoogleCodeExporter commented 8 years ago
When I run this test (on an XP machine), it works fine.  Which lexicon are you 
using?  Have you modified it?

Original comment by ehud.rei...@gmail.com on 9 May 2011 at 4:35

GoogleCodeExporter commented 8 years ago
We did remove some incomplete categories from the lexicon. But when I run this
query on the lexicon:
select * from lex_record where base like 'man'

I get 3 rows:
MAN noun
man noun
man verb

So I believe it is picking up the first entry, which is an acronym. Is there a
relationship in the lexicon that needs to be preserved and that dictates the
order of the output?
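
The effect can be reproduced outside the database. The sketch below is only an illustration, not simplenlg code: the class and method names are hypothetical, and the three rows are hard-coded stand-ins for the query result above. It compares a case-insensitive match (what the query appears to be doing on MS SQL) with an exact-case match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; it does not exist in simplenlg.
final class CaseMatchingDemo {

    // Stand-ins for the three lex_record rows quoted above: {base, category}.
    static final List<String[]> ROWS = Arrays.asList(
            new String[] {"MAN", "noun"},
            new String[] {"man", "noun"},
            new String[] {"man", "verb"});

    // Mimics a case-insensitive comparison on the base column.
    static List<String> matchIgnoringCase(String base) {
        List<String> hits = new ArrayList<>();
        for (String[] row : ROWS) {
            if (row[0].equalsIgnoreCase(base)) {
                hits.add(row[0] + " " + row[1]);
            }
        }
        return hits;
    }

    // Mimics a case-sensitive comparison on the base column.
    static List<String> matchExactCase(String base) {
        List<String> hits = new ArrayList<>();
        for (String[] row : ROWS) {
            if (row[0].equals(base)) {
                hits.add(row[0] + " " + row[1]);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(matchIgnoringCase("man")); // [MAN noun, man noun, man verb]
        System.out.println(matchExactCase("man"));    // [man noun, man verb]
    }
}
```

With the case-insensitive match, the acronym row comes back first, so any code that takes the first noun ends up with "MAN".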

Original comment by manish.s...@gmail.com on 22 May 2011 at 6:57

GoogleCodeExporter commented 8 years ago
Adding to the above, we are using MS SQL server.

Original comment by manish.s...@gmail.com on 22 May 2011 at 6:58

GoogleCodeExporter commented 8 years ago
I think the problem is that HSQL (the default DB for the NIH Specialist
Lexicon) does case-sensitive matching by default, while most other DBs do
case-insensitive matching by default.

I have committed a change to the lexicon class (under the source tab) which
hopefully should fix this, but I can't test it since I use HSQL. Could you
test this (you'll need to download and compile the source) and let me know if
it solves the problem?

Another alternative would be to change the design of the lexicon table in MS
SQL. I don't know MS SQL well, but most DBs allow case
sensitivity/insensitivity to be specified for a column as part of the design of
the table.
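
A third option, sketched here for illustration only, is to force case-sensitive matching in the query itself with a SQL Server COLLATE clause. This is not the change committed to the lexicon class. The table and column names (lex_record, base) come from the query quoted earlier in this thread; the class name, the method names, and the choice of the Latin1_General_CS_AS collation are assumptions.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper; it does not exist in simplenlg.
final class CaseSensitiveQuery {

    // Assumed collation: CS = case-sensitive, AS = accent-sensitive.
    static final String COLLATION = "Latin1_General_CS_AS";

    // Builds a lookup on the base column that compares case-sensitively
    // regardless of the column's or database's default collation.
    static String baseLookupSql() {
        return "SELECT * FROM lex_record WHERE base COLLATE "
                + COLLATION + " = ?";
    }

    // Binds the requested base form to the single query parameter.
    static PreparedStatement prepareLookup(Connection conn, String base)
            throws SQLException {
        PreparedStatement ps = conn.prepareStatement(baseLookupSql());
        ps.setString(1, base);
        return ps;
    }

    public static void main(String[] args) {
        System.out.println(baseLookupSql());
    }
}
```

Setting the collation on the column or database, as suggested above, has the advantage that every query benefits without code changes.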

Original comment by ehud.rei...@gmail.com on 25 May 2011 at 8:21

GoogleCodeExporter commented 8 years ago
Thanks for the change. It solved the issue with NounPhraseTest.java. However,
in testStringRecognition in ClauseTest.java I am getting "my cat is SAD"
instead of "my cat is sad". Debugging into the code, it seems "sad" is being
recognized as a noun here. When fetching the lexical records from the NIH DB,
"sad" is an adjective and "SAD" is a noun, so it picks "SAD", the noun. I am
referring to the function getWordsFromLexResult(), which fetches the lex
records and compares them against the category.
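
To make the interaction between the category check and case-insensitive matching concrete, here is a minimal sketch. It is not the real getWordsFromLexResult(); the LexRow type, the find method, and the category labels "adj" and "noun" are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical demo class; it does not exist in simplenlg.
final class CategoryAwareLookup {

    /** Hypothetical minimal view of a lexicon row: base form plus category. */
    static final class LexRow {
        final String base;
        final String category;

        LexRow(String base, String category) {
            this.base = base;
            this.category = category;
        }
    }

    /**
     * Returns the base form of the first row whose category matches and whose
     * base equals the requested string. When caseSensitive is false the base
     * comparison ignores case, as a case-insensitive DB collation would.
     */
    static Optional<String> find(List<LexRow> rows, String base,
                                 String category, boolean caseSensitive) {
        for (LexRow row : rows) {
            boolean baseMatches = caseSensitive
                    ? row.base.equals(base)
                    : row.base.equalsIgnoreCase(base);
            if (baseMatches && row.category.equals(category)) {
                return Optional.of(row.base);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<LexRow> rows = Arrays.asList(
                new LexRow("sad", "adj"),
                new LexRow("SAD", "noun"));
        System.out.println(find(rows, "sad", "noun", false)); // Optional[SAD]
        System.out.println(find(rows, "sad", "noun", true));  // Optional.empty
        System.out.println(find(rows, "sad", "adj", true));   // Optional[sad]
    }
}
```

With case-insensitive matching, asking for "sad" as a noun returns the acronym. With exact-case matching the same request finds nothing, so the caller can fall back to the string as typed.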

Original comment by manish.s...@gmail.com on 6 Jun 2011 at 3:42

GoogleCodeExporter commented 8 years ago
Since we are not testing simplenlg with the lexicon held in an MS SQL DB, I
suspect this kind of issue will keep arising. I've discussed this with the
Specialist Lexicon people; their advice (which I agree with) is to set up MS
SQL (at the DB or table level) to do case-sensitive matching for the lexicon.

Original comment by ehud.rei...@gmail.com on 8 Jun 2011 at 2:32