spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License

Lexicon adjective preferences are ignored in favor of capitalization #437

Open · IntegerMan opened this issue 6 years ago

IntegerMan commented 6 years ago

I discovered this one during unit testing earlier today:

Given that `lexicon` is a compatible lexicon object containing an entry defining `your` as an `Adjective`, I'm finding that casing affects how results are interpreted.

If I make a call like `const data: LanguageTerm[] = nlp('Get the Your Front Yard', lexicon).terms().data();`, I find that `Your` gets a best tag of `Noun`, with a `TitleCase` tag also applied, but no `Adjective` tag.

On the other hand, if I call `const data: LanguageTerm[] = nlp('Get the your Front Yard', lexicon).terms().data();`, `your` comes back with a best tag of `Adjective`.

The docs currently state that all lexicon entries should be in normalized (lowercase) form, so I suspect something inside compromise is overriding my lexicon preferences when it encounters capitalized input.
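For reference, here's a trimmed-down version of the setup (my real lexicon has more entries; only the `your: 'Adjective'` one matters here):

```ts
import nlp from 'compromise';

// lexicon entry that should mark 'your' as an adjective
const lexicon = { your: 'Adjective' };

// Title-cased input: 'Your' comes back with a best tag of Noun (plus TitleCase)
const titleCased = nlp('Get the Your Front Yard', lexicon).terms().data();

// Lower-cased input: 'your' comes back with a best tag of Adjective, as expected
const lowerCased = nlp('Get the your Front Yard', lexicon).terms().data();
```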

As for the strange casing and poor grammar of my input: it's text generated by my testing library, given an entry of "Your Front Yard", to test that a "get" verb fails on that object (and that the engine knows which object I'm referring to). I don't expect inputs this strange in practice, but this does look like a bug on the compromise side, and it would be nice to see a fix or get a workaround.

IntegerMan commented 6 years ago

This was encountered on version 11.2.1

spencermountain commented 6 years ago

hey Matt, yeah, you're doing everything right. In that case it's being tagged as an adjective, then overwritten by some later tagging rules. you can see this if you call `nlp.verbose('tagger')` before your `nlp()` call. we just don't have a way to flag lexicon-determined words so they trump the other rules. we should do that!
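roughly like this (variable names here are just placeholders):

```ts
import nlp from 'compromise';

const lexicon = { your: 'Adjective' };

// log each tagging step to the console, then parse as usual
nlp.verbose('tagger');
const doc = nlp('Get the Your Front Yard', lexicon);
```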

IntegerMan commented 6 years ago

That's super cool. I've added that to my debug builds to help with diagnostics. For reference, what I'm seeing is:

```
Input sentence: 'get the Your Front Yard'
compromise.js:325 'your'        ->   TitleCase      punct-rule
compromise.js:325 'get'         ->   Infinitive     lexicon
compromise.js:325 'get'         ->   PresentTense --> Infinitive
compromise.js:325 'get'         ->   Verb --> PresentTense
compromise.js:325 'get'         ->   VerbPhrase --> Verb
compromise.js:325 'the'         ->   Determiner     lexicon
compromise.js:325 'your'        ->   Adjective      lexicon
compromise.js:325 'frontyard'   ->   Singular       regex-list
compromise.js:325 'frontyard'   ->   Noun --> Singular
compromise.js:325 'your'        ->   Noun           capital-step
compromise.js:332 'your'        ~*   Adjective      capital-step
compromise.js:325 'your'        ->   Singular       pluralStep
```

I'm not entirely sure how to interpret this output in terms of why things get categorized the way they do, but it's interesting. It's also a good reminder that I'm doing sentence-level replacement before sending text into nlp, so in this case the input sentence was actually `Get the Your frontyard`. That doesn't really change the result, but it will help keep me honest in future reports.

As for a workaround to get my test passing, I can check each token and its normal against my lexicon after compromise has had a go, just as a sanity check.
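Something along these lines (a rough sketch, assuming the `normal` and `bestTag` fields I'm seeing in the `.data()` output):

```ts
import nlp from 'compromise';

const lexicon: Record<string, string> = { your: 'Adjective' };

// After compromise has tagged everything, re-check each term's normalized
// form against my lexicon and prefer the lexicon tag when they disagree.
const terms = nlp('Get the Your Front Yard', lexicon).terms().data();

const corrected = terms.map((term: any) => {
  const preferred = lexicon[term.normal];
  if (preferred && term.bestTag !== preferred) {
    return { ...term, bestTag: preferred };
  }
  return term;
});
```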