microsoft / Cognitive-LinguisticAnalysis-Windows

Windows SDK for the Microsoft Linguistic Analysis API, part of Cognitive Services
https://www.microsoft.com/cognitive-services/en-us/linguistic-analysis-api

Tagging problems leading to entity extraction problems #7

Open ghost opened 6 years ago

ghost commented 6 years ago

The API appears to tag words inconsistently: words that should be tagged as nouns or proper nouns are tagged that way in some instances, but as verbs or adjectives in others. I believe the utterances I'm using are fairly straightforward English that the API shouldn't have much trouble with, yet the mis-tagging is causing problems when extracting entities from them. Take the following utterances, for example.

For my purposes, each utterance starts with an uppercase letter and is otherwise forced into lowercase, so the API can't rely on capitalisation to decide whether a word is a proper noun. We take the API's output and run a further bit of code on our end to assess whether each word is an entity. (That code relies on the root tag and certain patterns of tags, so we could modify it on our end as a temporary workaround.)
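The casing rule described above can be sketched roughly as follows. This is only an illustration: the function name `normalize_utterance` is hypothetical, and the actual preprocessing code used on our end isn't shown in this issue.

```python
def normalize_utterance(text: str) -> str:
    """Uppercase the first character and force everything else to lowercase.

    Hypothetical helper: it mirrors the casing rule described in the issue,
    not the real preprocessing code.
    """
    text = text.strip()
    # Slicing keeps this safe for empty strings (text[:1] is then "").
    return text[:1].upper() + text[1:].lower()


print(normalize_utterance("call Edward Smith and Vincenzo De Campo"))
```

After this step, names such as "Edward Smith" carry no capitalisation cue, so the tagger has to rely on context alone.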

"Call edward smith and vincenzo de campo":

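| Phrase | Root Tag | Tags | Is Entity |
| --- | --- | --- | --- |
| Call | VP | VP-VP-VB | False |
| edward | VP | VP-VP-S-NP-NN | True |
| smith | VP | VP-VP-S-ADJP-JJ | True |
| and | VP | VP-CC | False |
| vincenzo | VP | VP-VP-VB | False |
| de | VP | VP-VP-PP-IN | True |
| campo | VP | VP-VP-PP-NN | True |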

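Here, "edward" is interpreted as a noun, so our system picks it up as an entity along with "smith", as they're found in the right context. Note, however, that "smith" is tagged as an adjective (JJ) and "vincenzo" as a verb (VB), so "vincenzo" is missed as an entity.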
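To make the mis-tagging easier to spot, the rows of the first table above can be checked mechanically. The sketch below assumes that each entry in the Tags column is a hyphen-joined path from the root label down to the part-of-speech tag, and that every part of a person's name ought to carry a Penn Treebank noun tag. Both are assumptions for illustration, and the helper names are hypothetical.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # Penn Treebank noun tags

# (word, tag path) rows copied from the first table above.
ROWS = [
    ("Call", "VP-VP-VB"),
    ("edward", "VP-VP-S-NP-NN"),
    ("smith", "VP-VP-S-ADJP-JJ"),
    ("and", "VP-CC"),
    ("vincenzo", "VP-VP-VB"),
    ("de", "VP-VP-PP-IN"),
    ("campo", "VP-VP-PP-NN"),
]

# Words we expect to be name parts (assumption made for this check).
EXPECTED_NAMES = {"edward", "smith", "vincenzo", "de", "campo"}


def split_path(path):
    """Return (root label, leaf part-of-speech tag) of a hyphen-joined path."""
    parts = path.split("-")
    return parts[0], parts[-1]


def mistagged(rows, expected_names):
    """List (word, leaf tag) pairs for expected names lacking a noun tag."""
    result = []
    for word, path in rows:
        _, leaf = split_path(path)
        if word.lower() in expected_names and leaf not in NOUN_TAGS:
            result.append((word, leaf))
    return result


print(mistagged(ROWS, EXPECTED_NAMES))
```

For this utterance the check flags "smith" (JJ), "vincenzo" (VB) and "de" (IN). Whether "de" should count as a noun is debatable, so the expected-name list is best treated as a diagnostic aid rather than ground truth.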

"call edith walker and edward smith":

| Phrase | Root Tag | Tags | Is Entity |
| --- | --- | --- | --- |
| Call | VP | VP-VP-VB | False |
| edith | VP | VP-VP-S-ADJP-RB | True |
| walker | VP | VP-VP-S-ADJP-JJ | True |
| and | VP | VP-CC | False |
| edward | VP | VP-VP-VB | False |
| smith | VP | VP-VP-JJ | True |

In this instance, the API doesn't recognise that "and" is extending the initial verb, so it seems to interpret "edward" as a verb rather than a noun, and only "smith" is picked up as an entity. This is understandable as something the API isn't yet prepared to handle, but it would be a lot more helpful if the API could make the connection that "call edith walker and edward smith" really reads as "call edith walker and call edward smith".
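Until the API handles this, one possible client-side workaround is to expand the coordination before sending the text, so that each conjunct repeats the leading verb. The sketch below is a deliberately naive illustration: the function name is hypothetical, it assumes the first word is the verb and that "and" only joins objects of that verb, and it has not been tested against the API to confirm that the tagging improves.

```python
def expand_coordination(utterance):
    """Split 'Verb X and Y' into ['Verb X', 'Verb Y'].

    Naive assumption: the first word is the verb, and every ' and '
    separates objects of that verb.
    """
    words = utterance.split()
    if len(words) < 2:
        return [utterance]
    verb, rest = words[0], " ".join(words[1:])
    # Split the remainder on the conjunction and re-attach the verb.
    conjuncts = [part.strip() for part in rest.split(" and ")]
    return [f"{verb} {part}" for part in conjuncts if part]


print(expand_coordination("Call edith walker and edward smith"))
```

This only suits simple "verb X and Y" commands. An utterance such as "Book a meeting with edward walker and edith smith" would be split incorrectly, because the conjunction there joins two names inside a prepositional phrase rather than two objects of the verb.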

"Book a meeting with edward walker and edith smith":

| Phrase | Root Tag | Tags | Is Entity |
| --- | --- | --- | --- |
| Book | VP | VP-NN | False |
| a | VP | VP-NP-NP-NP-DT | False |
| meeting | VP | VP-NP-NP-NP-NN | False |
| with | VP | VP-NP-NP-PP-IN | False |
| edward | VP | VP-NP-NP-PP-NP-JJ | True |
| walker | VP | VP-NP-NP-PP-NP-NN | True |
| and | VP | VP-NP-CC | False |
| edith | VP | VP-NP-NP-JJ | False |
| smith | VP | VP-NP-NP-NN | False |

Firstly, the API doesn't recognise that "book" in this context is a verb, like "call" in the previous examples; it is tagged as a noun (NN). More importantly for entity extraction, the API seems to interpret "edward" as an adjective modifying the following noun "walker". That reading would only make sense in English if the phrase also included an article, as in "Book a meeting with [a] blue Walker", so this seems to be an error.

"Register edwin smith as a new employee":

| Phrase | Root Tag | Tags | Is Entity |
| --- | --- | --- | --- |
| Register | ADJP | ADJP-ADJP-RB | False |
| edwin | ADJP | ADJP-ADJP-RB | False |
| smith | ADJP | ADJP-ADJP-JJ | True |
| as | ADJP | ADJP-PP-IN | False |
| a | ADJP | ADJP-PP-NP-DT | False |
| new | ADJP | ADJP-PP-NP-JJ | True |
| employee | ADJP | ADJP-PP-NP-NN | True |

Firstly, the API wrongly reads the verb "register" as an adverb (RB), which would affect how it interprets the following words. The name "edwin" is then also interpreted as an adverb rather than a noun or proper noun. Of the name, only "smith" is picked up as an entity, and the table shows it tagged as an adjective (JJ) rather than a noun.

Full Penn Treebank tag list: http://web.mit.edu/6.863/www/PennTreebankTags.html