tomlm / lucy

Language understanding library
MIT License

entity extraction question #2

Open msmsf opened 3 years ago

msmsf commented 3 years ago

hi,

I hope to extract entities from the examples below:

I tried several ways of writing the Lucy patterns, but it looks like I can't handle these two cases at the same time. Whenever the pattern "what did (speaker:___)+3 think of (x:___)*" is present, the pattern "what did (speaker:___)+3 think of (x:___)* when discussing (y:___)*" is ignored. Is there a way to resolve this? Thanks!

details of trials: image

tomlm commented 2 years ago

Wow. First, thank you for trying out my library, and second, sorry that I didn't see your issue until now.

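Here's the basic issue: wildcard patterns with * are very greedy, and are actually ambiguous from a human interpretation standpoint. In your case the trailing (x:___)* of the shorter pattern can also absorb "when discussing ...", so both patterns are valid readings of the same sentence.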
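To make the ambiguity concrete, here is a small sketch using Python's standard `re` module as a stand-in. It is not Lucy's API, and Lucy's matcher may behave differently in detail. The `{0,2}` bound reflects an assumed reading of `+3` as "up to three words", and the named groups are only illustrative.

```python
import re

# Plain regex stand-in for the shorter pattern; NOT Lucy's API.
# (?P<topic>.+) plays the role of the open-ended trailing wildcard.
# Assumption: "+3" means "up to three words", modeled here with {0,2}.
BROAD = re.compile(
    r"what did (?P<speaker>\w+(?: \w+){0,2}?) think of (?P<topic>.+)"
)

text = "what did jon smith think of topic xyz when discussing topic pdq"
match = BROAD.fullmatch(text)

speaker = match.group("speaker")
topic = match.group("topic")

# The shorter pattern already accounts for the whole sentence, because the
# trailing wildcard swallows the "when discussing ..." clause as well.
print(speaker)
print(topic)
```

Since the shorter pattern is a complete match on its own, nothing in the text forces the matcher to prefer the longer pattern.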

So what do you do? I think what you want is to make two recognition calls to Lucy, where the second call runs on the wildcard text captured by the first, to check whether it contains additional matches.

entities:
  - name: '@whatSpeaker'
    patterns:
      - what did (the)? (speaker:___)+3 think (of|about) (topic:___)+* 

Sample statement:

what did jon smith think of topic xyz when discussing topic pdq

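The first call returns the @whatSpeaker match, with @topic capturing everything after "think of": "topic xyz when discussing topic pdq".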

Now the trick is that you want to further interpret the open-ended wildcard entity @topic, so you run it through a different Lucy model to disambiguate further.

  - name: '@subtopics'
    patterns:
      - (x:___)* when discussing (y:___)+*

giving

==== subtopics (3)
topic xyz when discussing topic pdq
^_______^                           @x
                          ^_______^ @y
^_________________________________^ @subtopics

 @subtopics [0,35] @y,@x
    =>  @x [0,9] 'topic xyz' Resolution:"topic xyz"
    =>  @y [26,35] 'topic pdq' Resolution:"topic pdq"

In short, here's the rule of thumb:

You should probably model only one open-ended wildcard at a time. If you need further disambiguation, run it through a separate model.
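The two-call flow can be sketched roughly as follows. This again uses plain Python regexes as hypothetical stand-ins for the two Lucy models, not Lucy's actual API. The names `WHAT_SPEAKER`, `SUBTOPICS`, and `recognize` are made up for illustration, and `{0,2}` is an assumed reading of `+3` as "up to three words".

```python
import re
from typing import Dict

# Hypothetical stand-ins for the two models; plain regex, NOT Lucy's API.
# First model: one open-ended wildcard (topic) at the end.
WHAT_SPEAKER = re.compile(
    r"what did (?:the )?(?P<speaker>\w+(?: \w+){0,2}?) "
    r"think (?:of|about) (?P<topic>.+)"
)
# Second model: splits the captured topic text on "when discussing".
SUBTOPICS = re.compile(r"(?P<x>.+?) when discussing (?P<y>.+)")


def recognize(text: str) -> Dict[str, str]:
    """Run the broad pattern first, then refine its wildcard capture."""
    first = WHAT_SPEAKER.fullmatch(text)
    if first is None:
        return {}
    result = {"speaker": first.group("speaker")}
    topic = first.group("topic")
    # Second call: only the wildcard text from the first call is examined.
    second = SUBTOPICS.fullmatch(topic)
    if second is not None:
        result["x"] = second.group("x")
        result["y"] = second.group("y")
    else:
        # No "when discussing" clause: the whole topic is x.
        result["x"] = topic
    return result


print(recognize("what did jon smith think of topic xyz when discussing topic pdq"))
print(recognize("what did jon smith think of topic xyz"))
```

Each stage has only one open-ended capture to resolve, so both of the original cases are handled without the two patterns competing for the same text.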