knowitall / taggers

Easily identify and label sentence intervals using various taggers.

Create TypePatternTagger to ease tagging types #12

Open schmmd opened 11 years ago

schmmd commented 11 years ago

Hi John, how about we move your contribution into Taggers? Often I just need to think of a good way it fits in--any help is appreciated ;-)

Maybe we can create a new tagger called TypePatternTagger. Maybe you can think of a better name. This tagger would perform a substitution for the type matching syntax. Do you have any suggestions? I thought of <<TypeName>> but I only somewhat like it. I think it would need to create the sequence <typeStart='TypeName'> <typeCont='TypeName'>*.

With this new tagger, we could have patterns such as:

<<VerbPhrase>> <<NounPhrase>> <pos='JJ'>

What do you think? Any chance you could look at this on Monday? I think it would be pretty straightforward and it would get you used to my changes.
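A minimal sketch of the substitution proposed above, assuming the `<<TypeName>>` syntax and the `<typeStart='TypeName'> <typeCont='TypeName'>*` expansion from this comment. The object name `TypePatternSketch`, the method `expand`, and the restriction of type names to word characters are hypothetical choices for illustration, not part of Taggers.

```scala
import scala.util.matching.Regex

object TypePatternSketch {
  // Hypothetical syntax from the discussion: <<TypeName>>.
  // Assumption: type names consist of word characters only.
  private val typeRef: Regex = """<<(\w+)>>""".r

  /** Rewrites each <<TypeName>> into <typeStart='TypeName'> <typeCont='TypeName'>*. */
  def expand(pattern: String): String =
    typeRef.replaceAllIn(pattern, m => {
      val name = m.group(1)
      // quoteReplacement guards against '$' or '\' being read as group references.
      Regex.quoteReplacement(s"<typeStart='$name'> <typeCont='$name'>*")
    })
}
```

Under these assumptions, `TypePatternSketch.expand("<<VerbPhrase>> <pos='JJ'>")` yields `<typeStart='VerbPhrase'> <typeCont='VerbPhrase'>* <pos='JJ'>`; ordinary token expressions pass through unchanged.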

schmmd commented 11 years ago

Nope, didn't get an e-mail when it was opened. It seems to me that the replacement would need to be

( <typeStart='x' & typeEnd='x'> | ( <typeStart='x'> <typeCont='x'>* <typeEnd='x'> ) )

I'll look at this in the afternoon, I'm trying to get Dan some Entity Linking results on different data.

John

schmmd commented 11 years ago

I think they are the same because it's greedy. Note typeCont means not typeStart (but it could be typeEnd too).

jgilme1 commented 11 years ago

lazy val typesContinuingAtToken = types -- typesBeginningAtToken -- typesEndingAtToken

but I guess we could change that.
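To make the two readings of typeCont concrete, here is a small sketch that compares the older definition quoted above with the one described earlier in the thread ("typeCont means not typeStart, but it could be typeEnd too"). The `Type` case class with a half-open token interval and all helper names are hypothetical stand-ins, not the actual Taggers classes.

```scala
// Hypothetical minimal model: a type is a name plus a half-open token interval [start, end).
final case class Type(name: String, start: Int, end: Int) {
  def covers(i: Int): Boolean = start <= i && i < end
}

object TypeContSketch {
  def typesAt(all: Set[Type], i: Int): Set[Type] = all.filter(_.covers(i))
  def beginningAt(all: Set[Type], i: Int): Set[Type] = typesAt(all, i).filter(_.start == i)
  def endingAt(all: Set[Type], i: Int): Set[Type] = typesAt(all, i).filter(_.end == i + 1)

  // Older definition quoted in the thread: neither the first nor the last token of the type.
  def continuingOld(all: Set[Type], i: Int): Set[Type] =
    typesAt(all, i) -- beginningAt(all, i) -- endingAt(all, i)

  // Definition described in the thread: any token after the first, including the last.
  def continuingNew(all: Set[Type], i: Int): Set[Type] =
    typesAt(all, i) -- beginningAt(all, i)
}
```

For a three-token type covering tokens 0 to 2, the older definition marks only token 1 as continuing, while the newer one marks tokens 1 and 2. That difference is why `<typeStart='x'> <typeCont='x'>*` needs no explicit `<typeEnd='x'>` under the newer definition.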

jgilme1 commented 11 years ago

OH, woops, I must have an older version.

jgilme1 commented 11 years ago

I agree the replacement pattern you suggested should work.

schmmd commented 11 years ago

Yeah, I changed it and it's confusing. Do you think the current definition is OK? It seems better than the old one to me (typeCont just means that we're on a token where the type is continuing).

On Mon, Sep 30, 2013 at 9:09 AM, John Gilmer notifications@github.com wrote:

OH, woops, I must have an older version.

jgilme1 commented 11 years ago

The definition seems fine.

I think <> is OK, but I've come to think of "<>" as meaning a token. What other characters are at our disposal?

{VerbPhrase} 'VerbPhrase' ^VerbPhrase^

schmmd commented 11 years ago

Let's do {VerbPhrase} but you will want to be careful because it's also a regular expression syntax. I think you will need to:

  1. Split by whitespace.
  2. See if a token matches """\{.*\}""".r (the braces must be escaped, since they are regex metacharacters) and perform the substitution if there is a match.
  3. Join back together on space.

Example pattern (to make sure we still like it!):

{VerbPhrase} {NounPhrase} <pos='JJ'>
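The three steps listed above could be sketched as follows. This assumes the `<typeStart='x'> <typeCont='x'>*` expansion discussed earlier in the thread; `BraceSubstitutionSketch` and `expandBySplitting` are hypothetical names, and the type name is restricted to word characters rather than `.*` to keep the match tight.

```scala
object BraceSubstitutionSketch {
  // Braces are regex metacharacters in Java/Scala patterns, so they are escaped.
  // Assumption: type names consist of word characters only.
  private val TypeToken = """\{(\w+)\}""".r

  def expandBySplitting(pattern: String): String =
    pattern.trim
      .split("\\s+")                      // 1. split by whitespace
      .map {
        // 2. a Regex extractor matches the whole piece, so only bare {Name} pieces are rewritten
        case TypeToken(name) => s"<typeStart='$name'> <typeCont='$name'>*"
        case other           => other
      }
      .mkString(" ")                      // 3. join back together on a single space
}
```

With these assumptions, `expandBySplitting("{VerbPhrase} {NounPhrase} <pos='JJ'>")` returns `<typeStart='VerbPhrase'> <typeCont='VerbPhrase'>* <typeStart='NounPhrase'> <typeCont='NounPhrase'>* <pos='JJ'>`. The approach only holds up when no token expression itself contains whitespace.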

Fyi backticks put your text in code mode. Argh, but they don't work when sent as an email!

schmmd commented 11 years ago

Argh... this was a horrible suggestion. You can't split by whitespace!
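The thread does not say why splitting fails; presumably it is because a single token expression can contain spaces, as in `<typeStart='x' & typeEnd='x'>` from earlier in this discussion. The sketch below shows that breakage and one possible alternative: substituting over the whole pattern string. All names are hypothetical, and requiring a leading letter inside the braces is an assumed heuristic to avoid touching regex quantifiers such as `{2}`.

```scala
import scala.util.matching.Regex

object NoSplitSketch {
  // Assumption: a type reference is {Name} where Name starts with a letter,
  // so numeric quantifiers such as {2} or {1,3} are left alone.
  private val TypeRef: Regex = """\{([A-Za-z]\w*)\}""".r

  // Naive whitespace split: cuts through token expressions that contain spaces.
  def naiveTokens(pattern: String): Seq[String] =
    pattern.trim.split("\\s+").toSeq

  // Alternative: replace type references in place, without splitting at all.
  def expand(pattern: String): String =
    TypeRef.replaceAllIn(pattern, m => {
      val name = m.group(1)
      Regex.quoteReplacement(s"<typeStart='$name'> <typeCont='$name'>*")
    })
}
```

Here `naiveTokens("<typeStart='x' & typeEnd='x'> {NounPhrase}")` produces four pieces instead of two, while `expand` leaves the first token expression intact and rewrites only `{NounPhrase}`. One limitation of this sketch: a brace-delimited name inside a quoted string value would also be rewritten, so a real implementation would likely need to be aware of quoting.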