knowitall / taggers

Easily identify and label sentence intervals using various taggers.

Handle recursive types #1

Closed schmmd closed 11 years ago

schmmd commented 11 years ago

It would be great if taggers could be recursive. For example, we would first tag "Number", and then when we write a date tagger we could refer to the tokens covered by a "Number" type.

schmmd commented 11 years ago

This would be an addition to the PatternTagger. The OpenRegex library must operate over a sequence of tokens. Presently, this is a Seq[Lemmatized[ChunkedToken]]. To use type information, we would need this to operate over a Seq[TypedToken].

case class TypedToken(token: Lemmatized[ChunkedToken], types: Set[Type])

The types collection would need to contain all types that overlap that token position. It might be additionally helpful to know which types start and end on this particular token, so an index might need to be stored as well (Type has a token interval so we can compute this if we have the token index).
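A minimal sketch of how the stored index could be used to work out which types start or end on a token. The `Token` and `Type` classes here are simplified stand-ins, not the real `Lemmatized[ChunkedToken]` and `Type`, and the half-open interval (`start` inclusive, `end` exclusive) and the `typesStarting`/`typesEnding` names are assumptions made for illustration.

```scala
object TypedTokenSketch {
  // Stand-in for Lemmatized[ChunkedToken]; only the fields used here.
  final case class Token(string: String, postag: String)

  // Stand-in for Type. Assumes a half-open token interval [start, end).
  final case class Type(descriptor: String, start: Int, end: Int)

  // TypedToken as proposed above, plus the token's index in the sentence.
  final case class TypedToken(token: Token, index: Int, types: Set[Type]) {
    // Types whose interval begins on this token.
    def typesStarting: Set[Type] = types.filter(_.start == index)

    // Types whose interval ends on this token (its last covered position).
    def typesEnding: Set[Type] = types.filter(_.end - 1 == index)
  }
}
```

Storing the index on the token keeps these checks local to the token, at the cost of one extra field per object.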

We would then need an override for findTags that takes a Seq[Lemmatized[ChunkedToken]] and a collection of the types found in the sentence so far. From this we would need to build a Seq[TypedToken] and rework the OpenRegex wrapper to use this additional information. This will cause a small slowdown because we need to create a new object for each token, but it shouldn't be major. We might want to think about the cost of running multiple pattern taggers on a single sentence, however, or at least keep it in mind.
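The step of building the typed sequence could look roughly like the sketch below. It uses simplified stand-in classes rather than the real `Lemmatized[ChunkedToken]` and `Type`, assumes half-open token intervals, and the `typedTokens` helper name is hypothetical.

```scala
object TypedSentenceSketch {
  // Stand-in for Lemmatized[ChunkedToken]; only the fields used here.
  final case class Token(string: String, postag: String)

  // Stand-in for Type. Assumes a half-open token interval [start, end).
  final case class Type(descriptor: String, start: Int, end: Int) {
    def covers(index: Int): Boolean = start <= index && index < end
  }

  final case class TypedToken(token: Token, types: Set[Type])

  // Pair each token with every previously found type that overlaps its position.
  // This allocates one new object per token, which is the slowdown noted above.
  def typedTokens(tokens: Seq[Token], types: Seq[Type]): Seq[TypedToken] =
    tokens.zipWithIndex.map { case (token, index) =>
      TypedToken(token, types.filter(_.covers(index)).toSet)
    }
}
```

This scans all types for every token, which is fine for the handful of types in a typical sentence; an interval index would only matter if the type count grew large.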

In the regular expression language, we will add additional aspects of the token to match on. For example, presently we have string and postag, but we will add type=Person, which is true if any type with the descriptor Person overlaps the token. We might also want to be able to specify typeStart=Person and typeEnd=Person.
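To make the intended semantics concrete, here is a rough sketch of how a single `field=value` expression might be turned into a predicate over a typed token. It is not the OpenRegex API: the `compile` function and the flattened `TypedToken` are hypothetical, and the half-open type interval is an assumption.

```scala
object TypeExpressionSketch {
  // Stand-in for Type. Assumes a half-open token interval [start, end).
  final case class Type(descriptor: String, start: Int, end: Int)

  // Flattened stand-in for a typed token, carrying its index in the sentence.
  final case class TypedToken(string: String, postag: String, index: Int, types: Set[Type])

  // Turn one "field=value" expression into a predicate over a token.
  def compile(expr: String): TypedToken => Boolean = expr.split("=", 2) match {
    case Array("string", v) => t => t.string == v
    case Array("postag", v) => t => t.postag == v
    // True if any type with this descriptor overlaps the token.
    case Array("type", v) => t => t.types.exists(_.descriptor == v)
    // True only on the first token of such a type.
    case Array("typeStart", v) =>
      t => t.types.exists(ty => ty.descriptor == v && ty.start == t.index)
    // True only on the last token of such a type.
    case Array("typeEnd", v) =>
      t => t.types.exists(ty => ty.descriptor == v && ty.end - 1 == t.index)
    case _ => throw new IllegalArgumentException(s"Unsupported expression: $expr")
  }
}
```

For a sentence such as "John Smith ran" with a Person type over the first two tokens, `type=Person` would hold on both "John" and "Smith", `typeStart=Person` only on "John", and `typeEnd=Person` only on "Smith".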

schmmd commented 11 years ago

John, can you close this out?