Custom Rule (ex Lemmatization)

mjackson / citrus

Parsing Expressions for Ruby

http://mjackson.github.io/citrus

Custom Rule (ex Lemmatization) #43

Closed dbose closed 10 years ago

dbose commented 11 years ago

First of all, awesome work! I think citrus is very close to pyparsing.

Any idea how I can implement a custom parsing Rule, let's say for lemmatization?

-Cheers Deb

mjackson commented 10 years ago

@dbose Thanks! I've never personally done any work with lemmatization. Do you think PEGs would be a good fit for it?

P.S. Closing this since it isn't really an issue.

dbose commented 10 years ago

No, PEGs are definitely not a good fit for NLP processes like that.

I think I phrased it incorrectly. What I meant was: is there a way to match only the lemmatized form of a word? I handled it in the following way (it's a hack that only takes care of adjective forms):

    rule pre_modifier_token
      modifier ('d' | 'ed' | 'ped')*
    end

For example, in pyparsing I can hook a custom function into the PEG (https://github.com/JoshRosen/cmps140_creative_cooking_assistant/blob/master/nlu/ingredient_line_grammar.py; LemmatizedWord is a custom function).

Another way would be to build lemmatization and other information-extraction (IE) capabilities on top of the PEG. But it would be excellent to be able to hook custom functions into the stream.

dbose commented 10 years ago

By the way, I'm using citrus to extract data from recipes and it's looking great so far. As the domain vocabulary of cooking is rather limited, an ML-based extractor would have been overkill. Thanks again for your work.

I would love to contribute to this (bringing it closer to pyparsing et al.) and raise a pull request with my thoughts on what I meant by custom functions.

Cheers Deb

mjackson commented 10 years ago

Ah, thanks for the explanation.

I think you'll probably want to look into subclassing Citrus::Nonterminal to achieve what you're describing. A non-terminal can apply custom logic that governs the matching behavior of other rules. In your case, it sounds like you could create a non-terminal that checks for lemmatization and matches (or doesn't) based on that.
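To make the idea concrete, here is a rough, standalone Ruby sketch of a rule that wraps another rule and only succeeds when the matched word reduces to a known lemma. It does not use the Citrus API at all: `WordRule`, `LemmatizedRule`, and `NaiveLemmatizer` are hypothetical names invented for illustration, and the suffix stripping is a toy stand-in for a real lemmatizer. How this would map onto `Citrus::Nonterminal` is left open here.

```ruby
# Standalone sketch. None of these classes are part of Citrus; they only
# mimic the idea of a rule that wraps another rule and adds custom logic.

# Hypothetical, deliberately naive lemmatizer: strips a few English suffixes.
module NaiveLemmatizer
  SUFFIXES = %w[ped ed d].freeze # longest suffix first

  # Returns the word itself plus every form obtained by stripping one suffix.
  def self.candidates(word)
    stripped = SUFFIXES
               .select { |s| word.end_with?(s) && word.length > s.length }
               .map { |s| word[0...-s.length] }
    [word] + stripped
  end
end

# Inner rule: matches a run of lowercase letters starting exactly at offset.
class WordRule
  def match(input, offset = 0)
    m = /\G[a-z]+/.match(input, offset)
    m && m[0]
  end
end

# Wrapper rule: delegates to the inner rule, then accepts the match only if
# one of its candidate base forms is in the list of known lemmas.
class LemmatizedRule
  def initialize(inner, lemmas)
    @inner = inner
    @lemmas = lemmas
  end

  # Returns [matched_text, lemma] on success, nil otherwise.
  def match(input, offset = 0)
    text = @inner.match(input, offset)
    return nil unless text

    lemma = NaiveLemmatizer.candidates(text).find { |c| @lemmas.include?(c) }
    lemma && [text, lemma]
  end
end

rule = LemmatizedRule.new(WordRule.new, %w[chop dice peel])
rule.match("chopped onions") # => ["chopped", "chop"]
rule.match("diced tomato")   # => ["diced", "dice"]
rule.match("fresh basil")    # => nil
```

Compared with listing suffixes directly in the grammar, this keeps the lemma check in ordinary Ruby, so the toy lemmatizer could later be swapped for a proper one without touching the rule that consumes input.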

In any case, I'd definitely be interested in seeing a PR that implements this.