Out-of-vocabulary handling

slavaGanzin commented 4 years ago

Hello.

I'm working on NER for free-grammar texts (news articles), and it works great until an out-of-vocabulary word occurs in the target text. I understand that PEG parsing is designed for strict languages and that this is a bit of a stretch, but it is what it is.

For example, having this meta-model:

Model:
  date*=Date
  oov+=OOV
;

Date: date=/\d{1,2}[\/\.]\d{1,2}[\/\.]\d{2}(\d{2})?/;
OOV: /\S+/;

parses: 01.01.2020 out of vocabulary 02.02.2020

as: {'date': ['01.01.2020'], 'oov': ['out', 'of', 'vocabulary', '02.02.2020']}

while the result I expected would be: {'date': ['01.01.2020', '02.02.2020'], 'oov': ['out', 'of', 'vocabulary']}, because in my understanding the parser should try specific rules first and then fall back to broader ones. Instead, it clearly just repeats itself.
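The observed result follows from the shape of the Model rule: it is a sequence, so all Date matches are collected first, and once OOV starts matching, /\S+/ accepts every remaining token, including the second date. The snippet below is only a plain-Python simulation of that behaviour using the standard `re` module, not textX's actual implementation; the helper names `match_repeated` and `parse_sequence` are made up for illustration.

```python
import re

# Same patterns as in the grammar above ([/.] is equivalent to [\/\.]).
DATE = re.compile(r"\d{1,2}[/.]\d{1,2}[/.]\d{2}(\d{2})?")
OOV = re.compile(r"\S+")
WS = re.compile(r"\s*")


def match_repeated(pattern, text, pos):
    """Greedily match `pattern` as often as possible from `pos`, skipping whitespace."""
    found = []
    while True:
        start = WS.match(text, pos).end()
        m = pattern.match(text, start)
        if not m:
            return found, pos
        found.append(m.group(0))
        pos = m.end()


def parse_sequence(text):
    """Mimic `date*=Date oov+=OOV`: first all dates, then all OOV tokens."""
    dates, pos = match_repeated(DATE, text, 0)
    oov, pos = match_repeated(OOV, text, pos)
    return {"date": dates, "oov": oov}


result = parse_sequence("01.01.2020 out of vocabulary 02.02.2020")
print(result)
```

Once the date repetition stops at "out", nothing in the sequence ever goes back to the Date rule, so the trailing date can only be consumed as OOV.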

I've read the documentation and spent a lot of time playing with skipws, ws, and eol, with no acceptable result. I'm really desperate, so I'll ask you to set aside your rigorous mathematical mind and help with practical advice.

goto40 commented 4 years ago

This is tricky, but maybe this answer helps. textX is greedy, so placing the more specific rule (Date) first is a good idea.

Then you can use the fact that multiple assignments to the same attribute produce a list (https://textx.github.io/textX/stable/grammar/#multiple-assignment-to-the-same-attribute) and use the following:

import textx
g=r'''
Model: (date=Date|oov=OOV)*;
Date: date=/\d{1,2}[\/\.]\d{1,2}[\/\.]\d{2}(\d{2})?/;
OOV: /\S+/;
'''
mm=textx.metamodel_from_str(g)
m=mm.model_from_str(r'''
01.01.2020 out of vocabulary 02.02.2020
''')

print(m.date)
print(m.oov)

However, this will also allow an empty model or a model without oov...
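To make the effect of the ordered choice `(date=Date|oov=OOV)*` concrete, here is a rough plain-Python simulation using only the standard `re` module. It is a sketch of the idea (try Date first at each position, fall back to OOV), not how textX is implemented, and the function name `parse_choice` is hypothetical.

```python
import re

# Same patterns as in the grammar above ([/.] is equivalent to [\/\.]).
DATE = re.compile(r"\d{1,2}[/.]\d{1,2}[/.]\d{2}(\d{2})?")
OOV = re.compile(r"\S+")
WS = re.compile(r"\s*")


def parse_choice(text):
    """Mimic `(date=Date|oov=OOV)*`: at each position try Date, then OOV."""
    result = {"date": [], "oov": []}
    pos = 0
    while True:
        pos = WS.match(text, pos).end()   # skip whitespace between tokens
        key, m = "date", DATE.match(text, pos)
        if not m:                         # the specific rule failed, fall back
            key, m = "oov", OOV.match(text, pos)
        if not m:                         # nothing left to match
            return result
        result[key].append(m.group(0))
        pos = m.end()


print(parse_choice("01.01.2020 out of vocabulary 02.02.2020"))
print(parse_choice(""))  # an empty input is accepted and yields empty lists
```

Because the choice is re-evaluated for every token, the second date gets a chance to match Date before OOV is tried, and an empty input simply produces two empty lists.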

igordejanovic commented 4 years ago

A related issue is #119

slavaGanzin commented 4 years ago

> However, this will also allow an empty model or a model without oov...

I'm parsing free-grammar text, so it's expected that some parts will be missing.

@goto40 Thanks for your help. Really appreciate it!

textX / textX

Out-of-vocabulary handling #236