How can I stop partial word matches for verse fragment references?

alerque commented 7 years ago

I'm processing the entire text of books (in Markdown format) through this parser to normalize reference formats. I just ran into a couple cases like this:

Pavlus, Efes. 5:18-6:9’un benzeri olan Kol. 3:16-4:1 bölümlerinde “Mesih’in sözü bütün zenginliğiyle imanlıların içinde yaşarsa”, bu kişilerin aynı şekilde mezmurlar, ruhsal ezgiler ve ilahiler söyleyeceğini belirtmektedir.

This is turning up a match for Eph.5.18-Eph.6.9,Col.3.16-Col.4.1 as expected, but the matched string is off. If you extract the original string between the indices for the second match, what comes back actually includes the first letter of the next word:

Kol. 3:16-4:1 b

This causes trouble when I go to replace that span of the string. What I get back from the formatter after giving it the OSIS is a properly formatted reference:

 Koloseliler 3:16—4:1

All is well except that the next word bölümlerinde has just become ölümlerinde.
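To make the failure concrete, here is a minimal sketch of the index-based replacement. It does not call the parser: the match indices are derived from the matched string reported above, and `safeEnd` is a hypothetical guard (not part of the parser) that backs off a trailing single letter when the match stops in the middle of a word.

```javascript
// Shortened version of the sentence above; the indices are computed from the
// matched string reported in this issue rather than taken from the parser.
const text = "olan Kol. 3:16-4:1 bölümlerinde";
const matched = "Kol. 3:16-4:1 b";
const start = text.indexOf(matched);
const end = start + matched.length;
const formatted = "Koloseliler 3:16—4:1";

// Naive splice: the "b" of "bölümlerinde" (and the space before it) is lost.
const naive = text.slice(0, start) + formatted + text.slice(end);

// Hypothetical guard: if the match ends in whitespace plus one letter and the
// very next character is also a letter, shrink the match to exclude that tail.
function safeEnd(source, from, to) {
  const tail = source.slice(from, to).match(/\s+\p{L}$/u);
  const next = source.slice(to, to + 1);
  return tail && /^\p{L}$/u.test(next) ? to - tail[0].length : to;
}

const guardedEnd = safeEnd(text, start, end);
const guarded = text.slice(0, start) + formatted + text.slice(guardedEnd);

console.log(naive);   // olan Koloseliler 3:16—4:1ölümlerinde
console.log(guarded); // olan Koloseliler 3:16—4:1 bölümlerinde
```

A guard like this only hides the symptom on the caller's side; the overly greedy match itself still has to be fixed in the language regexps or grammar.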

How do I keep the ability to match fragment references in the language settings without the match being so greedy? Shouldn't the $AB [b-e] variable be matched with a word boundary restriction at the end?

alerque commented 7 years ago

This appears to be a problem with the $CHAPTER variable, not $TO like I was expecting. An entry for bö. seems like the likely culprit in this case.

openbibleinfo commented 7 years ago

Yes, the \w probably isn't adequate for your needs here since it won't match the ö: https://github.com/alerque/Bible-Passage-Reference-Parser/blob/master/src/tr/regexps.coffee#L23

As a workaround, you could potentially use the pre_book regexp on line 37 of that file instead of \w.

alerque commented 7 years ago

Ahh, I see what you mean. On the other hand, why are we using (?!\w) here (a negative lookahead for a word character) rather than (?=\W) (a positive lookahead for a non-word character) or, better yet, just \b to match on a word boundary? Wouldn't that avoid this scenario?

openbibleinfo commented 7 years ago

In JavaScript, I believe \w, \W, and \b are all ASCII-based and would produce equivalent results, wouldn't they? /b\b/ would still match the "b" in "bö".

alerque commented 7 years ago

> I believe \w, \W, and \b are all ASCII-based and would produce equivalent results, wouldn't they?

Empirical testing says yes, they would.
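A quick check along these lines can be run in Node or a browser console. This is only a sketch: the three patterns are simplified stand-ins for the fragment rule, not the parser's actual regexps.

```javascript
// Three ways of saying "the fragment letter must end a word", all ASCII-based.
const patterns = {
  negativeLookahead: /\d b(?!\w)/,
  positiveLookahead: /\d b(?=\W)/,
  wordBoundary: /\d b\b/,
};

// Return the first match of each pattern against the sample, or null.
function firstMatches(sample) {
  const out = {};
  for (const [name, re] of Object.entries(patterns)) {
    const m = sample.match(re);
    out[name] = m ? m[0] : null;
  }
  return out;
}

// "ö" is not an ASCII word character, so every variant accepts the "b".
const turkish = firstMatches("Kol. 3:16-4:1 bölümlerinde");
// With a plain ASCII "o" after the "b", every variant rejects it.
const ascii = firstMatches("Kol. 3:16-4:1 bolumlerinde");

console.log(turkish); // all three values are "1 b"
console.log(ascii);   // all three values are null
```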

I still haven't figured out what can go there though. Even that giant class as suggested isn't doing the trick.

openbibleinfo commented 7 years ago

What about:

| [b-e] (?! [\wö] )

To solve the immediate problem...
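As a sketch of what that lookahead does in isolation, the snippet below compares the suggested `[\wö]` class with a broader variant. The broader variant is an assumption on my part, not something the parser uses: it relies on Unicode property escapes (`\p{L}`, `\p{N}`) and the `u` flag, which need an ES2018-capable engine. The `\d\s*` prefix is likewise a simplified stand-in for the rest of the fragment rule.

```javascript
// Suggested fix: also refuse a fragment letter that is followed by "ö".
const narrow = /\d\s*[b-e](?![\wö])/;
// Assumed alternative: refuse any Unicode letter, digit, or underscore.
const broad = /\d\s*[b-e](?![\p{L}\p{N}_])/u;

// Return the matched text, or null when the pattern does not match.
function hit(re, sample) {
  const m = sample.match(re);
  return m ? m[0] : null;
}

const samples = [
  "Kol. 3:16-4:1 bölümlerinde", // the case from this issue
  "Kol. 3:16-4:1 eğitim",       // another non-ASCII letter after [b-e]
  "Kol. 3:16b ve 4:1",          // a genuine fragment reference
];

for (const s of samples) {
  console.log(s, "| narrow:", hit(narrow, s), "| broad:", hit(broad, s));
}
```

The narrow class rejects the "b" before "ö" but still accepts the "e" before "ğ", so it only covers the immediate case. Both patterns keep matching the genuine fragment "6b".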

alerque commented 7 years ago

It looks like that expression isn't even the issue. It's matching b[öo] with no look-ahead at all earlier in the expression. Even removing the $AB bit entirely doesn't help.

openbibleinfo commented 7 years ago

OK, then here's the next step: https://github.com/alerque/Bible-Passage-Reference-Parser/blob/master/src/tr/grammar.pegjs#L135

Adding ö to the ![a-z] on that line should prevent the grammar from picking it up.

openbibleinfo / Bible-Passage-Reference-Parser, issue #32