Closed by alerque 7 years ago
This appears to be a problem with the $CHAPTER variable, not $TO like I was expecting. An entry for "bö." seems like the likely culprit in this case.
Yes, the \w probably isn't adequate for your needs here since it won't match the ö: https://github.com/alerque/Bible-Passage-Reference-Parser/blob/master/src/tr/regexps.coffee#L23
As a workaround, you could potentially use the pre_book regexp on line 37 of that file instead of \w.
Ah, I see what you mean. On the other hand, why are we using (?!\w) here (a negative lookahead for a word character) rather than (?=\W) (a positive lookahead for a non-word character) or, better yet, just \b to match on a word boundary? Wouldn't that avoid this scenario?
In JavaScript, I believe \w, \W, and \b are all ASCII-based and would produce equivalent results, wouldn't they? /b\b/ would still match the "b" in "bö".
> I believe \w, \W, and \b are all ASCII-based and would produce equivalent results, wouldn't they?
Empirical testing says yes, they would.
I still haven't figured out what can go there though. Even that giant class as suggested isn't doing the trick.
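For anyone who wants to repeat the empirical check, here is a minimal sketch. It only assumes a plain JavaScript engine with no regex flags set; the variable names are made up for illustration and are not taken from the parser.

```javascript
// "bö", written with an escape so the ö is the single code point U+00F6.
const word = "b\u00f6";

// Without flags, \w is [A-Za-z0-9_], so "ö" counts as a non-word character.
const oIsWordChar = /\w/.test("\u00f6");

// All three variants therefore accept the "b" that starts "bö".
const negLookahead = /b(?!\w)/.test(word); // nothing ASCII-wordy follows
const posLookahead = /b(?=\W)/.test(word); // "ö" is seen as \W
const boundary = /b\b/.test(word);         // boundary between "b" and "ö"

console.log({ oIsWordChar, negLookahead, posLookahead, boundary });
```

On this input the three forms agree, which is why swapping one for another does not change the outcome here.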
What about:
| [b-e] (?! [\wö] )
To solve the immediate problem...
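A rough sketch of what that suggestion does, written as plain JavaScript regexps rather than the extended CoffeeScript form used in the repo. The third pattern is an assumption on my part and not something proposed in this thread: it uses Unicode property escapes (`\p{L}`, which needs the `u` flag, ES2018+) so that every letter is covered, not just ö.

```javascript
const turkishWord = "b\u00f6l\u00fcmlerinde"; // "bölümlerinde"
const realFragment = "b, 19";                  // a fragment letter followed by punctuation

// ASCII-only lookahead: "ö" is not \w, so the leading "b" is accepted.
const asciiOnly = /^[b-e](?!\w)/;

// The lookahead from the suggestion above, with ö added explicitly.
const withO = /^[b-e](?![\w\u00f6])/;

// Assumed generalisation: reject any Unicode letter or digit after the fragment.
const unicodeAware = /^[b-e](?![\p{L}\p{N}_])/u;

const results = {
  asciiOnly: [asciiOnly.test(turkishWord), asciiOnly.test(realFragment)],
  withO: [withO.test(turkishWord), withO.test(realFragment)],
  unicodeAware: [unicodeAware.test(turkishWord), unicodeAware.test(realFragment)],
};
console.log(results);
```

Only the first pattern accepts the "b" of the Turkish word; all three still accept the genuine fragment. Whether the `u` flag is usable depends on the environments the parser has to support.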
It looks like that expression isn't even the issue. It's matching b[öo] with no lookahead at all earlier in the expression. Even removing the $AB bit entirely doesn't help.
OK, then here's the next step: https://github.com/alerque/Bible-Passage-Reference-Parser/blob/master/src/tr/grammar.pegjs#L135
Adding ö to the ![a-z] on that line should prevent the grammar from picking it up.
I'm processing the entire text of books (in Markdown format) through this parser to normalize reference formats. I just ran into a couple cases like this:
This is turning up a match for Eph.5.18-Eph.6.9,Col.3.16-Col.4.1 as expected, but the matched string is off. If you extract the original string between the indices for the second match, the match that comes back actually includes the first letter of the next word:
This is trouble because I then replace that bit of the string. What I get back from the formatter after giving it the OSIS is a properly formatted reference:
All is well except that the next word, bölümlerinde, has just become ölümlerinde.
How do I keep the ability to match fragment references in the language settings yet not have it be so greedy? Shouldn't the $AB [b-e] var be matched with a word boundary restriction at the end?
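To make the symptom concrete, here is a small sketch of the replace-by-indices step. Everything in it is hypothetical: the sample sentence, the indices, the formatted output string, and the helper names `replaceSpan` and `endsInsideWord` are invented for illustration and are not part of the parser's API. The guard at the end is just one caller-side check I could imagine using, assuming Unicode property escapes (`\p{L}` with the `u` flag) are available.

```javascript
// Hypothetical input: a reference followed by an ordinary Turkish word.
const text = "Kol. 3:16-4:1 b\u00f6l\u00fcmlerinde";

// Splice a replacement over the half-open range [start, end).
function replaceSpan(source, start, end, replacement) {
  return source.slice(0, start) + replacement + source.slice(end);
}

// True when the range ends between two letters, i.e. in the middle of a word.
function endsInsideWord(source, end) {
  const letter = /\p{L}/u;
  return end > 0 && end < source.length &&
    letter.test(source[end - 1]) && letter.test(source[end]);
}

// Assumed indices: the greedy match swallows the "b"; the clean one stops before the space.
const greedyEnd = text.indexOf("b") + 1;
const cleanEnd = text.indexOf(" b");

const formatted = "Col 3:16-4:1"; // stand-in for the formatter's output
const broken = replaceSpan(text, 0, greedyEnd, formatted);
const intact = replaceSpan(text, 0, cleanEnd, formatted);

console.log(broken, "|", intact);
console.log(endsInsideWord(text, greedyEnd), endsInsideWord(text, cleanEnd));
```

With the greedy indices the following word loses its first letter, which is the behaviour described above. The guard only detects the situation; the real fix still belongs in the regexps or the grammar.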