Closed desb42 closed 4 years ago
When I said regex, I did not realise that there were (at least) two expression parsers (Lua and PHP).
string.gsub is the Lua version of 'regex'. As it happens, it looks like a byte-oriented parser (it does not know or care about UTF-8).
ustring.gsub, on the other hand, does.
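To make the byte-oriented versus UTF-8 aware distinction concrete, here is a minimal Java sketch. The class and method names are made up for illustration and are not part of luaj or XOWA; it only counts how many units each kind of matcher would step over.

```java
import java.nio.charset.StandardCharsets;

final class ByteVsCodePointDemo {
    // Length as a byte-oriented matcher sees it: one unit per UTF-8 byte.
    static int byteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Length as a UTF-8 aware matcher sees it: one unit per code point.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String cent = "\u00A2"; // the cent sign, two bytes in UTF-8 (0xC2 0xA2)
        System.out.println(byteLength(cent));      // 2
        System.out.println(codePointLength(cent)); // 1
    }
}
```

So a pattern such as '.' matches twice on the cent sign under byte semantics and once under code point semantics.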
Yup, good point :+1:
Just to add a detail to it.
Not sure what's going on above, but figure it's probably related to the LUAJ-AS-PHP regex engine
From what I can see in Match_state.java line 112
repl = LuaValue.valueOf(src.Substring(src_pos, str_end)); // keep original text
there is an attempt to build a UTF-8 string and then to deserialise it again. However, there is only one byte involved. In this case it should effectively be a single byte transferred (that is, not touched at all).
Thanks: you pretty much nailed the issue.
Still investigating though. The problem is this line in https://en.wiktionary.org/wiki/Module:ja:
text = str_gsub(text, '.', { [' '] = '', ['-'] = '', ['.'] = '', ['\''] = '' })
luaj_xowa is expecting str_gsub (which is an alias for string.gsub) to work on byteArrays. Instead, I think that it needs to work on charArrays. Let me try changing LuaString's implementation methods of Char_source and see if that works.
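As a rough illustration of what that str_gsub call is expected to do under byte semantics, here is a minimal Java sketch. It is not the luaj_xowa code: ByteGsubSketch, gsubEachByte and moduleJaTable are hypothetical names. It relies on the standard Lua rule that when the replacement is a table and the lookup yields nil, the original match is kept.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

final class ByteGsubSketch {
    // Hypothetical helper: mimics string.gsub(text, '.', table) on raw bytes.
    // Each byte is looked up in the table; with no entry, the original byte
    // is copied through unchanged.
    static byte[] gsubEachByte(byte[] src, Map<Byte, byte[]> table) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte b : src) {
            byte[] repl = table.get(b);
            if (repl == null) {
                out.write(b); // no entry: keep the raw byte, never decode it
            } else {
                out.write(repl, 0, repl.length);
            }
        }
        return out.toByteArray();
    }

    // The replacement table from the Module:ja line: four ASCII characters map to "".
    static Map<Byte, byte[]> moduleJaTable() {
        Map<Byte, byte[]> t = new HashMap<>();
        for (char c : new char[] {' ', '-', '.', '\''}) {
            t.put((byte) c, new byte[0]);
        }
        return t;
    }
}
```

Because unmatched bytes are copied as bytes, a multi-byte character passes through intact even though the pattern visits it one byte at a time.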
Fixed with commit above. The issue is actually a different version of https://github.com/gnosygnu/xowa/issues/504#issuecomment-513633514
Basically, this is a weird edge case when trying to process multi-byte character strings (¢):
gsub("¢", ".", "a")
which will output aa
gsub("¢", ".", someLuaTable)
which does output "¢"
¢ is rendered into Strings of {-62} and {-94}. Unfortunately, Java will take these invalid single-byte strings and convert them to the � token above (or new byte[] {-17, -65, -67}).
Hopefully the above fix captures all instances of the edge case.
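The single-byte conversion described above can be reproduced in plain Java. This is a minimal sketch with a made-up class name, not code from luaj or XOWA: decoding a lone byte of a two-byte sequence as UTF-8 yields the replacement character U+FFFD, which re-encodes as the three bytes EF BF BD.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LoneByteRoundTrip {
    // Decode the bytes as UTF-8 and encode the resulting String again.
    static byte[] roundTripUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The complete sequence survives the round trip.
        System.out.println(Arrays.toString(roundTripUtf8(new byte[] {-62, -94}))); // [-62, -94]
        // Each half on its own is malformed and becomes U+FFFD.
        System.out.println(Arrays.toString(roundTripUtf8(new byte[] {-62})));      // [-17, -65, -67]
        System.out.println(Arrays.toString(roundTripUtf8(new byte[] {-94})));      // [-17, -65, -67]
    }
}
```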
At any rate, thanks again for narrowing it down to line 112 above. Saved a lot of work!
Just realised that there is still a small issue
Again looking at en.wiktionary.org/wiki/齧り付く
there is a second Etymology (further down the page)
And it seems there are still some 'funny' characters
I think this effect is related
Page commons.wikimedia.org/wiki/Peru
gives
In the depths is a call to Module:MakeSortKey, where things are happening.
Lines 763-766:
local snd = lower(toNFKD(label))
:gsub('[\192-\223].', substLower)
:gsub('[\224-\239]..', substLower)
:gsub('[\240-\247]...', substLower)
I suspect there is an issue of switching between bytes and utf-8
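For reference, the three byte ranges in those patterns are the UTF-8 lead bytes for 2-, 3- and 4-byte sequences, and the trailing dots stand for the continuation bytes. Each pattern therefore only works if '.' matches exactly one byte. A minimal Java sketch of the same classification (Utf8LeadByte and sequenceLength are hypothetical names, not from the module or luaj):

```java
final class Utf8LeadByte {
    // Number of bytes in the UTF-8 sequence introduced by this lead byte,
    // using the same ranges as the three gsub patterns; 1 for anything else.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF; // Java bytes are signed; view the value as 0..255
        if (b >= 192 && b <= 223) return 2; // [\192-\223].
        if (b >= 224 && b <= 239) return 3; // [\224-\239]..
        if (b >= 240 && b <= 247) return 4; // [\240-\247]...
        return 1;
    }
}
```

If the engine treats '.' as a whole character instead of one byte, the pattern consumes more than one sequence and the substitution receives the wrong slice.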
Thanks for the extra pointer. Let me take a look at that in the next day or two.
FWIW, I tried a few nights ago with en.wiktionary.org/wiki/齧り付く and noticed some odd results
Basically, the unknown char is dependent on earlier statements which seems like some sort of caching issue.
Specifically, if I reduce the above page to this wikitext...
{{ja-kanjitab|かぶ|つ|yomi=kun}}
{{ja-go-ku|齧り付|かじりつ}}
{{ja-verb|かぶりつく|type=1|tr=intrans}}
{{ja-go-ku|齧り付|かぶりつ}}
... I get the unknownChar in the last statement.
However, if I remove any of the preceding 3 lines, the last statement no longer produces unknownChars.
I spent an hour or so trying to debug this, and didn't get anywhere. I'll try again with the Peru example above.
Thanks!
That particular type of behaviour is what led to my suggestion in #750 (no caching). With that in place, the second etymology comes out OK.
But that does not explain Peru.
Ah. Didn't realize that this was directly related to caching.
I tried to debug it further and made no progress. However, I did a quick sync of the current XOWA LuaString with the latest LUAJ LuaString file and it worked. I've attached it below if you're curious.
Note that it doesn't fix #750 though. That said, I'll probably formalize the sync sometime tomorrow and make sure it doesn't break any existing tests.
Resolved with the commit above. I should have put in a unit test, but I couldn't find a simple case. Sync'ing with LuaJ's current LuaString is the way to go though.
Thanks again for re-opening! Hopefully it won't happen again :)
The page en.wiktionary.org/wiki/齧り付く looks like
On investigation, Module:ja seems to be the culprit (it's looking like another regex issue).
This module contains the function
It is the first str_gsub that is causing the problem.
Looking at StringLib.java (in luaj sources), it seems that UTF-8 chars are not processed properly.