strange chars ��

desb42 commented 4 years ago

The page en.wiktionary.org/wiki/齧り付く looks like this: [image: strange1]. On investigation, Module:ja seems to be the culprit (it's looking like another regex issue).

This module contains the function:

function export.rm_spaces_hyphens(f)
    local text = type(f) == 'table' and f.args[1] or f
    text = str_gsub(text, '.', { [' '] = '', ['-'] = '', ['.'] = '', ['\''] = '' })
    text = str_gsub(text, '&nbsp;', '')
    return text
end

It is the first str_gsub that is causing the problem.

Looking at StringLib.java (in the LuaJ sources), it seems that UTF-8 chars are not processed properly.

desb42 commented 4 years ago

When I say regex, I did not realise that there were (at least two) expression parsers (Lua and PHP).

string.gsub is the Lua version of 'regex'. As it happens, it looks like it is a byte-oriented parser (it does not know, or care, about UTF-8).

ustring.gsub, on the other hand, does.
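To make the byte-versus-character distinction concrete, here is a minimal Java sketch. It is only an illustration and is not XOWA or LuaJ code: the class and method names are made up, and it simply counts how many units a single-character pattern like '.' would visit under each interpretation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not taken from XOWA or LuaJ.
class ByteVsCodePointDemo {
    // A byte-oriented matcher sees one unit per UTF-8 byte.
    static int byteUnits(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // A UTF-8 aware matcher sees one unit per Unicode code point.
    static int codePointUnits(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // The page title from this issue, written with escapes so the
        // source file encoding does not matter.
        String title = "\u9F67\u308A\u4ED8\u304F";
        System.out.println(byteUnits(title));      // 12 (3 bytes per character)
        System.out.println(codePointUnits(title)); // 4
    }
}
```

With a byte-oriented '.', each of the four characters in the title is visited as three separate bytes, none of which is valid UTF-8 on its own.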

gnosygnu commented 4 years ago

> When I say regex, I did not realise that there were (at least two) expression parsers (lua and PHP)
>
> string.gsub is a lua version of 'regex' - as it happens, it looks like it is a byte oriented parser (it does not know, or care about utf-8)
>
> ustring.gsub on the other hand does

Yup, good point :+1:

Just to add a detail to it.

For the Lua parser, XOWA uses LuaJ. This is a fairly close port of the Lua C code, but in Java. Note that LuaJ (and Lua) know nothing about UTF-8.
For the PHP parser, XOWA tried to use Java, but abandoned it because it has no support for "balanced" regexes (see #413). As such, it "hacks" it by using the LuaJ parser with additional classes to support UTF-8.

Not sure what's going on above, but I figure it's probably related to the LuaJ-as-PHP regex engine.

desb42 commented 4 years ago

From what I can see in Match_state.java line 112:

repl = LuaValue.valueOf(src.Substring(src_pos, str_end)); // keep original text

there is an attempt to build a UTF-8 string and then to deserialise it again. However, there is only one byte involved. In this case it should effectively be a single byte transferred (that is, not touched at all).

gnosygnu commented 4 years ago

Thanks: you pretty much nailed the issue.

Still investigating though. The problem is this line in https://en.wiktionary.org/wiki/Module:ja: text = str_gsub(text, '.', { [' '] = '', ['-'] = '', ['.'] = '', ['\''] = '' })

luaj_xowa is expecting str_gsub (which is an alias for string.gsub) to work on byteArrays. Instead, I think that it needs to work on charArrays. Let me try changing LuaString's implementation methods of Char_source and see if that works.

gnosygnu commented 4 years ago

Fixed with commit above. The issue is actually a different version of
https://github.com/gnosygnu/xowa/issues/504#issuecomment-513633514

Basically, this is a weird edge case when trying to process multi-byte character strings (¢):

Lua processes multi-byte character strings one byte at a time, even though each single byte is invalid on its own.
- This usually doesn't work, like when doing gsub("¢", ".", "a"), which will output aa
- This sometimes works though, like when doing gsub("¢", ".", someLuaTable), which does output "¢"

Though it does work with the table replacement, there will be an intermediate moment when ¢ is rendered into Strings of {-62} and {-94}. Unfortunately, Java will take these invalid single-byte strings and convert each of them to the � token above (or new byte[] {-17, -65, -67}).
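The intermediate step described above can be reproduced in plain Java. This is a sketch under assumptions: the class and method names are hypothetical and it does not use any LuaJ or XOWA types. It only shows the documented behavior of the String(byte[], Charset) constructor, which replaces malformed input with U+FFFD.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; not taken from XOWA or LuaJ.
class ReplacementCharDemo {
    // Decode raw bytes as UTF-8 into a String, then encode back to bytes.
    // Malformed input is replaced by U+FFFD during decoding.
    static byte[] roundTripUtf8(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] cent = {(byte) 0xC2, (byte) 0xA2}; // "¢" as UTF-8: {-62, -94}
        byte[] first = {(byte) 0xC2};             // lead byte on its own
        byte[] second = {(byte) 0xA2};            // continuation byte on its own

        System.out.println(Arrays.toString(roundTripUtf8(cent)));   // [-62, -94]
        System.out.println(Arrays.toString(roundTripUtf8(first)));  // [-17, -65, -67]
        System.out.println(Arrays.toString(roundTripUtf8(second))); // [-17, -65, -67]
    }
}
```

The whole two-byte sequence survives the round trip, but each byte handled separately comes back as the three-byte encoding of the replacement character, so the original ¢ cannot be reassembled afterwards.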

Hopefully the above fix captures all instances of the edge case.

At any rate, thanks again for narrowing it down to line 112 above. Saved a lot of work!

desb42 commented 4 years ago

Just realised that there is still a small issue. Again looking at en.wiktionary.org/wiki/齧り付く, there is a second Etymology (further down the page), and it seems there are still some 'funny' characters.

desb42 commented 4 years ago

I think this effect is related. The page commons.wikimedia.org/wiki/Peru gives this: [image: peru1]. In the depths is a call to Module:MakeSortKey, where things are happening at lines 763-766:

local snd = lower(toNFKD(label))
    :gsub('[\192-\223].', substLower)
    :gsub('[\224-\239]..', substLower)
    :gsub('[\240-\247]...', substLower)

I suspect there is an issue of switching between bytes and utf-8
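For reference, the three patterns above pick out multi-byte UTF-8 sequences by the numeric range of their first byte. A rough Java sketch of the same classification follows. The class and method names are hypothetical, and the ranges mirror the Lua patterns rather than strict UTF-8 validity (for example, 192 and 193 never appear in well-formed UTF-8).

```java
// Hypothetical illustration; not taken from XOWA, LuaJ, or Module:MakeSortKey.
class Utf8LeadByteDemo {
    // Returns the total byte length of the sequence that starts with b,
    // using the same ranges as the Lua patterns; 0 means b is not a lead byte.
    static int sequenceLength(byte b) {
        int u = b & 0xFF;                    // Java bytes are signed; widen to 0..255
        if (u < 128) return 1;               // ASCII, not touched by the patterns
        if (u >= 192 && u <= 223) return 2;  // '[\192-\223].'
        if (u >= 224 && u <= 239) return 3;  // '[\224-\239]..'
        if (u >= 240 && u <= 247) return 4;  // '[\240-\247]...'
        return 0;                            // continuation byte (128-191) or 248-255
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0xC2)); // 2, first byte of "¢"
        System.out.println(sequenceLength((byte) 0xE9)); // 3, first byte of "齧"
        System.out.println(sequenceLength((byte) 0xA2)); // 0, continuation byte
    }
}
```

These patterns only behave as intended when the matcher works on raw bytes. If the engine treats the subject as characters, a range such as \192-\223 no longer lines up with UTF-8 lead bytes.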

gnosygnu commented 4 years ago

Thanks for the extra pointer. Let me take a look at that in the next day or two.

FWIW, I tried a few nights ago with en.wiktionary.org/wiki/齧り付く and noticed some odd results

Basically, the unknown char is dependent on earlier statements, which seems like some sort of caching issue.

Specifically, if I reduce the above page to this wikitext...

{{ja-kanjitab|かぶ|つ|yomi=kun}}
{{ja-go-ku|齧り付|かじりつ}}
{{ja-verb|かぶりつく|type=1|tr=intrans}}
{{ja-go-ku|齧り付|かぶりつ}}

... I get the unknownChar in the last statement.

However, if I remove any of the preceding 3 lines, the last statement no longer produces unknownChars.

I spent an hour or so trying to debug this, and didn't get anywhere. I'll try again with the Peru example above.

Thanks!

desb42 commented 4 years ago

That particular type of behaviour is what led to my suggestion in #750 (no caching). With that in place, the second etymology comes out OK.

But that does not explain Peru.

gnosygnu commented 4 years ago

Ah. Didn't realize that this was directly related to caching.

I tried to debug it further and made no progress. However, I did a quick sync of the current XOWA LuaString with the latest LUAJ LuaString file and it worked. I've attached it below if you're curious.

Note that it doesn't fix #750 though. That said, I'll probably formalize the sync sometime tomorrow and make sure it doesn't break any existing tests.

LuaString.zip

gnosygnu commented 4 years ago

Resolved with the commit above. I should have put in a unit test, but I couldn't find a simple case. Sync'ing with LuaJ's current LuaString is the way to go though.

Thanks again for re-opening! Hopefully it won't happen again :)

gnosygnu / xowa

strange chars �� #735
