Often I have found that the differences between engines on the same regexp boil down to small variations in the handling of greediness (in particular, whether operators are greedy by default), zero-width assertions, anchoring (e.g. whether it is on by default), case sensitivity, and other meta-properties of the execution model that aren't well specified. Obviously I don't know what the causes are in this case yet, but that's where I would start looking for examples.
Steps to reproduce
$ go test ./internal/code-generator/... -run Test_GeneratorTestSuite -testify.m TestGenerationFiles
ok gopkg.in/src-d/enry.v1/internal/code-generator/generator 35.537s
$ go test -tags oniguruma ./internal/code-generator/... -run Test_GeneratorTestSuite -testify.m TestGenerationFiles
FAIL gopkg.in/src-d/enry.v1/internal/code-generator/generator 34.583s
Both should produce the same results if the tokenisation of all Linguist samples yields the same tokens. Going to add extra tests to verify just that.
Tracked the problem down to a difference in handling what seems to be a latin1-encoded file.
While tokenising thå filling, encoded in latin1 and read as UTF-8:
RE2 gets
th
filling
�
Oniguruma gets
th
illing
�
f
The flex-based tokeniser from #218 gets
th
filling
Of the three, only Flex really seems to be doing anything reasonable here. RE2 is close—but the order of the tokens is weird. I have no idea what Oniguruma is doing there, but it seems obviously broken in at least two ways.
Yup. And if the content is decoded from latin1 and encoded to UTF-8 with charmap.ISO8859_1.NewDecoder().Bytes(content), the result is the expected one:
RE2
th
filling
å
oniguruma
th
filling
å
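For reference, a minimal sketch of that decode step, using the same charmap.ISO8859_1.NewDecoder().Bytes call from golang.org/x/text (the wrapping main and the sample bytes are just for illustration):

```go
package main

import (
	"fmt"

	"golang.org/x/text/encoding/charmap"
)

func main() {
	// "thå filling" as latin1 bytes: 0xE5 is 'å' in ISO 8859-1,
	// but not a valid byte sequence when read as UTF-8.
	content := []byte("th\xe5 filling")

	// Re-encode to valid UTF-8 before tokenising.
	utf8Content, err := charmap.ISO8859_1.NewDecoder().Bytes(content)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", utf8Content) // "thå filling", now valid UTF-8
}
```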
That would not bother me much if Linguist had not added such a case to their samples 2 months ago; that is what the content classifier is trained on, and it is something we keep "gold standard" results for as part of our test fixtures.
I know that Linguist uses an ICU-based character encoding detector, https://github.com/brianmario/charlock_holmes, but I am not sure yet if it is part of the tokenisation.
Yeah, I think we should either normalize the encoding, or find a way to treat the Unicode replacement character as part of the token, e.g. xxx�yyy ⇒ xxx�yyy, or use it as a separator and discard it, e.g. xxx�yyy ⇒ xxx, �, yyy ⇒ xxx, yyy. It seems like Linguist maybe does the latter.
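Just a sketch of what that second option (split on the undecodable byte and drop it) could look like in Go; this is illustrative only, not enry's actual code:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

func main() {
	// latin1 bytes read as a Go string: 0xE5 decodes as utf8.RuneError (�).
	s := "th\xe5 filling"

	// Treat whitespace and any rune that failed to decode as separators
	// and discard them. Note this would also split on a literal U+FFFD.
	tokens := strings.FieldsFunc(s, func(r rune) bool {
		return unicode.IsSpace(r) || r == utf8.RuneError
	})
	fmt.Printf("%q\n", tokens) // ["th" "filling"]
}
```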
True. And Linguist with the flex-based tokeniser does not have this issue, so no need for encoding detection there. Thank you for the suggestions, let me think a bit more about that.
After digging deeper, it seems that the offending tokenization rule is extractAndReplaceRegular(content), which applies [0-9A-Za-z_\.@#\/\*]+. extractRemainders is then called on its output and does bytes.Fields(content) on the remainder (" \xe5 " in one case, "\xe5 f" in the other), which produces the extra � and f tokens seen above.
For the record: doing the equivalent operation in Ruby, where regexps are backed by the Oniguruma lib, results in ArgumentError: invalid byte sequence in UTF-8
$ irb
"th\xdd filling".scan(/[0-9A-Za-z_\.@#\/\*]+/)
Digging a little bit deeper into oniguruma's C API, using the awesome examples, it starts to look like this may be a bug in the go-oniguruma Regexp.findAll() implementation 🤔
Even in C, oniguruma consistently produces strange results in UTF-8 mode for input bytes that are not valid UTF-8, like the ones above. Seems like a possible bug upstream.
As none of the regexps used for tokenization in enry use Unicode character classes, all RE2-based matches are effectively conducted in ASCII-only mode, while go-oniguruma has UTF-8 hardcoded.
For our use case the fix would be to force Oniguruma to also use ASCII mode; that indeed produces identical results, even for bytes that are not valid UTF-8.
I will submit a patch to the cgo part of https://github.com/src-d/go-oniguruma to add an option to override the hardcoded UTF-8, expose it in Go as MustCompileWithEncoding (similar to MustCompileWithOption) and, as soon as it is merged, move enry's regex/oniguruma.go to use it in https://github.com/src-d/enry/pull/227.
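A rough sketch of what the Oniguruma-backed side of enry's regex package could then look like; the MustCompileWithEncoding signature and the ASCII encoding constant are assumptions here (not the final go-oniguruma API), and the package layout is only approximated:

```go
// +build oniguruma

package regex

import rubex "github.com/src-d/go-oniguruma"

// EnryRegexp is the engine-agnostic alias the rest of enry compiles against.
type EnryRegexp = *rubex.Regexp

// MustCompile forces ASCII encoding instead of the previously hardcoded
// UTF-8, so bytes that are not valid UTF-8 are handled the same way RE2
// handles them. MustCompileWithEncoding and rubex.ASCII are assumed names.
func MustCompile(str string) EnryRegexp {
	return rubex.MustCompileWithEncoding(str, rubex.ASCII)
}
```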
Meanwhile, only a better test case was added there: https://github.com/src-d/enry/pull/227/commits/a724a2f8416e6edd6ab1e429fcf5ef452dae6c4d
Fixed by #227
Right now enry uses a regex-based tokeniser (until #218 lands, at least).
We have 2 drop-in replacement regex engines: Go's default, RE2, and Ruby's default, Oniguruma.
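For context, the drop-in replacement works via a Go build tag; a simplified sketch of the default RE2-backed side (names approximate the enry regex package and are not guaranteed to match it exactly):

```go
// +build !oniguruma

package regex

import "regexp"

// EnryRegexp aliases the engine-specific compiled-regexp type, so callers
// stay agnostic about which engine was selected at build time; building
// with -tags oniguruma swaps this file for the Oniguruma-backed one.
type EnryRegexp = *regexp.Regexp

func MustCompile(str string) EnryRegexp {
	return regexp.MustCompile(str)
}
```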
Recent improvements done as part of the Go module migration #219 surfaced a new issue: it seems the tokeniser produces slightly different results depending on which regex engine is used :/
More specifically, the token frequencies built from the Linguist samples differ; the high-level code-generator test catches this by comparing against a fixture (pre-generated with RE2) and fails on the Oniguruma-generated profiles like this: https://github.com/src-d/enry/pull/219#issuecomment-482525632
We need to find the exact reason and, depending on it, decide whether we want to support 2 versions of the fixtures or change something so there is no difference in the output.
This also potentially affects #194