plone / plone.i18n

Text normalization logic and language, country, cctld data.
8 stars 11 forks source link

Tests for Japanese normalizer are unstable #25

Closed mauritsvanrees closed 5 years ago

mauritsvanrees commented 5 years ago

On Jenkins, sometimes the tests pass and sometimes they fail. See for example this Plone 5.2 Python 3.6 job:

File ".../plone.i18n-4.0.2-py3.6.egg/plone/i18n/normalizer/ja.py",
 line 61, in plone.i18n.normalizer.ja.Normalizer
Failed example:
    len(normalized)
Expected:
    6
Got:
    5

Or another example, where apparently the above test goes fine, but the almost same test right below it goes wrong, like in this Plone 5.2 Python 3.7 job:

File ".../plone.i18n-4.0.2-py3.7.egg/plone/i18n/normalizer/ja.py",
 line 66, in plone.i18n.normalizer.ja.Normalizer
Failed example:
    len(normalized)
Expected:
    8
Got:
    7

It might be something Python3 specific, as I don't currently see recent failures in Python 2.7 jobs. Maybe the Python 2.7 were lucky so far this year. It might also depend on which machine is running the jobs. Locally Python 3.6 is fine for me.

Hard to debug, as I don't know Japanese, and there is no indication what the expected normalized version would look like. I have a fix ready to make the tests more verbose in case of an error.

uwaiszaki commented 5 years ago

@mauritsvanrees . Tests still failing. https://jenkins.plone.org/job/pull-request-5.2-3.7/230/testReport/junit/plone.i18n.normalizer/ja/Normalizer/?auto_refresh=false

mauritsvanrees commented 5 years ago

This is good. Now we finally see what the wrong normalized word is in Japanese. It is ‘2xm5gfy’ which is 7 characters and we expect 8. This does not tell me anything. At this point we would need someone who knows Japanese to help debug this. Do you know Japanese? Maybe @terapyon can help?

terapyon commented 5 years ago

I will soon check it.

ale-rt commented 5 years ago

The problem is here: https://github.com/plone/plone.i18n/blob/0995a66d4025357f2ab8e035141687fac7061ab7/plone/i18n/normalizer/ja.py#L13-L20

The hash changes at each test run, giving random results. In some cases this will lead to a shorter normalized word.

mauritsvanrees commented 5 years ago

It is probably fine to make the tests less strict, accepting 5 or 6 when we currently expect 6. In some calls we explicitly pass a length, not sure if it matters when we get one more or less. The tests may be way too explicit.

ale-rt commented 5 years ago

Actually the parameter is called max_length https://github.com/plone/plone.i18n/blob/0995a66d4025357f2ab8e035141687fac7061ab7/plone/i18n/normalizer/ja.py#L74 That makes your suggestion fit the code. What I am worried about is that on Python 2 the hash is reproducible, on Python 3 it is not:

[ale@emily ~]$ python2 -c "print(hash('a'))"
12416037344
[ale@emily ~]$ python2 -c "print(hash('a'))"
12416037344
[ale@emily ~]$ python3 -c "print(hash('a'))"
2835895958318421574
[ale@emily ~]$ python3 -c "print(hash('a'))"
7109848119625085005

unless you disable randomization:

[ale@emily ~]$ PYTHONHASHSEED=0 python3 -c "print(hash('a'))"
-7583489610679606711
[ale@emily ~]$ PYTHONHASHSEED=0 python3 -c "print(hash('a'))"
-7583489610679606711

PR coming with the change you suggested.

terapyon commented 5 years ago

I understood. This test case is rare pattern. The _gethashed should be return less than MAX_LENGTH ASCII characters. Almost case will be get just MAX_LENGTH ASCII characters.

Python 3 hash function was changed, but this implements no problem for Japanese user.

I recommend we will change test character, it means the bellow.

Now

text = u"テストページ"

New

text = u"公開テストページ"
terapyon commented 5 years ago

@mauritsvanrees @ale-rt Do you think it?

ale-rt commented 5 years ago

Thanks @terapyon for your comments, it is really valuable! I merged https://github.com/plone/plone.i18n/pull/27 and I will make a new PR to change the tested string and ask you to review.

gforcada commented 5 years ago

Let's see if the changes from @ale-rt, thanks to @terapyon input, fixes it! A truly worldwide effort :wink: