seboettg / citeproc-php

Full-featured CSL 1.0.1 processor for PHP
MIT License
73 stars 38 forks source link

Locale-specific string functions should not be used to build classnames #93

Closed asmecher closed 3 years ago

asmecher commented 3 years ago

Please follow the general troubleshooting steps first:

Bug reports:

Factory.inc.php uses ucfirst to assemble a constraint class name for inclusion. However, it uses the ucfirst function to do so, and upper/lower case behaviour in the current locale may not match the code's expectations.

For example, when using the tr_TR locale, ucfirst('i') yields "i", leading to:

PHP message: PHP Fatal error:  Uncaught Seboettg\CiteProc\Exception\ClassNotFoundException: Class "Seboettg\CiteProc\Constraint\isNumeric" could not be found. in .../citeproc-php/src/Constraint/Factory.php:41
Stack trace:
#0 .../citeproc-php/src/Rendering/Choose/ChooseIf.php(67): Seboettg\\CiteProc\\Constraint\\Factory::createConstraint()
#1 .../citeproc-php/src/Rendering/Choose/Choose.php(53): Seboettg\\CiteProc\\Rendering\\Choose\\ChooseIf->__construct()
#2 .../citeproc-php/src/Util/Factory.php(64): Seboettg\\CiteProc\\Rendering\\Choose\\Choose->__construct()
...

(This is a common Turkish locale gotcha; see e.g. https://github.com/smarty-php/smarty/issues/155 and the proposed fix at https://github.com/smarty-php/smarty/pull/586. I have not checked elsewhere to see if other locale-sensitive functions are used in this way in this library.)

Originally reported by an OJS end user: https://forum.pkp.sfu.ca/t/articles-in-the-second-locale-language-do-not-show/62967

Used CSL stylesheet:

(Any)

Used CSL metadata

(Any)

seboettg commented 3 years ago

Hi @asmecher! Thank you for that bug report. It's really strange which strange issues can occur. Thanks also for your code snipped from smarty-php, that helped me a lot.

Unfortunately is very hard to test this issue, since I cannot just change my system locale... In order to test if my fix really works, please checkout the branch issue-93 and run the unit tests on the system with Turkish locales.

asmecher commented 3 years ago

Thanks, @seboettg! Yes, this is a nasty one and pops up all over the place (see also https://bugs.php.net/bug.php?id=18556). I feel sorry for Turkish developers who run into this stuff while trying to learn!

I tried to run the tests, but I'm getting a lot of test failures (on the master branch also) of the form:

TypeError: Argument 1 passed to Seboettg\Collection\ArrayList::__construct() must be of the type array, string given, called in /home/asmecher/git/citeproc-php/src/Rendering/Name/Names.php on line 148
TypeError: Argument 1 passed to Seboettg\Collection\ArrayList::__construct() must be of the type array, object given, called in /home/asmecher/git/citeproc-php/src/CiteProc.php on line 173

Maybe this is fixed elsewhere? (PHPUnit 8.5.2, PHP 7.4.3.)

I suspect you can get the testing to run under the Turkish locale by following these steps (Linux):

  1. Enable and compile the Turkish locale. Edit /etc/locale.gen and add/uncomment the line: tr_TR.UTF-8 UTF-8
  2. Re-generate your locales: sudo locale-gen
  3. Verify that ucfirst behaves differently under Turkish: php -r "setlocale(LC_ALL, 'tr_TR.UTF-8'); echo ucfirst('i');" (This should yield a lower case i instead of a capital I like you expect)
  4. Force tests to run under the Turkish locale. Edit tests/bootstrap.php and add: setlocale(LC_ALL, 'tr_TR.UTF-8');
  5. Run the tests!

I suspect you could add a new test that incorporates the setlocale call, but checks the system ucfirst under tr_TR.UTF-8 to see whether the weird behaviour exists. If it doesn't, then probably the tr_TR locale isn't enabled in the OS (steps 1 and 2) and the test should be skipped.

seboettg commented 3 years ago

Thank you @asmecher I use MacOS and unfortunately I have no other computer, except a raspberry pi with a minimal setup and without PHP. I tried to change the locale on my system, but it doesn't work:

$ php -r "setlocale(LC_ALL, 'tr_TR.UTF-8'); echo ucfirst('i');"
I%                                  

But I think the unit tests might work when you run composer update before. After that you can just run composer test.

...
Time: 22.03 seconds, Memory: 28.00 MB
OK (199 tests, 429 assertions)
asmecher commented 3 years ago

Thanks, that got it. The master branch tests fine, but adding the setlocale to bootstrap.php breaks it as expected. Running the tests on issue-93 still shows some errors, however:

PHPUnit test output from issue-93 with setlocale to tr_TR.UTF-8

seboettg commented 3 years ago

Ah, ok! You must run composer install or composer dump-autoload after switching the branch to rebuild the autoload files.

asmecher commented 3 years ago

Gotcha, but the log file linked above still shows some locale-specific assumptions that Turkish breaks. See e.g. src/Util/NumberHelper.php, where strtoupper is used on content containing roman numerals, and thus i supplied as input won't get converted to I:

1) Seboettg\CiteProc\Rendering\NumberTest::testRomanInputSingleNumber
A non-numeric value encountered

/home/asmecher/git/citeproc-php/src/Rendering/Number.php:146
/home/asmecher/git/citeproc-php/src/Rendering/Number.php:95
/home/asmecher/git/citeproc-php/tests/src/Rendering/NumberTest.php:130

2) Seboettg\CiteProc\Rendering\NumberTest::testRomanInputRangeNumber
A non-numeric value encountered

/home/asmecher/git/citeproc-php/src/Rendering/Number.php:146
/home/asmecher/git/citeproc-php/src/Rendering/Number.php:95
/home/asmecher/git/citeproc-php/tests/src/Rendering/NumberTest.php:137

3) Seboettg\CiteProc\Util\NumberHelperTest::testRoman2Dec
Undefined index: i

/home/asmecher/git/citeproc-php/src/Util/NumberHelper.php:101
/home/asmecher/git/citeproc-php/tests/src/Util/NumberHelperTest.php:35
seboettg commented 3 years ago

In NumberHelper strlen and strtoupper was used. I replaced these functions with mb_strlen and mb_strtoupper. I hope this fix will work.

asmecher commented 3 years ago

Very close!

There were 2 errors:

1) Seboettg\CiteProc\Rendering\NumberTest::testRomanInputSingleNumber
A non-numeric value encountered

/home/asmecher/git/citeproc-php/src/Rendering/Number.php:146
/home/asmecher/git/citeproc-php/src/Rendering/Number.php:95
/home/asmecher/git/citeproc-php/tests/src/Rendering/NumberTest.php:130

2) Seboettg\CiteProc\Rendering\NumberTest::testRomanInputRangeNumber
A non-numeric value encountered

/home/asmecher/git/citeproc-php/src/Rendering/Number.php:146
/home/asmecher/git/citeproc-php/src/Rendering/Number.php:95
/home/asmecher/git/citeproc-php/tests/src/Rendering/NumberTest.php:137

--

There was 1 failure:

1) Seboettg\CiteProc\Util\NumberHelperTest::testIsRomanNumber
Failed asserting that false is true.

/home/asmecher/git/citeproc-php/tests/src/Util/NumberHelperTest.php:41
seboettg commented 3 years ago

Okay... once again, please! 👍

asmecher commented 3 years ago

Down to 1!

There was 1 error:

1) Seboettg\CiteProc\Rendering\NumberTest::testRomanInputRangeNumber
A non-numeric value encountered

/home/asmecher/git/citeproc-php/src/Rendering/Number.php:146
/home/asmecher/git/citeproc-php/src/Rendering/Number.php:95
/home/asmecher/git/citeproc-php/tests/src/Rendering/NumberTest.php:137
seboettg commented 3 years ago

Oh no! I'm so sorry to have to bother you with these tests.

seboettg commented 3 years ago

Ok, I have now set up my raspi so that I can do these tests by myself.

I figured out, that there seems to be a problem with case insensitive regular expressions...

/^([ivxlcdm]+\.*)\s*([*\–\-&+,;])\s*([ivxlcdm]+\.?)$/i

For example this pattern does not work with Turkish locale.

seboettg commented 3 years ago

I guess all issues are solved now. @asmecher can you confirm that?

asmecher commented 3 years ago

Tests pass!

OK (199 tests, 429 assertions)
seboettg commented 3 years ago

Ok, version 2.2.2 is released and contains these fixes. I will close this ticket. If you think additional doings are necessary, feel free to reopen this bug report.

asmecher commented 3 years ago

Thanks for your excellent work as always, @seboettg!