Unicode character breaks shortcode replacement

michaloo commented 9 years ago

Hey,

I'm testing following assertion based on README example:

        $handlers = new HandlerContainer();
        $handlers->add('sample', function(ShortcodeInterface $s) {
           return (new JsonSerializer())->serialize($s);
           });
        $processor = new Processor(new RegexParser(), $handlers);

        $text = 'x [sample arg=val]cnt[/sample] y';
        $result = 'x {"name":"sample","args":{"arg":"val"},"content":"cnt"} y';
        assert($result === $processor->process($text));

it works fine unless I put some unicode characters inside shortcode.

In case of polish character ń it returns 'x {"name":"sample","parameters":{"arg":"val"},"content":"\u0144","bbCode":null}] y' (notice closing ]). In case of 4 polish characters żółć it returns 'x {"name":"sample","parameters":{"arg":"val"},"content":"\u017c\u00f3\u0142\u0107","bbCode":null}ple] y' (notice ple] after shortcode replacement).

I'm using PHP Version 5.5.9-1ubuntu4.11 with Multibyte Support enabled.

thunderer commented 9 years ago

@michaloo Thanks for the report. I identified the problem, it will be resolved shortly.

michaloo commented 9 years ago

I've changed mb_strlen to strlen in the following line: https://github.com/thunderer/Shortcode/blob/master/src/Processor/Processor.php#L68

and it solved this specific case. I didn't tested other cases.

thunderer commented 9 years ago

@michaloo that specific case is too simple, the problem is in line 75. Function substr_replace() has no multibyte equivalent and while the shortcodes are parsed correctly the replacement mechanism leaves unwanted characters because it can't count the length correctly.

thunderer commented 9 years ago

@michaloo I added some tests and fix in 14c8b6d , could you please test if dev-master works for you so that I can release 0.5.1?

michaloo commented 9 years ago

I've just tested it and it works as expected, it solves the problem.

Big thanks for fast reaction. Dzięki :)

thunderer commented 9 years ago

@michaloo No problem, I wonder why anyone using this library didn't spot that earlier. :) Probably none of them inserted anything multibyte in the shortcode tags. Many thanks for reporting the issue and making this library better. I just released v0.5.1. I'm closing the issue but we can continue discussion below.

Jeśli masz jakieś inne pomysły na ulepszenie tej biblioteki to daj znać, chętnie o nich porozmawiam. BTW Może będziesz na tegorocznym PHPConie? :)

michaloo commented 9 years ago

I don't have a big experience in working with multibyte strings, but when I started to add more shortcodes/tags I got similar, buggy results. I forked you repo and added a test case which caused the problem in our application: https://github.com/stowarzyszenia-wiosna/Shortcode/blob/master/tests/ProcessorTest.php#L67

When I run phpunit all tests passed, so library worked correctly, but I couldn't make it working in the application (Apache, Code Igniter, MySQL). I switched to plain strlen instead of mb_strlen and it started to work in application and in phpunit run - everything is commited to fork repo.

I'm using same php configuration to run application and phpunit test suite. I will test it further to diagnose exact source of issue in our application and I will let you know what I found.

W tym momencie Shortcode zaspokaja wszystkie nasze potrzeby, wdrażam je do posiadanych przez nas CMSów, żeby umożliwić stylowanie i budowanie struktury treści (jak w WP), z czasem pewnie będziemy mocniej ją wykorzystywać i może coś się nam nasunie :) W każdym razie bardzo chętnie będę w kontakcie. Nie, nie wybieram się na PHPCona - to ten weekend? Ty będziesz?

thunderer commented 9 years ago

@michaloo Looking at the diff of changes in your fork I see that you removed my fix (that line with mb_substr() you commented out in Processor class), can you restore it and then test the result? As I said previously, that call to mb_strlen() calculates the length of string to replace, you can't change it to strlen() because that length would be incorrect for multibyte string. I have one idea though, are you sure that those strings you process are encoded as UTF-8? Because mb_*() functions expect the content to be UTF-8 by default and if not please try mb_convert_encoding() or even iconv() before passing the string to Processor::process(). If you have strings encoded as ISO-8859-2 or Latin-1 then they are not technically multi-byte as the special characters are still in the ASCII range. The other things about your environment seem unrelated (Apache, CI, MySQL).

Cieszę się, że mój kod się Wam przydaje. Tak jak napisałem wyżej, sprawdź kodowanie znaków, bo być może to jest problem. Jeśli nie uda się inaczej to możemy się umówić na jakiegoś Skype'a i spróbować to razem rozwiązać. Tak, jestem aktualnie na PHPConie. :)

Update

I tested your test case with forcefully converted string to ISO-8859-2 and with minor changes (regex without u modifier) it also works correctly (save for the unreadable characters). Can you share the details of your environment like where the strings come from (database encoding, connection settings and so on), are they somehow processed before passing to Shortcode, maybe you have an isolated script in which Shortcode fails to work properly? Anything like that could help me investigate the issue.

rhukster commented 8 years ago

As outlined in the original PR comment: https://github.com/thunderer/Shortcode/pull/26#issuecomment-173325927, I was testing the Shortcode against some documentation I have for Grav CMS where I have created a Tab shortcode. The first issue I noticed was with RegexParser causing corruption throughout the page.

I tracked down the issue to a stray UTF-8 character that was pasted into the content at some point. I never noticed this as it was rendering fine in editor and browser with PHP 5.6 and 7.0, but under PHP 5.5 where UTF-8 is not the standard encoding, I noticed it displayed strangely.

I have put together a simple test scenario (https://github.com/rhukster/shortcode-test) that shows this issue. If you switch the parser between RegularParser (which works fine) and the RegexParser or Wordpress Parser you will see the corruption caused by that single UTF-8 charater.

NOTE: You can't test with the full document as it causes the RegularParser to die (see issue #27)

thunderer commented 8 years ago

@rhukster I managed to track down the bug, you can see the diff https://github.com/thunderer/Shortcode/compare/69b260050bf28c3307205d768d0a5434517fae03...549030899cb1851eeab3131ea2a5cec691359432 :

RegularParser didn't have UTF-8 encoding qualifier in mb_strpos(),
while debugging RegexParser I found out that PREG_OFFSET_CAPTURE does not return offset correctly, it just reports number of bytes from the start of the string. I need to calculate the offset by calculating number of characters using mb_strlen() in part of the string cut by standard substr() up to the offset reported from regex. Yeah, I know. Just for the reference, see this StackOverflow answer and this PHP bug ,
I wrote a comment with suggested fix for WordpressParser but I'm not sure if it should be implemented. This parser was meant to be copied verbatim from WordPress with all its advantages and shortcomings to provide the same experience as in WordPress. I need to investigate if there is any difference and act accordingly.

I'm tempted to consider this issue resolved as the build is green and I verified results with shortcode-test repository, but please give me your insight before I tag a new release.

Also @michaloo, could you please test your inputs once again?

rhukster commented 8 years ago

I will test your fix today also.

I guess on the Wordpress parser it depends on what you intend for that parser? Obviously people with WordPress are going to use the parser in WordPress. You can't just take a WordPress shortcode plugin and just run it on some other platform simply because the shortcode is handled the same. All the events and functions are different.

My personal opinion is that it enables people to 'port' their WordPress plugins to other platforms, but frankly i'm trying to get away from all the bugs and crappy code in WordPress, so i don't want it's bugs to follow me! Please fix the Wordpress parser too because I want to maintain WordPress syntax in my plugins (to attract WordPress devs and users to Grav CMS), but I don't want it to be buggy.

thunderer commented 8 years ago

@rhukster Have you tested the fix on dev-master? Can I tag a new release?

As for the WordpressParser, the whole idea for it was to "not fix" the issues WordPress has, because if you want fully correct regex parser, there is... RegexParser. :) Of course the copied code was adjusted to conform to what I wanted to achieve but you're right, I probably can't fully imitate its environment with all the quirks and gotchas. I guess then that I'll only fix the issues in which the parser behavior is different than in WordPress itself. I'd probably add a note in the parser's class and in the documentation to use this parser only if you want the WordPress' behavior.

rhukster commented 8 years ago

WordPress and Regex parsers are now working as expected for me! Thanks!

thunderer commented 8 years ago

Thanks for your response - as for the fix for WordpressParser I came to the conclusion that offsets are not even part of how WordPress handles shortcodes so I also fixed the offset problem there. I contacted @michaloo and I hope he replies here and confirms the proper behavior so that we can close the issue for good.

michaloo commented 8 years ago

I've tested v0.5.3 and it solves the problem we experienced in our systems.

thunderer / Shortcode

Unicode character breaks shortcode replacement #25