Closed michaloo closed 8 years ago
@michaloo Thanks for the report. I identified the problem, it will be resolved shortly.
I've changed mb_strlen
to strlen
in the following line:
https://github.com/thunderer/Shortcode/blob/master/src/Processor/Processor.php#L68
and it solved this specific case. I didn't tested other cases.
@michaloo that specific case is too simple, the problem is in line 75. Function substr_replace()
has no multibyte equivalent and while the shortcodes are parsed correctly the replacement mechanism leaves unwanted characters because it can't count the length correctly.
@michaloo I added some tests and fix in 14c8b6d , could you please test if dev-master
works for you so that I can release 0.5.1
?
I've just tested it and it works as expected, it solves the problem.
Big thanks for fast reaction. Dzięki :)
@michaloo No problem, I wonder why anyone using this library didn't spot that earlier. :) Probably none of them inserted anything multibyte in the shortcode tags. Many thanks for reporting the issue and making this library better. I just released v0.5.1
. I'm closing the issue but we can continue discussion below.
Jeśli masz jakieś inne pomysły na ulepszenie tej biblioteki to daj znać, chętnie o nich porozmawiam. BTW Może będziesz na tegorocznym PHPConie? :)
I don't have a big experience in working with multibyte strings, but when I started to add more shortcodes/tags I got similar, buggy results. I forked you repo and added a test case which caused the problem in our application: https://github.com/stowarzyszenia-wiosna/Shortcode/blob/master/tests/ProcessorTest.php#L67
When I run phpunit all tests passed, so library worked correctly, but I couldn't make it working in the application (Apache, Code Igniter, MySQL). I switched to plain strlen
instead of mb_strlen
and it started to work in application and in phpunit run - everything is commited to fork repo.
I'm using same php configuration to run application and phpunit test suite. I will test it further to diagnose exact source of issue in our application and I will let you know what I found.
W tym momencie Shortcode zaspokaja wszystkie nasze potrzeby, wdrażam je do posiadanych przez nas CMSów, żeby umożliwić stylowanie i budowanie struktury treści (jak w WP), z czasem pewnie będziemy mocniej ją wykorzystywać i może coś się nam nasunie :) W każdym razie bardzo chętnie będę w kontakcie. Nie, nie wybieram się na PHPCona - to ten weekend? Ty będziesz?
@michaloo Looking at the diff of changes in your fork I see that you removed my fix (that line with mb_substr()
you commented out in Processor
class), can you restore it and then test the result? As I said previously, that call to mb_strlen()
calculates the length of string to replace, you can't change it to strlen()
because that length would be incorrect for multibyte string. I have one idea though, are you sure that those strings you process are encoded as UTF-8? Because mb_*()
functions expect the content to be UTF-8 by default and if not please try mb_convert_encoding()
or even iconv()
before passing the string to Processor::process()
. If you have strings encoded as ISO-8859-2 or Latin-1 then they are not technically multi-byte as the special characters are still in the ASCII range. The other things about your environment seem unrelated (Apache, CI, MySQL).
Cieszę się, że mój kod się Wam przydaje. Tak jak napisałem wyżej, sprawdź kodowanie znaków, bo być może to jest problem. Jeśli nie uda się inaczej to możemy się umówić na jakiegoś Skype'a i spróbować to razem rozwiązać. Tak, jestem aktualnie na PHPConie. :)
Update
I tested your test case with forcefully converted string to ISO-8859-2 and with minor changes (regex without u
modifier) it also works correctly (save for the unreadable characters). Can you share the details of your environment like where the strings come from (database encoding, connection settings and so on), are they somehow processed before passing to Shortcode, maybe you have an isolated script in which Shortcode fails to work properly? Anything like that could help me investigate the issue.
As outlined in the original PR comment: https://github.com/thunderer/Shortcode/pull/26#issuecomment-173325927, I was testing the Shortcode
against some documentation I have for Grav CMS where I have created a Tab shortcode. The first issue I noticed was with RegexParser
causing corruption throughout the page.
I tracked down the issue to a stray UTF-8 character that was pasted into the content at some point. I never noticed this as it was rendering fine in editor and browser with PHP 5.6 and 7.0, but under PHP 5.5 where UTF-8 is not the standard encoding, I noticed it displayed strangely.
I have put together a simple test scenario (https://github.com/rhukster/shortcode-test) that shows this issue. If you switch the parser between RegularParser
(which works fine) and the RegexParser
or Wordpress Parser
you will see the corruption caused by that single UTF-8 charater.
NOTE: You can't test with the full document as it causes the
RegularParser
to die (see issue #27)
@rhukster I managed to track down the bug, you can see the diff https://github.com/thunderer/Shortcode/compare/69b260050bf28c3307205d768d0a5434517fae03...549030899cb1851eeab3131ea2a5cec691359432 :
RegularParser
didn't have UTF-8 encoding qualifier in mb_strpos()
,RegexParser
I found out that PREG_OFFSET_CAPTURE
does not return offset correctly, it just reports number of bytes from the start of the string. I need to calculate the offset by calculating number of characters using mb_strlen()
in part of the string cut by standard substr()
up to the offset reported from regex. Yeah, I know. Just for the reference, see this StackOverflow answer and this PHP bug ,WordpressParser
but I'm not sure if it should be implemented. This parser was meant to be copied verbatim from WordPress with all its advantages and shortcomings to provide the same experience as in WordPress. I need to investigate if there is any difference and act accordingly.I'm tempted to consider this issue resolved as the build is green and I verified results with shortcode-test
repository, but please give me your insight before I tag a new release.
Also @michaloo, could you please test your inputs once again?
I will test your fix today also.
I guess on the Wordpress parser it depends on what you intend for that parser? Obviously people with WordPress are going to use the parser in WordPress. You can't just take a WordPress shortcode plugin and just run it on some other platform simply because the shortcode is handled the same. All the events and functions are different.
My personal opinion is that it enables people to 'port' their WordPress plugins to other platforms, but frankly i'm trying to get away from all the bugs and crappy code in WordPress, so i don't want it's bugs to follow me! Please fix the Wordpress parser too because I want to maintain WordPress syntax in my plugins (to attract WordPress devs and users to Grav CMS), but I don't want it to be buggy.
@rhukster Have you tested the fix on dev-master
? Can I tag a new release?
As for the WordpressParser
, the whole idea for it was to "not fix" the issues WordPress has, because if you want fully correct regex parser, there is... RegexParser
. :) Of course the copied code was adjusted to conform to what I wanted to achieve but you're right, I probably can't fully imitate its environment with all the quirks and gotchas. I guess then that I'll only fix the issues in which the parser behavior is different than in WordPress itself. I'd probably add a note in the parser's class and in the documentation to use this parser only if you want the WordPress' behavior.
WordPress and Regex parsers are now working as expected for me! Thanks!
Thanks for your response - as for the fix for WordpressParser
I came to the conclusion that offsets are not even part of how WordPress handles shortcodes so I also fixed the offset problem there. I contacted @michaloo and I hope he replies here and confirms the proper behavior so that we can close the issue for good.
I've tested v0.5.3
and it solves the problem we experienced in our systems.
Hey,
I'm testing following assertion based on README example:
it works fine unless I put some unicode characters inside shortcode.
In case of polish character
ń
it returns'x {"name":"sample","parameters":{"arg":"val"},"content":"\u0144","bbCode":null}] y'
(notice closing]
). In case of 4 polish charactersżółć
it returns'x {"name":"sample","parameters":{"arg":"val"},"content":"\u017c\u00f3\u0142\u0107","bbCode":null}ple] y'
(noticeple]
after shortcode replacement).I'm using
PHP Version 5.5.9-1ubuntu4.11
with Multibyte Support enabled.