textile / php-textile

Textile markup language parser for PHP
https://textile-lang.com
Other
216 stars 44 forks source link

Too relaxed parsing for special characters, here "X" #158

Closed neffets closed 8 years ago

neffets commented 9 years ago

Its fine to parse "4X5" to "4×5".

But its a bug for parsing codes like "DE000A13R6X4" to "DE000A13R6×4".

Please use stricter parsing like "\b\d+[xX]\d+\b" or "\b\d+[xX]\d+\b

netcarver commented 8 years ago

Steffen, will look into this later this week when I look at merging your pull request. Thank you.

netcarver commented 8 years ago

Steffen, I've now had a chance to look at this issue and, unfortunately, the fix will not be as simple as applying word boundary matches to the regex as this causes existing test case failures.

I can think of several additional cases where not applying the dimension glyph may come in useful and these are all in encodings...

I'll experiment some more with ways of possibly matching these before the dimension glyph is applied and report back later.

lauri-elevant commented 8 years ago

Thanks for a wonderful library! We use it extensively!

This issue affects us too with precisely the circumstances described by neffets. Maybe a better solution would be to allow the user to choose what glpyhs to replace:

  1. add a method to manipulate the $symbols member

    ->replaceGlyph('dimension', true | false) ??

  2. check if the glpyh needs to be replaced

    if (!empty($this->symbols['dimension'])) { // Dimension sign $this->glyph_search[] = '/([0-9]+[])]?[\'"]? ?)[xX]( ?[[(]?)(?=[+-]?'.$cur.'[0-9]*.?[0-9]+)/'. $this->regex_snippets['mod']; $this->glyph_replace[] = '$1'.$this->symbols['dimension'].'$2';

    }

netcarver commented 8 years ago

Thanks for the feedback @lauri-elevant. Could you and @neffets please try out the issue-158 branch and let me know if this solves the issue for you? The new branch allows you to solve the problem by callingsetStrictDimensionSymbolParsing(true) before parsing your text.

The new strict dimension parsing mode will only perform dimension sign replacements if there are spaces around the x or X in the source text. In other words, 1 x 2 will lead to a replacement (as there are spaces before and after the x) but 1x2, 1 x2 and 1x 2 will not trigger a replacement. This should allow ISBNs, hexadecimal notation and other encodings with embedded 'x' chars to go through unaltered.

Hope that helps!

neffets commented 8 years ago

No. Now in branch issue-158 with setStrictDimensionSymbolParsing(true) it replaces nothing.

here is my text example

DE000A13R6X4 -> the X should remain an uppercase letter
34x25 12X2 -> here its allowed to become twice an dimension sign x
|DE000A13R6X4|in tables too| not allowed |
| 34X45 |Text 12x3 | here both are allowed |
| 34XAnton | Text 34XAnton | not allowed |

It does not replace the patterns " 34X45 " or " 12x3 "

netcarver commented 8 years ago

@neffets, please try out the issue-158-take-2 branch. You shouldn't need to do any special setup this time.

neffets commented 8 years ago

Yep, the branch "issue-158-take-2" is working OK.

netcarver commented 8 years ago

Merged into master in ed249d3. Thanks for the report.

lauri-elevant commented 8 years ago

7X2ULJWT3TI ..still is converted to: 7×2ULJWT3TI

(DE000A13R6X4 is okay)

Would it maybe be still better to be able to turn off certain glyph substitutions?