Weird implementation of _DoItalicsAndBold

rantecki / pagedown

Automatically exported from code.google.com/p/pagedown

Other

0 stars 0 forks source link

Weird implementation of _DoItalicsAndBold #71

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?

Call the converter using the following markdown:
abc**def**ghi *abc*,*def*

What is the expected output?

With Gruber's implementation, the HTML output is:
<p>abc<strong>def</strong>ghi <em>abc</em>,<em>def</em></p>

If intra-word bold is disabled (like in GFM markdown), the expected HTML is:
<p>abc**def**ghi <em>abc</em>,<em>def</em></p>

What do you see instead?

Intra-word bold is converted into italic and the second word in italic is not 
converted
<p>abc*<em>def</em>*ghi <em>abc</em>,*def*</p>

I fixed the issue by reimplementing the function _DoItalicsAndBold:

        function _DoItalicsAndBold(text) {

            // <strong> must go first:
            text = text.replace(/([^\w*]|^)(\*\*|__)(?=\S)(.+?[*_]*)(?=\S)\2(?=[^\w*]|$)/g,
            "$1<strong>$3</strong>");

            text = text.replace(/([^\w*]|^)(\*|_)(?=\S)(.+?)(?=\S)\2(?=[^\w*]|$)/g,
            "$1<em>$3</em>");

            return text;
        }

Original issue reported on code.google.com by benoit.schweblin@gmail.com on 27 Dec 2013 at 6:27

GoogleCodeExporter commented 8 years ago

Further details:

In the current implementation, an italic/bold phrase has to be surrounded by 
[\W_] characters (including * and _), which means *b* is considered as bold 
instead of an intra-word pattern to be ignored when found in a**b**c (just like 
a__b__c). The correction is to replace [\W_] by [^\w*] in order to exclude * 
and _.

The other issue in the current implementation is that the previous and the next 
characters are both recorded as groups in the regular expressions, which makes 
it impossible to have two italic/bold phrases separated by only one character. 
For instance, *a* *b* won't work, and it's also problematic with non ASCII 
characters: https://github.com/benweet/stackedit/issues/243
The second correction here is just to replace the last capturing group by a 
positive lookahead.

Original comment by benoit.schweblin@gmail.com on 6 Jan 2014 at 5:39

GoogleCodeExporter commented 8 years ago

Deviating from the spec w.r.t. the "no intra-word styling" was a conscious 
decision by us (and many other Markdown implementations); see 
http://stackoverflow.com/a/10310136/115866.

The other issue you mention in your comment is fixed in 
https://code.google.com/p/pagedown/source/detail?r=ec83ecf78c0c01ab8725c0b9614b9
af6c4f6ecf5.

Original comment by b...@stackoverflow.com on 11 Apr 2014 at 8:31

Changed state: Fixed

GoogleCodeExporter commented 8 years ago

I think you are missing the first point. Your implementation does not comply to 
"no intra-word styling".
abc__def__ghi is supposed to give <p>abc__def__ghi</p> as on GitHub.
Your implementation gives <p>abc_<em>def</em>_ghi</p>.

Original comment by benoit.schweblin@gmail.com on 11 Apr 2014 at 8:54

GoogleCodeExporter commented 8 years ago

Oh yeah, I misread that indeed. I'll look at it. Thanks for being insistent :)

Original comment by b...@stackoverflow.com on 11 Apr 2014 at 9:28

Changed state: Accepted

GoogleCodeExporter commented 8 years ago

I couldn't use your version because it breaks various other things (notably 
around newlines). However I have totally reworked the bold/italics regexes, and 
this change also fixes this issue.

Changeset: 
https://code.google.com/p/pagedown/source/detail?r=39afe9b2d865cce8e1aa065221525
9167169006c

More info: http://meta.stackoverflow.com/a/229236/115866

Original comment by b...@stackoverflow.com on 12 Apr 2014 at 6:52

Changed state: Fixed