locopablo / markdownsharp

Automatically exported from code.google.com/p/markdownsharp

HTML block detection regex from Markdown PHP -- is this correct? #20

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Please try this patch to implement the HashHTMLBlocks() algorithm from
Markdown PHP.

It *seems* to work, but results in two different unit test failures
compared to the HashHTMLBlocks() we inherited from Markdown.pl 1.0.1.

Did I translate the regex correctly? Can someone please double-check my work here?

The two unit test failures we have are related to faults in the
HashHTMLBlocks() routine, so I'd like to pull in the one from Markdown PHP
if possible. I just need to verify whether I got anything wrong here.

Original issue reported on code.google.com by wump...@gmail.com on 6 Jan 2010 at 7:25

GoogleCodeExporter commented 9 years ago
Well, I set the PHP nestdepth to 6, as it is in MarkdownSharp, and then printed
out both patterns after construction, with all whitespace normalized to a single
space. As far as I can see, the only difference between the two was that the .cs
version was using \2 in some places where the PHP version uses \3. The error was
that these two lines:

            pattern = pattern.Replace("$content", content);
            pattern = pattern.Replace("$content2", content2);

should be the other way around:

            pattern = pattern.Replace("$content2", content2);
            pattern = pattern.Replace("$content", content);

Because if you replace $content first, that also replaces the $content part of
$content2, since $content is a prefix of $content2. After swapping these two
lines and comparing again, the patterns were exactly identical between PHP and
C#. So, assuming the PHP version works, the C# version should work as well.
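
To make the collision concrete, here is a minimal sketch. The template string
and the values "X" and "Y" are made up for illustration; the real pattern is
much longer, but the ordering problem is the same.

```csharp
using System;

public static class PlaceholderOrderDemo
{
    // Hypothetical template, not the real HashHTMLBlocks() pattern.
    const string Template = "A:$content B:$content2";

    // Buggy order: "$content" also matches the start of "$content2",
    // so the second Replace finds nothing left to substitute.
    public static string ShortFirst(string content, string content2)
    {
        return Template.Replace("$content", content)
                       .Replace("$content2", content2);
    }

    // Fixed order: substitute the longer placeholder first.
    public static string LongFirst(string content, string content2)
    {
        return Template.Replace("$content2", content2)
                       .Replace("$content", content);
    }

    public static void Main()
    {
        Console.WriteLine(ShortFirst("X", "Y")); // prints "A:X B:X2"
        Console.WriteLine(LongFirst("X", "Y"));  // prints "A:X B:Y"
    }
}
```

The leftover "2" in the first result is the same kind of stray text that would
end up inside the generated pattern.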

Original comment by hviturha...@gmail.com on 6 Jan 2010 at 9:22

GoogleCodeExporter commented 9 years ago
ah, excellent! thank you. I totally missed that.

Unfortunately, the corrected regex still results in 2 failing tests, in slightly
disturbing ways. It looks like the PHP version has some additional logic around
blocks, determining which ones get wrapped in <p> and which ones do not.

:(

Original comment by wump...@gmail.com on 6 Jan 2010 at 9:59

GoogleCodeExporter commented 9 years ago
Well, it seems that one of the new failing tests, the ordered/unordered lists
one that wraps <hr /> in a <p>, fails because too many lines are trimmed after
the lists. At one point you have

[some list]
[blank line]
<hr />

But then, after the list is processed, the blank line before <hr /> goes away,
so we have

[processed list]
<hr />

which makes the block regex not match it, since it requires blocks to have an
empty line above. So maybe adding a newline after a list, at least a top-level
list, might fix that. Or just hack it afterwards by putting a \n before any <hr>
that doesn't have a preceding blank line.
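
A rough sketch of that second idea, for illustration only. The two regexes
below are simplified stand-ins I made up, not the actual HashHTMLBlocks()
pattern, and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrBlankLineDemo
{
    // Simplified stand-in for the block rule: <hr> only counts as a block
    // when it sits at the start of the text or right after a blank line.
    public static bool HasStandaloneHr(string text)
    {
        return Regex.IsMatch(text, @"(?:\A|\n\n)<hr */?>[ \t]*(?=\n|\z)");
    }

    // The "hack it afterwards" idea: add one newline before any <hr>
    // whose preceding line is not blank.
    public static string EnsureBlankLineBeforeHr(string text)
    {
        return Regex.Replace(text, @"(?<!\n)\n(<hr */?>)", "\n\n$1");
    }

    public static void Main()
    {
        string processed = "[processed list]\n<hr />";
        Console.WriteLine(HasStandaloneHr(processed));                          // False
        Console.WriteLine(HasStandaloneHr(EnsureBlankLineBeforeHr(processed))); // True
    }
}
```

Input that already has the blank line is left unchanged, because the lookbehind
refuses to match a newline that follows another newline.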

Original comment by hviturha...@gmail.com on 6 Jan 2010 at 10:13

GoogleCodeExporter commented 9 years ago
Well, I have a different hack checked in now -- along with the new
HashHTMLBlocks() routine!! Thanks a million for your help on that.

Unwrappable() is kind of nasty; maybe take a look and see if you have any
better ideas. But Unwrappable() seems mostly safe to me, if unpleasant.

Original comment by wump...@gmail.com on 6 Jan 2010 at 10:56

GoogleCodeExporter commented 9 years ago
>  So maybe adding a newline after a list

Yes, that was the fix -- pretty subtle, but in ListItemEvaluator() ...

item = RunBlockGamut(Outdent(item) + "\n");

Original comment by wump...@gmail.com on 7 Jan 2010 at 3:47