thephpleague / html-to-markdown

Convert HTML to Markdown with PHP
MIT License

Problem with <p> in <div> #132

Closed AntonSmatanik closed 7 years ago

AntonSmatanik commented 7 years ago

When I'm parsing this code:

```html
<h4>
    What Galaxy Pig has to offer?
</h4>
<div>
    <p>
        Galaxy Pig offers all round entertainment with big jackpots and lots of different Casino games. For each customer Galaxy Pig offers fun and excitement, from the latest video slot
        machines to the more traditional ones. You can also play table games like Roulette and Black Jack. Galaxy Pig also offers some special games, not found in most other Casinos.
        Whatever Casino game you like, you will find it at Galaxy Pig!
    </p>
</div>
```

Result is:

```
### What Galaxy Pig has to offer?

<div>Galaxy Pig offers all round entertainment with big jackpots and lots of different Casino games. For each customer Galaxy Pig offers fun and excitement, from the latest video slot machines to the more traditional ones. You can also play table games like Roulette and Black Jack. Galaxy Pig also offers some special games, not found in most other Casinos. Whatever Casino game you like, you will find it at Galaxy Pig!

</div>
```

So the `<p>` element is gone :(

andreskrey commented 7 years ago

I'm not sure where the problem is here. What would be the expected result?

The div converter only escapes inner tags (through `html_entity_decode`), and the p converter escapes special sequences that are reserved in Markdown.

Can you provide us with the markdown you are expecting from the converter?

AntonSmatanik commented 7 years ago

I think the expected behavior is to not remove them. I'm converting HTML to Markdown and vice versa, and the result must be the same.

```
### What Galaxy Pig has to offer?

<div><p>Galaxy Pig offers all round entertainment with big jackpots and lots of different Casino games. For each customer Galaxy Pig offers fun and excitement, from the latest video slot machines to the more traditional ones. You can also play table games like Roulette and Black Jack. Galaxy Pig also offers some special games, not found in most other Casinos. Whatever Casino game you like, you will find it at Galaxy Pig!</p></div>
```

Is there any option to preserve them?

andreskrey commented 7 years ago

To be honest I'm not sure what would be the correct approach. CommonMark doesn't have a specific sequence for paragraphs so basically every line that doesn't have any CommonMark tag is a paragraph itself.

Maybe we should add a line break between paragraphs? So the final html would be something like:

```html
<div>
Galaxy Pig offers all round entertainment with big jackpots and lots of different Casino games. For each customer Galaxy Pig offers fun and excitement, from the latest video slot machines to the more traditional ones. You can also play table games like Roulette and Black Jack. Galaxy Pig also offers some special games, not found in most other Casinos. Whatever Casino game you like, you will find it at Galaxy Pig!
</div>
```

Your Markdown-to-HTML converter could detect a line without CommonMark tags and convert it to a p tag. But to be honest I'm not sure if this is correct. Maybe @colinodell has an idea around this.

AntonSmatanik commented 7 years ago

That is not good for me, because I need to preserve such elements. So for now I'm converting them from `<p>` to a placeholder and then, after everything is finished, back from the placeholder to `<p>`. But it is quite a strange approach.
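
The swap-and-restore workaround described in this comment could be sketched roughly as below. This is only an illustration, not code from the thread: the token strings and the function names `protectParagraphs`, `restoreParagraphs`, and `convertKeepingParagraphs` are hypothetical, the actual placeholder used by the commenter is not known, and attributes on `<p>` are dropped for simplicity.

```php
<?php
// Hypothetical placeholder tokens (an assumption, not from the thread).
// Letters only, so escaping of characters such as "_" or "*" cannot alter them.
const P_OPEN  = 'XPARAOPENX';
const P_CLOSE = 'XPARACLOSEX';

/** Swap <p ...> and </p> for placeholder tokens before conversion. */
function protectParagraphs(string $html): string
{
    // Matches <p> and <p attr="...">, but not <pre> or <param>.
    $html = preg_replace('~<p(\s[^>]*)?>~i', P_OPEN, $html);
    return preg_replace('~</p\s*>~i', P_CLOSE, $html);
}

/** Put plain <p> and </p> back once conversion has finished. */
function restoreParagraphs(string $markdown): string
{
    return str_replace([P_OPEN, P_CLOSE], ['<p>', '</p>'], $markdown);
}

/** Run any HTML-to-Markdown callable with the paragraphs protected. */
function convertKeepingParagraphs(string $html, callable $convert): string
{
    return restoreParagraphs($convert(protectParagraphs($html)));
}

// Possible usage with the library (requires league/html-to-markdown):
// $converter = new League\HTMLToMarkdown\HtmlConverter();
// $markdown  = convertKeepingParagraphs($html, [$converter, 'convert']);
```

Passing the converter in as a callable keeps the helper independent of any particular library version. Whether the tokens survive a given converter untouched still has to be checked against real input.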

andreskrey commented 7 years ago

There's an open PR right now that adds the functionality to keep certain tags. Unfortunately it has gone stale although you could import it to your own fork (and solve the current conflicts) as a workaround.

colinodell commented 7 years ago

I agree with @andreskrey:

> CommonMark doesn't have a specific sequence for paragraphs so basically every line that doesn't have any CommonMark tag is a paragraph itself.

`<p>` tags should not be preserved, at least not by default. If you really wanted to keep them, I'd try instantiating your own Environment and adding every converter except for the Paragraph one:


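```php
use League\HTMLToMarkdown\Converter\{
    BlockquoteConverter, CodeConverter, CommentConverter, EmphasisConverter,
    HardBreakConverter, HeaderConverter, HorizontalRuleConverter, ImageConverter,
    LinkConverter, ListBlockConverter, ListItemConverter, PreformattedConverter,
    TextConverter
};
use League\HTMLToMarkdown\Environment;
use League\HTMLToMarkdown\HtmlConverter;

// Start from an empty environment and register converters by hand.
$environment = new Environment();

$environment->addConverter(new BlockquoteConverter());
$environment->addConverter(new CodeConverter());
$environment->addConverter(new CommentConverter());
// $environment->addConverter(new DivConverter());
$environment->addConverter(new EmphasisConverter());
$environment->addConverter(new HardBreakConverter());
$environment->addConverter(new HeaderConverter());
$environment->addConverter(new HorizontalRuleConverter());
$environment->addConverter(new ImageConverter());
$environment->addConverter(new LinkConverter());
$environment->addConverter(new ListBlockConverter());
$environment->addConverter(new ListItemConverter());
// Left out on purpose so that <p> is not converted:
// $environment->addConverter(new ParagraphConverter());
$environment->addConverter(new PreformattedConverter());
$environment->addConverter(new TextConverter());

$htmlToMarkdown = new HtmlConverter($environment);
```
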
AntonSmatanik commented 7 years ago

Thanks