thephpleague / html-to-markdown

Convert HTML to Markdown with PHP
MIT License
1.77k stars 204 forks source link

Links with underscores will be escaped #204

Closed dragonito closed 3 years ago

dragonito commented 3 years ago

Hi there,

got a problem with a special kind of urls looking like

https://example.de/this_is_a_nice_file.ext

it will be converted to:

https://example.de/this\_is\_a\_nice\_file.ext

Perhaps you got an idea to fix it?

Cu Robin

colinodell commented 3 years ago

Would you be able to share the exact HTML that causes this behavior? That will help us to replicate the issue and come up with a fix :)

dragonito commented 3 years ago

Its really simple:

<p>https://www.dragonito.net/test_hallo_welt.pdf<br/></p>

will be

https://www.dragonito.net/test\_hallo\_welt.pdf

I use some options:

$options = [ 'strip_tags' => true, 'header_style' => 'atx', ];

hope it helps to find the problem, thanks a lot :D

colinodell commented 3 years ago

The problem is occurring here:

https://github.com/thephpleague/html-to-markdown/blob/7bd4aa559db243d1b3892d38c7853abacfd70a6c/src/Converter/TextConverter.php#L22-L24

This logic is unaware of the complex rules that CommonMark has for determining whether underscore characters within words would be treated as emphasis or not. Unfortunately, this isn't going to be easy to fix in a reliable way without porting over all of that delimiter parsing functionality, which is very low on my list of open-source priorities right now.

I'd be open to any ideas for easier fixes that fix this particular case without breaking our existing tests.

colinodell commented 3 years ago

A potential naive solution might be breaking this regex replacement into three steps:

  1. Escape any \ characters first (to prevent double-escaping later)
  2. Escape any _ characters that do not appear within words (without accounting for all of CommonMark's edge cases)
  3. Escape the other characters as usual

It wouldn't be perfect but it would be much closer to what you'd expect.

dragonito commented 3 years ago

hi @colinodell i played a little bit, added a test and check if the plain text is an url, if its an url I ignore _ escaping. Perhaps not the best solution, but a possible inspiration for fixing it

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.