thephpleague / html-to-markdown

Convert HTML to Markdown with PHP
MIT License
1.77k stars 204 forks source link

Trim regular spaces from link text #220

Closed BenMorel closed 2 years ago

BenMorel commented 2 years ago

I'm not sure if omitting regular spaces was intentional, but we were very surprised by this behaviour while switching from soundasleep/html2text to this library.

In particular, here is some output in our project, before and after this PR.

Before:

[ https://www.example.com/r/pro/7 ](https://www.example.com/r/pro/7)
[ 09 80 80 00 00 ](tel:+33980800000)

After:

<https://www.example.com/r/pro/7>
[09 80 80 00 00](tel:+33980800000)
colinodell commented 2 years ago

This library attempts to preserve the original formatting as much as possible so that converting the Markdown back into HTML should give (nearly) identical results.

For example, I'm assuming that your project was parsing HTML like this:

<a href="https://www.example.com/r/pro/7"> https://www.example.com/r/pro/7 </a>

Or perhaps:

    <a href="https://www.example.com/r/pro/7">
        https://www.example.com/r/pro/7
    </a>

Which is resulting in this Markdown:

[ https://www.example.com/r/pro/7 ](https://www.example.com/r/pro/7)

If you convert that back into HTML you'll get spacing similar to the original.

That being said, I'm not opposed to stripping whitespace in cases like this.

I am curious what would happen with HTML like this though:

See<a href="https://www.example.com/r/pro/7"> https://www.example.com/r/pro/7 </a>for more details

How would this input be handled given your proposed change?

BenMorel commented 2 years ago

Unfortunately in this case, you would end up with both values concatenated:

See<https://www.example.com/r/pro/7>

I don't mind personally, but I'd understand if you do.

What surprises me is that the other blank spaces are trimmed, including line breaks, so you should already have this issue in this case:

See<a href="https://www.example.com/r/pro/7">
https://www.example.com/r/pro/7
</a>for more details
colinodell commented 2 years ago

Yeah... this is a tough one. Obviously, we're not handling spaces as well as we could. It seems like while this approach does handle certain cases, it makes others worse, and there's still an underlying problem with the general approach being used in this library.

When evaluating a potential change, there are basically three tests I like to apply:

  1. Does the change produce better Markdown?
  2. If that Markdown is converted back into HTML are the results similar enough?
  3. Is the problem ultimately caused by invalid HTML or poor handling on our end?

In this case, I think the answers are:

  1. Maybe - strong arguments could be made either way
  2. No - the results are significantly different
  3. Poor handling on our end IMO

I think a better solution would be one where the extra spacing is moved outside of the link, so that it's essentially treated like this:

-See<a href="https://www.example.com/r/pro/7"> https://www.example.com/r/pro/7 </a>for more details
+See <a href="https://www.example.com/r/pro/7">https://www.example.com/r/pro/7</a> for more details

Something like that would better satisfy a long-term solution to this problem. But that's also going to be more difficult to implement... so I'm on the fence with this PR, but leaning towards preferring a different solution instead (though I do appreciate your insights and effort here!!)

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.