Closed BenMorel closed 2 years ago
This library attempts to preserve the original formatting as much as possible so that converting the Markdown back into HTML should give (nearly) identical results.
For example, I'm assuming that your project was parsing HTML like this:
<a href="https://www.example.com/r/pro/7"> https://www.example.com/r/pro/7 </a>
Or perhaps:
<a href="https://www.example.com/r/pro/7">
https://www.example.com/r/pro/7
</a>
Which is resulting in this Markdown:
[ https://www.example.com/r/pro/7 ](https://www.example.com/r/pro/7)
If you convert that back into HTML you'll get spacing similar to the original.
That being said, I'm not opposed to stripping whitespace in cases like this.
I am curious what would happen with HTML like this though:
See<a href="https://www.example.com/r/pro/7"> https://www.example.com/r/pro/7 </a>for more details
How would this input be handled given your proposed change?
Unfortunately in this case, you would end up with both values concatenated:
See<https://www.example.com/r/pro/7>
I don't mind personally, but I'd understand if you do.
What surprises me is that the other blank spaces are trimmed, including line breaks, so you should already have this issue in this case:
See<a href="https://www.example.com/r/pro/7">
https://www.example.com/r/pro/7
</a>for more details
Yeah... this is a tough one. Obviously, we're not handling spaces as well as we could. It seems like while this approach does handle certain cases, it makes others worse, and there's still an underlying problem with the general approach being used in this library.
When evaluating a potential change, there are basically three tests I like to apply:
In this case, I think the answers are:
I think a better solution would be one where the extra spacing is moved outside of the link, so that it's essentially treated like this:
-See<a href="https://www.example.com/r/pro/7"> https://www.example.com/r/pro/7 </a>for more details
+See <a href="https://www.example.com/r/pro/7">https://www.example.com/r/pro/7</a> for more details
Something like that would better satisfy a long-term solution to this problem. But that's also going to be more difficult to implement... so I'm on the fence with this PR, but leaning towards preferring a different solution instead (though I do appreciate your insights and effort here!!)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I'm not sure if omitting regular spaces was intentional, but we were very surprised by this behaviour while switching from soundasleep/html2text to this library.
In particular, here is some output in our project, before and after this PR.
Before:
After: