matthewwithanm / python-markdownify

Convert HTML to Markdown
MIT License
1.17k stars 140 forks source link

Escape all characters with Markdown significance #118

Closed jsm28 closed 8 months ago

jsm28 commented 8 months ago

There are many punctuation characters that sometimes have significance in Markdown; more systematically escape them all (based on a new escape_misc configuration option).

A limited attempt is made to limit the escaping of '.' and ')' to the context where they might have Markdown significance (after a number, where they can indicate an ordered list item); no such attempt is made for the other characters (and even that limiting of '.' and ')' may not be entirely safe in all cases, as it's possible the HTML could have the number outside the block being escaped in one go, e.g. <span>1</span>..

Fixes #99

AlexVonB commented 8 months ago

Thats a great contribution, thank you! I only removed the looping over the text with a re.sub, hope I did not introduce more complexity with that.

alfonsrv commented 2 months ago

This is a bad change and bad code that breaks a variety of things; it feels a lot like over-escaping.

E.g. even regular links like this https://admin.microsoft.com/AdminPortal/Home#/partners/invitation/granularAdminRelationships/ffffffffff-4440-40e3-a609-ffffffff-ffffffffff-ef31-4ef0-a6b8-ffffffffff now become completely unusable – though no markdown rendering engine would interpret any of the characters of markdown-significance https://admin.microsoft.com/AdminPortal/Home\\#/partners/invitation/granularAdminRelationships/ffffffffff\\-4440\\-40e3\\-a609\\-ffffffff\\-ffffffffff\\-ef31\\-4ef0\\-a6b8\\-ffffffffff\\

Especially setting escape_misc to True by default(!) and breaking already-implemented production usages of this library is an absolute faux-pas.

It is so bad that just 13 days after this PR was handed in, a new PR https://github.com/matthewwithanm/python-markdownify/pull/122 was submitted to fix what this one so heavy handedly "implemented".

valstu commented 2 months ago

Agree with @alfonsrv, the default should be set to False.