mity / md4c

C Markdown parser. Fast. SAX-like interface. Compliant to CommonMark specification.
MIT License

Adding an HTML to Markdown parser/converter (html2md) #240

Closed AvtechScientific closed 4 months ago

AvtechScientific commented 4 months ago

Please also add an HTML-to-Markdown parser/converter (html2md). And thank you for the great software!

step- commented 4 months ago

This same request came up not too long ago: #180.

ec1oud commented 4 months ago

That sounds complicated (because HTML is complicated, in general), unless you mean that only a subset should be supported. But if you already use Qt for some other purpose (typically for a portable GUI), you can use https://doc.qt.io/qt-6/qtextdocument.html#setHtml and then https://doc.qt.io/qt-6/qtextdocument.html#toMarkdown. Qt's HTML parser also supports only a subset, but it works ok for writing a WYSIWYG "rich text" editor or viewer.

If other libraries already do this conversion, why duplicate effort? A comment on #180 does indeed link to another alternative (which I haven't tried). There are also plenty of XML parsers already, which could be pressed into service for this purpose.

AvtechScientific commented 4 months ago

Other libraries also exist for Markdown-to-HTML conversion, but we all use md4c for a reason: its extreme speed and the elegance of its implementation. A conversion in the opposite direction (html2md) would have the same advantages.

And yes, I did mean parsing/converting only a subset of HTML: exactly the subset that the library's md2html generates. Imagine that you have the rendered HTML of a document, but the source was lost over time and you want to edit that document. In that case you need a tool that reproduces the Markdown source for you.
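To make the "subset only" idea concrete, here is a rough sketch in C of what such a converter could look like. It is not part of md4c: the name `html2md_subset`, the tag table, and the buffer-based interface are all hypothetical, and it assumes input shaped like typical md2html output (lowercase tags without attributes, a newline after each closing block tag). It covers only a few headings, paragraphs, emphasis, inline code and four entities; anything it does not recognize is copied through unchanged, and Markdown-special characters in the text are not escaped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mapping from an HTML token to its Markdown replacement. */
struct h2m_map { const char *html; const char *md; };

static const struct h2m_map h2m_table[] = {
    {"<h1>", "# "}, {"<h2>", "## "}, {"<h3>", "### "},
    {"</h1>", "\n\n"}, {"</h2>", "\n\n"}, {"</h3>", "\n\n"},
    {"<p>", ""}, {"</p>", "\n\n"},
    {"<em>", "*"}, {"</em>", "*"},
    {"<strong>", "**"}, {"</strong>", "**"},
    {"<code>", "`"}, {"</code>", "`"},
    {"&amp;", "&"}, {"&lt;", "<"}, {"&gt;", ">"}, {"&quot;", "\""},
};

/* Append s to out without exceeding cap; out stays NUL-terminated. */
static void h2m_put(char *out, size_t cap, size_t *len, const char *s)
{
    while (*s != '\0' && *len + 1 < cap)
        out[(*len)++] = *s++;
    out[*len] = '\0';
}

/* Convert a small md2html-like HTML subset to Markdown.
 * Writes at most cap-1 characters plus a NUL into out and returns the
 * number of characters written. Output is silently truncated if the
 * buffer is too small. */
size_t html2md_subset(const char *html, char *out, size_t cap)
{
    const size_t n = sizeof h2m_table / sizeof h2m_table[0];
    size_t len = 0;

    assert(html != NULL && out != NULL && cap > 0);
    out[0] = '\0';

    while (*html != '\0') {
        size_t i;

        /* Try every known tag or entity at the current position. */
        for (i = 0; i < n; i++) {
            size_t k = strlen(h2m_table[i].html);
            if (strncmp(html, h2m_table[i].html, k) == 0) {
                h2m_put(out, cap, &len, h2m_table[i].md);
                html += k;
                /* A closing block tag already produced a blank line,
                 * so drop the newline that follows it in the input. */
                if (h2m_table[i].md[0] == '\n' && *html == '\n')
                    html++;
                break;
            }
        }

        /* No match: copy one character through unchanged. */
        if (i == n) {
            char one[2];
            one[0] = *html++;
            one[1] = '\0';
            h2m_put(out, cap, &len, one);
        }
    }
    return len;
}
```

Calling `html2md_subset("<h1>Title</h1>\n<p>Some <em>text</em></p>\n", buf, sizeof buf)` on a sufficiently large `buf` would leave `# Title`, a blank line, and `Some *text*` in the buffer. A real converter would also need lists, links, images, code blocks, block quotes and proper escaping, which is where most of the complexity mentioned above comes from.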

step- commented 4 months ago

An alternative HTML-to-Markdown converter is pandoc -f html -t markdown_strict, which has many options for tuning the conversion process.

mity commented 4 months ago

If anybody wants to contribute such code, I'm open to considering it for inclusion in the tree. But I have no plans to work on this myself.