izyuumi / html2md-rs

HTML to Markdown Parser in Rust
https://crates.io/crates/html2md-rs
MIT License
13 stars, 2 forks

Relax `html2md-rs`? #18

Closed getreu closed 6 months ago

getreu commented 6 months ago

I like your parser: it has very few dependencies and a solid design. I am currently testing it for tp-note. The use case: when the clipboard contains HTML, a filter is needed to convert the input into CommonMark-compliant Markdown. The input may be a complete HTML document or just a snippet, hence my previous feature request.

Going through your Rustdoc, I discovered panics if .... . Fortunately, you offer safe variants of the panicking functions. To my knowledge, it is common practice that libraries must never panic. What is your use case?

Feature request: besides, my first tests indicate that your (safe) parser might be too strict for my use case. I would probably need a parser that also processes imperfect or incorrect input, even when the input is not perfectly valid HTML5.

Once you publish 0.7, I will test again to see how bad the HTML copied to the clipboard can be.

izyuumi commented 6 months ago

Could you share an example of inputs, and the outputs you would want to see, where malformed HTML snippets should not return an error?

Also, the panic is there because I started the library with panics and wanted to keep the project backwards compatible. That said, I guess I should make the panicking version of the library something users have to opt in to.

izyuumi commented 6 months ago

I just marked parse_html as deprecated, and it will be removed in future versions (af0dfe943b60af08e36440bb5f14ef10074954de). Thank you for the suggestion.
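For readers following along, the general pattern discussed here (a `Result`-returning function as the primary entry point, with the panicking variant kept only as a deprecated wrapper) might look roughly like the sketch below. The names `try_parse`, `parse`, and `ParseError` are hypothetical and chosen for illustration; they are not the crate's actual API, and the "parsing" is a toy bracket count.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; not the crate's real error type.
#[derive(Debug, PartialEq)]
pub struct ParseError(pub String);

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "parse error: {}", self.0)
    }
}

impl std::error::Error for ParseError {}

/// Non-panicking entry point: malformed input is reported through `Result`.
/// The check is a toy one (it only compares the number of '<' and '>').
pub fn try_parse(input: &str) -> Result<String, ParseError> {
    let opens = input.matches('<').count();
    let closes = input.matches('>').count();
    if opens != closes {
        return Err(ParseError(format!(
            "unbalanced angle brackets: {} '<' vs {} '>'",
            opens, closes
        )));
    }
    Ok(input.trim().to_string())
}

/// Panicking variant kept only for backwards compatibility.
/// The attribute makes the compiler warn at every call site.
#[deprecated(note = "panics on malformed input; use `try_parse` instead")]
pub fn parse(input: &str) -> String {
    try_parse(input).expect("malformed HTML")
}
```

With this shape, existing callers of the panicking function keep compiling but see a deprecation warning, while new callers handle the `Result` themselves.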

getreu commented 6 months ago

> Could you share an example of inputs, and the outputs you would want to see, where malformed HTML snippets should not return an error?

I am not at the office today. Tomorrow I will send you some examples. Tp-Note processes two kinds of HTML input:

  1. Pipes from curl, example:

    curl http://blog.getreu.net/ | tpnote

  2. HTML read from the system clipboard. This is probably the trickier part. I have no idea how well formed this HTML is. I will do some tests.
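A minimal sketch of how the two input paths could be funnelled into one code path: read everything from any `Read` source (stdin for the pipe case, a byte buffer for clipboard content) and then guess whether it is a complete document or a snippet. The names `read_all`, `classify`, and `HtmlInput` are hypothetical, and the prefix check is only a heuristic; this is neither Tp-Note's nor html2md-rs's actual code.

```rust
use std::io::{self, Read};

/// Hypothetical classification of the incoming HTML.
#[derive(Debug, PartialEq)]
pub enum HtmlInput {
    /// Starts with a doctype or an <html> tag.
    Document,
    /// Anything else, e.g. a fragment copied from a browser.
    Snippet,
}

/// Reads the whole input as UTF-8 text from any reader,
/// for example `std::io::stdin()` when data is piped in.
pub fn read_all<R: Read>(mut reader: R) -> io::Result<String> {
    let mut buf = String::new();
    reader.read_to_string(&mut buf)?;
    Ok(buf)
}

/// Heuristic: look only at the first few characters after leading whitespace.
pub fn classify(input: &str) -> HtmlInput {
    let head: String = input
        .trim_start()
        .chars()
        .take(15)
        .collect::<String>()
        .to_ascii_lowercase();
    if head.starts_with("<!doctype html") || head.starts_with("<html") {
        HtmlInput::Document
    } else {
        HtmlInput::Snippet
    }
}
```

In the pipe case one would call `read_all(std::io::stdin())`; how the clipboard content is obtained is platform-specific and left out here.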

getreu commented 6 months ago

Below you find 3 examples of input data (originating from the clipboard):

log.txt log2.txt log3.txt

In the above examples, the <html> tag is omitted. When the input comes from wget bahn.de, it looks like this:

<!DOCTYPE html>
<html class="no-js" lang="de">
  <head>
<script> window.bahn = window.bahn || { 'env': 'prd--default', 'cdsOrigin': 'https://www.bahn.de', 'site': 'next-bahn-de', 'lang': 'de', 'cookiePrefix': 'DB4-pb', 'pagetitle': 'DB Fahrplan, Auskunft, Tickets, informieren und buchen - Deutsche Bahn', 'pagename': 'startseite', 'section': 'Content', 'uuid': 'a5a66ce9-1eaa-41d7-87d4-1c9e52ea2bb1', 'hasWebComponent': false, 'abTestingService': { 'eventsActive': true, 'sdkKey': '2VwfARuJAzeMmnZHy6KR3' }, 'idm': { 'url': 'https://accounts.bahn.de', 'realm': '/auth/realms/db', 'active': true } } </script>

<title>DB Fahrplan, Auskunft, Tickets, informieren und buchen - Deutsche Bahn</title>

izyuumi commented 6 months ago

What do you expect the output from the parser to be?

getreu commented 6 months ago

log.txt.md log2.txt.md log3.txt.md

izyuumi commented 6 months ago

How about something like this (36ec69b)? Any unclosed tag will now be closed automatically; see the tests.
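To illustrate the idea of closing unclosed tags automatically, here is a rough, self-contained sketch. It is not the code from the commit above: the function name `close_unclosed_tags` and the short void-element list are assumptions made for this example, and it ignores edge cases such as `<` inside scripts or `>` inside comments.

```rust
/// Elements that never take a closing tag (deliberately incomplete list).
const VOID_ELEMENTS: [&str; 6] = ["br", "hr", "img", "input", "link", "meta"];

/// Appends closing tags for every element still open at the end of `html`.
pub fn close_unclosed_tags(html: &str) -> String {
    // Stack of currently open element names, innermost last.
    let mut open: Vec<String> = Vec::new();
    let mut rest = html;

    while let Some(start) = rest.find('<') {
        let after = &rest[start + 1..];
        let end = match after.find('>') {
            Some(end) => end,
            None => break, // truncated tag at the end of the input
        };
        let tag = &after[..end];
        rest = &after[end + 1..];

        // Skip doctype, comments, processing instructions, self-closing tags.
        if tag.starts_with('!') || tag.starts_with('?') || tag.ends_with('/') {
            continue;
        }
        if let Some(name) = tag.strip_prefix('/') {
            let name = name.trim().to_ascii_lowercase();
            // Pop back to the matching opening tag; ignore stray closing tags.
            if let Some(pos) = open.iter().rposition(|n| *n == name) {
                open.truncate(pos);
            }
            continue;
        }
        // Opening tag: the element name is the first whitespace-separated token.
        let name = tag
            .split_whitespace()
            .next()
            .unwrap_or("")
            .to_ascii_lowercase();
        if !name.is_empty() && !VOID_ELEMENTS.contains(&name.as_str()) {
            open.push(name);
        }
    }

    // Close whatever is still open, innermost element first.
    let mut out = String::from(html);
    for name in open.iter().rev() {
        out.push_str("</");
        out.push_str(name);
        out.push('>');
    }
    out
}
```

For example, `close_unclosed_tags("<p>Hello <b>world")` yields `<p>Hello <b>world</b></p>`, while input that is already balanced is returned unchanged.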

getreu commented 6 months ago

The examples I sent you describe my use case for the next feature of Tp-Note: the user copies parts of a browser page into the clipboard as HTML. Tp-Note reads the clipboard and creates a new Markdown note. This should make the "copy as Markdown" browser plugin obsolete. It turns out that the snippets in the clipboard are usually correct, well-formed HTML. So far I have only tested with Firefox under Linux.

izyuumi commented 6 months ago

So, can I close this issue?

getreu commented 6 months ago

Yes. When your library understands html and meta nodes, I'll test again. So far it looks good to me.