Closed getreu closed 6 months ago
Would you share an example of inputs and outputs that you would want to see, where the parser does not return an error even with malformed HTML snippets?
Also, I have the panic there because I started the library with panics, and I wanted to make sure the project had backwards compatibility. Although, I guess I should make it so that users can opt in to use the panic-able version of the library.
I just marked parse_html as deprecated, and it will be removed in future versions (af0dfe943b60af08e36440bb5f14ef10074954de). Thank you for the suggestion.
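For readers following along, here is a minimal sketch of how a panicking entry point can be kept for backwards compatibility while steering users toward a non-panicking one. Only the name parse_html comes from this thread; try_parse_html, ParseError, and the toy bracket check are assumptions made for illustration and are not the library's actual API or parsing logic.

```rust
use std::fmt;

/// Hypothetical error type; the real library's error type may differ.
#[derive(Debug, PartialEq)]
pub struct ParseError(pub String);

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "parse error: {}", self.0)
    }
}

impl std::error::Error for ParseError {}

/// Non-panicking entry point (hypothetical name).
/// Toy stand-in for a parser: it only collects the text between '<' and '>'.
pub fn try_parse_html(input: &str) -> Result<Vec<String>, ParseError> {
    let opens = input.matches('<').count();
    let closes = input.matches('>').count();
    if opens != closes {
        return Err(ParseError(format!(
            "unbalanced angle brackets: {opens} '<' vs {closes} '>'"
        )));
    }
    Ok(input
        .split('<')
        .skip(1) // everything before the first '<' is not a tag
        .filter_map(|s| s.split('>').next())
        .map(str::to_string)
        .collect())
}

/// Panicking variant, kept only so that existing callers keep compiling.
#[deprecated(note = "use `try_parse_html`, which returns a `Result` instead of panicking")]
pub fn parse_html(input: &str) -> Vec<String> {
    try_parse_html(input).expect("invalid HTML")
}
```

With this layout, existing callers of parse_html get a compiler warning rather than a breaking change, and new code can handle the error explicitly.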
Would you share an example of inputs and outputs that you would want to see, where the parser does not return an error even with malformed HTML snippets?
I am not at the office today. Tomorrow I will send you some examples. Tp-Note processes 2 kinds of HTML input:
1. Pipes from curl, example: curl http://blog.getreu.net/ | tpnote
2. HTML read from the system clipboard. This is probably the trickier part. I have no idea how well formed this HTML is. I will do some tests.
Below you find 3 examples of input data (originating from the clipboard):
In the above examples, the <html> tag is omitted. When the input comes from wget bahn.de, it looks like:
<!DOCTYPE html>
<html class="no-js" lang="de">
<head>
<script> window.bahn = window.bahn || { 'env': 'prd--default', 'cdsOrigin': 'https://www.bahn.de', 'site': 'next-bahn-de', 'lang': 'de', 'cookiePrefix': 'DB4-pb', 'pagetitle': 'DB Fahrplan, Auskunft, Tickets, informieren und buchen - Deutsche Bahn', 'pagename': 'startseite', 'section': 'Content', 'uuid': 'a5a66ce9-1eaa-41d7-87d4-1c9e52ea2bb1', 'hasWebComponent': false, 'abTestingService': { 'eventsActive': true, 'sdkKey': '2VwfARuJAzeMmnZHy6KR3' }, 'idm': { 'url': 'https://accounts.bahn.de', 'realm': '/auth/realms/db', 'active': true } } </script>
<title>DB Fahrplan, Auskunft, Tickets, informieren und buchen - Deutsche Bahn</title>
What do you expect the output from the parser to be?
The examples I sent you describe my use case for the next feature of Tp-Note: the user copies parts of the browser page into the clipboard as HTML. Tp-Note reads the clipboard and creates a new Markdown note. This should make the browser plugin "copy as Markdown" obsolete. It turns out that the snippets in the clipboard are usually correct and well-formed HTML. So far, I have only tested with Firefox under Linux.
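Since clipboard snippets omit the html tag while downloaded pages include it, one possible way to feed both to the same parser is to detect which case applies and wrap bare snippets in a minimal document. The sketch below is an assumption about how that could be done, not something Tp-Note or the parser is known to do; looks_like_full_document and wrap_snippet are hypothetical names, and the prefix check is only a heuristic.

```rust
/// Heuristic: does the input start like a complete HTML document?
/// Only the leading doctype or html tag is checked (hypothetical helper).
pub fn looks_like_full_document(input: &str) -> bool {
    let head = input.trim_start().to_ascii_lowercase();
    head.starts_with("<!doctype html") || head.starts_with("<html")
}

/// Wraps a bare snippet in a minimal document; full documents pass through
/// unchanged (hypothetical helper).
pub fn wrap_snippet(input: &str) -> String {
    if looks_like_full_document(input) {
        input.to_string()
    } else {
        format!("<!DOCTYPE html>\n<html>\n<body>\n{input}\n</body>\n</html>")
    }
}
```

A snippet such as `<p>Hello</p>` would be wrapped, while the wget output shown earlier would be passed through as-is.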
So, can I close this issue?
Yes. When your library understands html and meta nodes, I'll test again. So far it looks good to me.
I like your parser, it has very few dependencies and a solid design. I am currently testing it for tp-note. The use case: in case the clipboard contains HTML, a filter is needed to convert the input into CommonMark compliant Markdown. The input may contain complete HTML documents or just snippets, thus my previous feature request.
Going through your Rustdoc, I discovered panics if .... . Fortunately you offer safe variants of panicking functions. To my knowledge, it is common practice that libraries must never panic. What is your use case?
Feature request: Besides, my first tests indicate that your (safe) parser might be too strict for my use case. I would probably need a parser that also processes imperfect or incorrect input, even where the input is not perfectly valid HTML5.
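To make "processes imperfect input" a bit more concrete, here is a rough sketch of a lenient tokenizer that never returns an error: anything it cannot interpret as a tag is kept as text. It is a toy under stated assumptions (no attribute parsing, no handling of '>' inside quoted attribute values, comments and doctypes are treated like ordinary tags) and is not how the library discussed here works; Token and tokenize_lenient are hypothetical names.

```rust
/// Token kinds produced by the toy tokenizer (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Token {
    Open(String),
    Close(String),
    Text(String),
}

/// Splits the input into tags and text without ever failing.
/// An unterminated '<' is kept as plain text instead of raising an error.
pub fn tokenize_lenient(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        let Some(start) = rest.find('<') else {
            // No more tags: the remainder is text.
            tokens.push(Token::Text(rest.to_string()));
            break;
        };
        if start > 0 {
            tokens.push(Token::Text(rest[..start].to_string()));
        }
        let after = &rest[start + 1..];
        let Some(end) = after.find('>') else {
            // Malformed: '<' without a closing '>'. Keep it as text.
            tokens.push(Token::Text(rest[start..].to_string()));
            break;
        };
        let inner = after[..end].trim();
        if let Some(name) = inner.strip_prefix('/') {
            tokens.push(Token::Close(name.trim().to_ascii_lowercase()));
        } else {
            // Only the tag name is kept; attributes are ignored in this sketch.
            let name = inner.split_whitespace().next().unwrap_or("");
            tokens.push(Token::Open(name.to_ascii_lowercase()));
        }
        rest = &after[end + 1..];
    }
    tokens
}
```

A tree builder on top of such a token stream could then close any tags that are still open at the end of input, rather than rejecting the whole snippet.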
Once you publish 0.7, I will test again how bad the copied HTML in the clipboard can be.