suchnsuch / Tangent

The publicly-available modules of the Tangent project.

[html-to-markdown] Turndown allows whitespace from header elements. #6

Open taylorhadden opened 1 year ago

taylorhadden commented 1 year ago

Given this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="Content-Style-Type" content="text/css">
<title></title>
<meta name="Generator" content="Cocoa HTML Writer">
<meta name="CocoaVersion" content="2113.5">
<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica}
p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica; min-height: 14.0px}
</style>
</head>
<body>
<p class="p1"><b>This is a test</b></p>
<p class="p1">And this is something else</p>
<p class="p2"><br></p>
<p class="p1">And something else still.</p>
</body>
</html>

Raw Turndown reads the text characters between the meta, title, and style nodes and inserts them straight into the output. This is pretty grody, and it means we have to pre-process our input to extract the body before letting Turndown do its thing. It looks like the head and body tags are stripped entirely once processed by Turndown. Not great.
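For reference, the pre-processing step could look something like the sketch below. `extractBodyHtml` is a hypothetical helper name, not part of Turndown or Tangent, and the regex approach assumes a single, explicitly closed `<body>` element with no `>` inside its attributes. A real DOM parser would be more robust.

```typescript
// Hypothetical helper (not part of Turndown or Tangent).
// Returns the inner HTML of the first <body>...</body> pair, trimmed.
// Falls back to the unchanged input when no such pair is found, so
// HTML fragments without a <body> pass through untouched.
function extractBodyHtml(html: string): string {
  // [\s\S]*? matches across newlines, non-greedily, up to the first </body>.
  const match = /<body\b[^>]*>([\s\S]*?)<\/body\s*>/i.exec(html);
  return match?.[1]?.trim() ?? html;
}
```

The result can then be handed to Turndown as usual, e.g. `new TurndownService().turndown(extractBodyHtml(html))`, so nothing from the head ever reaches the converter.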

Forward progress on Turndown looks like it has stalled: no new updates for 8 months, and issues are piling up.