thephpleague / html-to-markdown

Convert HTML to Markdown with PHP
MIT License

Beginning HTML tags in page not handled or stripped #253

Closed. samveen closed this issue 1 month ago

samveen commented 1 month ago

Version(s) affected

5.1.0

Description

Package version: php-league-html-to-markdown 5.1.0-1, installed from Debian bookworm/main on arm64.

I'm trying to convert the following example HTML (test.html):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta name="generator" content="pandoc" />
  <title></title>
</head>
<body>
<ul>
<li>item 1</li>
</ul>
</body>
</html>

I'm running the included html-to-markdown script and getting the following output:

samveen@zero:/tmp$ html-to-markdown test.html 
<html xmlns="http://www.w3.org/1999/xhtml"><head> <meta content="text/html; charset=utf-8" http-equiv="Content-Type"></meta> <meta content="text/css" http-equiv="Content-Style-Type"></meta> <meta content="pandoc" name="generator"></meta> <title></title></head><body>- item 1

How to reproduce

The same example, set up to run via piped input:

cat <<"__ENDL" | html-to-markdown
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta name="generator" content="pandoc" />
  <title></title>
</head>
<body>
<ul>
<li>item 1</li>
</ul>
</body>
</html>
__ENDL

samveen commented 1 month ago

I did not read the documentation correctly. As listed on the project page:

By default, HTML To Markdown preserves HTML tags without Markdown equivalents, like <span> and <div>.

That includes the <html>, <head>, and <body> tags seen in the output above.
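
For anyone landing here with the same output, the project README documents two converter options that should avoid this when calling the library from PHP: `strip_tags` (drop tags that have no Markdown equivalent but keep their text) and `remove_nodes` (drop the named elements together with their contents). The snippet below is a minimal sketch, not a maintainer-provided fix. It assumes a Composer install with `vendor/autoload.php` next to the script (the autoloader path for the Debian package will differ), and nothing in this thread confirms that the bundled `html-to-markdown` command exposes these options.

```php
<?php
// Assumption: league/html-to-markdown was installed with Composer,
// so the autoloader lives in ./vendor relative to this script.
require __DIR__ . '/vendor/autoload.php';

use League\HTMLToMarkdown\HtmlConverter;

// Same document as test.html in the report above.
$html = <<<'HTML'
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta name="generator" content="pandoc" />
  <title></title>
</head>
<body>
<ul>
<li>item 1</li>
</ul>
</body>
</html>
HTML;

$converter = new HtmlConverter([
    // Drop tags without a Markdown equivalent, keeping their text content.
    'strip_tags'   => true,
    // Drop <head> entirely, including its <meta> and <title> children.
    'remove_nodes' => 'head',
]);

$markdown = $converter->convert($html);
echo $markdown, PHP_EOL;
```

With both options set, the wrapper tags should no longer appear in the result and only the converted list should remain. `remove_nodes` takes a space-separated list, so further elements can be added to the same string.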