google / gumbo-parser

An HTML5 parsing library in pure C99
Apache License 2.0
5.16k stars 660 forks

How to correctly handle utf8 BOM encoded html text? #414

Closed perfgao closed 1 year ago

perfgao commented 5 years ago

When I use gumbo-parser to process HTML text encoded as UTF-8 with a BOM, the HTML generated after parsing comes out in the wrong order.

The contents of the HTML file are:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3c.org/TR/1999/REC-html401-19991224/loose.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <meta http-equiv="Cache-Control" content="no-transform" />
    <title>test</title>
    <link rel="shortcut icon" href="favicon.ico" />
    <script src="/js/jquery-1.4.2.min.js"></script>
    <script src="/js/url.js"></script>
  </head>
  <body>
    <div id="main">
      <div id="nav_top">
        <div id="nav_top_frame">
          <a href="/guide.html" target="_blank" title="test" class="red f12"><b>test</b></a>
          <a href='/help.html' target='_blank' title='help' class='gray f12'>help</a>
        </div>
      </div>
    </div>
  </body>
</html>

Save the file with the encoding "UTF-8 with BOM".

When I run examples/serialize.cc on it:

$ ./serialize test.html

I get the following output, where the contents of <head> have been moved into <body>:

<html xmlns="http://www.w3.org/1999/xhtml">
<head></head><body>
 

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    <meta http-equiv="Cache-Control" content="no-transform"/>
    <title>test</title>
    <link rel="shortcut icon" href="favicon.ico"/>
    <script src="/js/jquery-1.4.2.min.js"></script>
    <script src="/js/url.js"></script>

    <div id="main">
      <div id="nav_top">
        <div id="nav_top_frame">
          <a href="/guide.html" target="_blank" title="test" class="red f12"><b>test</b></a>
          <a href='/help.html' target='_blank' title='help' class='gray f12'>help</a>
        </div>
      </div>
    </div>
</body>
</html>
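
A possible workaround, sketched here on the assumption that the three leading BOM bytes (EF BB BF) are what the parser treats as text before the doctype: skip them before handing the buffer to gumbo. The helper name `utf8_bom_length` is hypothetical and not part of gumbo; the only gumbo call referenced is `gumbo_parse_with_options`, and it appears only in a comment because this sketch has not been run against the library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of gumbo): returns how many leading bytes
 * to skip. The result is 3 when the buffer starts with the UTF-8 BOM
 * (EF BB BF), and 0 otherwise. */
static size_t utf8_bom_length(const char *buf, size_t len) {
    static const unsigned char bom[3] = {0xEF, 0xBB, 0xBF};
    if (len >= 3 && memcmp(buf, bom, sizeof bom) == 0) {
        return sizeof bom;
    }
    return 0;
}

/* Intended use with gumbo (assumption, untested here):
 *
 *   size_t skip = utf8_bom_length(contents, length);
 *   GumboOutput *output = gumbo_parse_with_options(
 *       &kGumboDefaultOptions, contents + skip, length - skip);
 */
```

If this assumption holds, parsing the file from the fourth byte onward should behave like parsing the same file saved without a BOM.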