seanjensengrey / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

Bad xml format in html output from Web API #33

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
• What steps will reproduce the problem?
Request the html or htmlFragment output for any page.

• What is the expected output? What do you see instead?
The output has an XML declaration, but instead of a valid HTML/XML structure 
there are extra tags before <html> that break the XML:

<?xml version="1.0" encoding="utf-8" ?>
<meta …/>
<base … />
<html>
  <body>
    ...
  </body>
</html>

Also, inside <html> the style element comes directly after the opening <html> tag and not inside a <head>.

The correct output would be:

<?xml version="1.0" encoding="utf-8" ?>
<html>
  <head>
    <meta …/>
    <base … />
    <style>...</style>
  </head>
  <body>
    ...
  </body>
</html>
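
The difference between the two shapes above can be checked with any XML parser, since a well-formed XML document may have only one root element. Below is a minimal sketch using Python's standard library. The attributes elided in the report ("…") are simply left out, the style and body contents are placeholders, and the helper name is_well_formed is made up for illustration; none of this is code from boilerpipe or the Web API.

```python
import xml.etree.ElementTree as ET

# Simplified stand-ins for the two shapes shown in the report.
# The attributes elided there ("…") are left out entirely, and the
# body text is a placeholder.
REPORTED = """<?xml version="1.0" encoding="utf-8" ?>
<meta/>
<base/>
<html>
  <body>text</body>
</html>"""

EXPECTED = """<?xml version="1.0" encoding="utf-8" ?>
<html>
  <head>
    <meta/>
    <base/>
    <style></style>
  </head>
  <body>text</body>
</html>"""


def is_well_formed(document: str) -> bool:
    """Return True if the string parses as a single-rooted XML document."""
    try:
        # Parse from bytes so the encoding declaration is honoured as written.
        ET.fromstring(document.encode("utf-8"))
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print("reported shape well-formed:", is_well_formed(REPORTED))
    print("expected shape well-formed:", is_well_formed(EXPECTED))
```

The reported shape is rejected because <base/> and <html> follow the first top-level element, while the expected shape parses because everything sits under a single <html> root.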

• What version of the product are you using? On what operating system?

The Web API http://boilerpipe-web.appspot.com/extract

And thanks for this great *GREAT* tool!!!

--
François

Original issue reported on code.google.com by francois...@gmail.com on 3 Dec 2011 at 4:13

GoogleCodeExporter commented 9 years ago
Hi François,

thanks for pointing this out.

The addition of meta and base was a deliberate decision (it was just easier to 
prepend them to the highlighted HTML). Nevertheless, it is worth fixing.
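
One possible shape for such a fix, sketched here purely as an illustration and not as boilerpipe's actual code: rather than prepending the extra tags to the document, insert them in a <head> directly after the opening <html> tag. The function name insert_head and its parameters are hypothetical, and the sketch assumes the highlighted HTML has no <head> of its own.

```python
import re


def insert_head(highlighted_html: str, head_markup: str) -> str:
    """Place head_markup inside a <head> right after the opening <html> tag.

    Hypothetical helper for illustration only; assumes the input has an
    opening <html> tag and no existing <head> element.
    """
    match = re.search(r"<html\b[^>]*>", highlighted_html, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("no opening <html> tag found")
    head = "<head>" + head_markup + "</head>"
    # Splice the new <head> in immediately after the opening <html> tag.
    return highlighted_html[:match.end()] + head + highlighted_html[match.end():]


if __name__ == "__main__":
    # Attribute-free placeholders stand in for the real meta and base tags.
    print(insert_head("<html><body>text</body></html>", "<meta/><base/>"))
```

With this approach the XML declaration can still be prepended, but the document keeps a single root element.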

Cheers,
Christian

Original comment by ckkohl79 on 22 Jan 2012 at 10:57