k-bx / boilerpipe

Automatically exported from code.google.com/p/boilerpipe

Faulty XML encoding of characters in <script> tags in <head> #60

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Process an HTML document whose <head> contains <script> tags with 
characters such as ampersands.

What is the expected output? What do you see instead?

I would expect the scripts to survive verbatim. Instead, characters such as 
ampersands inside the scripts come out XML-encoded.

What version of the product are you using? On what operating system?

1.2.0

Please provide any additional information below.

As a workaround, I changed the following in HTMLHighlighter.java (the original 
call is kept as a comment):

//html.append(xmlEncode(String.valueOf(ch, start, length)));
html.append(String.valueOf(ch, start, length));
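
The workaround above turns off encoding for all character data, so ordinary text containing & or < would also be emitted unescaped. A narrower variant might skip encoding only while inside <script> or <style>, where HTML does not decode character references. The following is a minimal sketch of that idea, not boilerpipe's actual code: the class ScriptSafeEncoder, its methods, and this xmlEncode are hypothetical stand-ins that only mimic the shape of SAX-style callbacks.

```java
import java.util.Locale;

// Hypothetical sketch: encode text normally, but pass <script>/<style> content through verbatim.
final class ScriptSafeEncoder {
    private final StringBuilder html = new StringBuilder();
    // Greater than zero while the parser is inside a <script> or <style> element.
    private int rawTextDepth = 0;

    // Stand-in for the highlighter's xmlEncode helper (assumed behaviour, not the real one).
    static String xmlEncode(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    static boolean isRawTextElement(String name) {
        String n = name.toLowerCase(Locale.ROOT);
        return n.equals("script") || n.equals("style");
    }

    void startElement(String name) {
        html.append('<').append(name).append('>');
        if (isRawTextElement(name)) {
            rawTextDepth++;
        }
    }

    void endElement(String name) {
        if (isRawTextElement(name) && rawTextDepth > 0) {
            rawTextDepth--;
        }
        html.append("</").append(name).append('>');
    }

    // Same signature shape as the SAX characters callback.
    void characters(char[] ch, int start, int length) {
        String text = String.valueOf(ch, start, length);
        // Only skip encoding inside raw-text elements; everything else stays escaped.
        html.append(rawTextDepth > 0 ? text : xmlEncode(text));
    }

    String result() {
        return html.toString();
    }

    public static void main(String[] args) {
        ScriptSafeEncoder enc = new ScriptSafeEncoder();
        char[] js = "if (a && b) go();".toCharArray();
        enc.startElement("script");
        enc.characters(js, 0, js.length);
        enc.endElement("script");
        char[] text = "Q&A".toCharArray();
        enc.startElement("p");
        enc.characters(text, 0, text.length);
        enc.endElement("p");
        // prints: <script>if (a && b) go();</script><p>Q&amp;A</p>
        System.out.println(enc.result());
    }
}
```

With this approach the script body is left untouched while the ampersand in the paragraph is still escaped. Whether the real highlighter exposes the current element name at the point where it appends character data is an assumption that would need checking against the source.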

NOTE: I also changed the TAG_ACTIONS map to be empty, since our goal is to get 
a copy of the original HTML document that is as verbatim as possible, with just 
small markers (class) on the marked elements. Short of emptying that map, I 
could not figure out how to get the original <head> out of the document.

Original issue reported on code.google.com by tapa...@gmail.com on 14 Jan 2013 at 1:20