What steps will reproduce the problem?
1. Create a Lucene doc with an apostrophe ( ' ) in the data of a TermVector'd
field (e.g. "HTML4 doesn't like this &this too")
2. Export it to XML:
// fsDir, indexPath, tmpfile and docid are assumed to be defined by the caller.
IndexReader reader = IndexReader.open(fsDir, false);
XMLExporter exporter = new XMLExporter(reader, indexPath);
File xmlout = new File(tmpfile);
OutputStream os = new FileOutputStream(xmlout);
// Restrict the export to the range starting at docid.
Ranges ranges = new Ranges();
int start = docid;
int end = start + 1;
ranges.set(start, end);
exporter.export(os, false, true, "index", ranges);
os.close();
3. Open the exported file in an XML browser that follows the HTML4 strict spec (try IE)
What is the expected output? What do you see instead?
It should open and display as parsed XML. Instead, the browser reports an invalid-XML error.
What version of the product are you using? On what operating system?
Luke 1.0.1 on Windows 7.
Please provide any additional information below.
Andrzej fixed the majority of this problem in Luke 0.9.9 (when inside field
data), but there is still a small fix remaining in org.getopt.luke.XMLExporter,
to not escape element attribute values (patch attached).
This patch also provides a minor correction to Util.xmlEscape().
The &apos; entity isn't a valid part of the HTML4 strict spec. So the XML
escapes should generate output that is valid and can be rendered by any XML
interpreter. Some of the browser-based XML viewers choke on &apos; when it
is inside element attributes. Emitting the numeric reference &#39; instead
takes care of it.
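
A minimal sketch of the escaping described above, for illustration only. The class name XmlEscapeSketch is hypothetical, and this is not the attached patch or Luke's actual Util.xmlEscape(). It assumes the intended behaviour is to write the apostrophe as the numeric reference &#39; rather than the named entity &apos;:

```java
final class XmlEscapeSketch {
    private XmlEscapeSketch() {}

    /** Escapes the five XML special characters; null becomes an empty string. */
    static String xmlEscape(String in) {
        if (in == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder(in.length() + 16);
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                // Numeric reference instead of &apos;, which HTML4 does not define.
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

With this sketch, XmlEscapeSketch.xmlEscape("HTML4 doesn't like this &this too") returns "HTML4 doesn&#39;t like this &amp;this too". The same helper could be applied to element attribute values as well as field data.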
Original issue reported on code.google.com by Craig.St...@gmail.com on 16 Apr 2011 at 3:56