Closed hcf-n closed 3 years ago
Weird indeed... https://odfvalidator.org says the file is conformant as well...
Source seems to be this line... so looks like toArchiveOrFail
from Codec.Archive.Zip package doesn't like the zip file... how was it created? using what program?
Actually it comes from line 87. (I tried changing the earlier line to give more informative output, and this showed me.)
I've added some more fine-grained error reporting to the reader, and now I can see it's failing on
runConverter' read_body startState contentElem
But unfortunately it's not easy to get more information than that without more radical changes. (The author of this module included a Failure type but currently it's just ()!)
I do note something odd about the content.xml file in the container, the line:
<?xtpipes file='oo-text.4xt'?>
When I remove this, pandoc can convert the file. I'm not sure why this is there??
Maybe @MarLinn can help with this.
The odt file is produced by make4ht. I guess that explains why <?xtpipes file='oo-text.4xt'?>
is there.
Aha, that makes sense. Our parser shouldn't fall apart because of that one line. But the odt reader is foreign territory for me; I'm hoping @MarLinn can see a way to fix this easily.
Thanks for pointer to the <?xtpipes ?>
instructions. It is internal instruction used by TeX4ht for some XML post-processing. It is unnecessary in the resulting ODT files, so make4ht
will remove it from the final XML.
The ODT reader is based on the Text.XML.Light library. Which, from a few quick tests, seems to misunderstand processing instructions like this one. So in turn the reader does as well. It never occurred to me to test this before because I naïvely assumed this to be basic functionality for an XML library. The easiest fix would probably be a pull request against Text.XML.Light. Or, seeing how ODT files rarely contain processing instructions, we could wait and see if the summer-of-code xml library idea turns into something good.
@MarLinn, I tried running xml-light on this input, and it parsed the processing instruction as a regular element whose name begins with ?
; it wasn't clear why the ODT parser would get confused by this parsing, since it could just be skipped. But maybe you saw something different?
Elem (Element {elName = QName {qName = "?xtpipes", qURI = Nothing, qPrefix = Nothing}, elAttribs = [Attr {attrKey = QName {qName = "file", qURI = Nothing, qPrefix = Nothing}, attrVal = "oo-text.4xt"}], elContent = [], elLine = Just 2})
Of course, if there is a problem, we could strip out processing instructions before passing the input to xml-light. But I'd like to be convinced first that xml-light is really the problem in this case.
Here's a slightly larger example:
<div>A<?pi?>B</div>
gets parsed as (simplified)
Element
{ elName = QName "div"
, elContent =
[ Text "A"
, Element
{ elName = QName "?pi?"
, elContent = [ Text "B"]
}
]
}
Notice where B
ends up. It's not on the same level as A
any more. Instead xml-light interprets the processing instruction as a starting tag with a missing end tag. So all the content after such an instruction up until the next closing tag is interpreted as a child of the processing instruction "element" instead of as a sibling. And that's the structure that gets delivered to the ODT extractor. Now the reader could identify those spurious elements and flatten them away in some preprocessing step or similar, but that's fixing the wrong problem.
That does look like a serious issue (and should be reported or fixed in xml-light). But can it be the issue that we're facing here? My test above suggests that, in this case at least, the element is being closed before the rest of the content in the content.xml we have here. Have you tried running parseXML directly on the contents of content.xml in this mwe?
I tried it now and it seems the issue gets even weirder when the processing instruction is at the top level like in this case.
xml-light has two parse functions, parseXML and parseXMLDoc. When I parse the mwe with parseXML, the list of content chunks looks correct with no bad nesting. Somehow the processing instruction is handled correctly in this case. But the function we actually want, parseXMLDoc, takes the output of parseXML and throws away everything except for what it thinks is the first element. In this case the "first element" is the processing instruction, that's why it throws most of the document away and returns the result @jgm got.
So there's at least two bugs in the library.
We may be able to fix the exact one from this issue by rolling our own modified parseXMLDoc, but that doesn't fix the second issue if processing instructions occur deeper in the file.
I'm going to try to create a quick little bridge module so we can use xml-conduit's parser to parse to an xml-light type. That's an intermediate solution; if it works, we could gradually transition to xml-conduit, which seems well tested and performant. (We already transitively depend on xml-conduit via citeproc.)
@MarLinn I've pushed some code to the xmlbridge
branch if you want to take a look. With this change, the odt reader tests all fail -- I think it's because of changes in the URI (namespace) part of qualified names, but I haven't had a chance to investigate yet.
@MarLinn I see what is happening; my new parseXMLDoc
based on xml-conduit differs in one important way from the old one. The old one parses all the xmlns:
prefixed attributes on the root element as attributes. The new one leaves these off, since the information is used in interpreting the prefixes while the document is being parsed. Your code seems to presuppose that these attributes will be there -- I wonder if there's a simple way to change that; then I think this will work.
To clarify, attributes like this are present on the root note with xml-light's parser, but not with xml-conduit's:
Attr {attrKey = QName {qName = "style", qURI = Nothing, qPrefix = Just "xmlns"}, attrVal = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"},
Ah, good old namespaces. Odt has a lot of them, so they are deeply integrated into the reader, no element or attribute converter is defined without one. To facilitate that, the reader has its own namespace management. Another reason is that xml-light doesn't have one. The upside is that there's only a small hand full of places deep in the bowels where xml namespaces and reader managed namespaces are tied together and that would have to be changed.
What I'm not sure about yet is if xml-conduit keeps enough information to be integrated there. If it does, this should be doable.
Ah never mind, I've figured out a way to restore compatibility!
Got it working now, and in addition to parsing the mwe.odt correctly, we now get a big performance boost. I'm going to change some of the other parsers to use this first before merging with master branch.
I'm trying to parse a simple odt file but Pandoc tells me
Couldn't parse odt file
.Is there a way to get more informative output from the pandoc reader? I have trouble figurering out why the parser fails.
I have attached failing file. Please rename to mwe.odt
mwe.zip