Can´t parse odt file - Githubissues

hcf-n commented 3 years ago

I'm trying to parse a simple odt file but Pandoc tells me Couldn't parse odt file.

Is there a way to get more informative output from the pandoc reader? I have trouble figurering out why the parser fails.

I have attached failing file. Please rename to mwe.odt

mwe.zip

mb21 commented 3 years ago

Weird indeed... https://odfvalidator.org says the file is conformant as well...

mb21 commented 3 years ago

Source seems to be this line... so looks like toArchiveOrFail from Codec.Archive.Zip package doesn't like the zip file... how was it created? using what program?

jgm commented 3 years ago

Actually it comes from line 87. (I tried changing the earlier line to give more informative output, and this showed me.)

jgm commented 3 years ago

I've added some more fine-grained error reporting to the reader, and now I can see it's failing on

runConverter' read_body startState contentElem

But unfortunately it's not easy to get more information than that without more radical changes. (The author of this module included a Failure type but currently it's just ()!)

I do note something odd about the content.xml file in the container, the line:

<?xtpipes  file='oo-text.4xt'?>

When I remove this, pandoc can convert the file. I'm not sure why this is there??

jgm commented 3 years ago

Maybe @MarLinn can help with this.

hcf-n commented 3 years ago

The odt file is produced by make4ht. I guess that explains why <?xtpipes file='oo-text.4xt'?>is there.

jgm commented 3 years ago

Aha, that makes sense. Our parser shouldn't fall apart because of that one line. But the odt reader is foreign territory for me; I'm hoping @MarLinn can see a way to fix this easily.

michal-h21 commented 3 years ago

Thanks for pointer to the <?xtpipes ?> instructions. It is internal instruction used by TeX4ht for some XML post-processing. It is unnecessary in the resulting ODT files, so make4ht will remove it from the final XML.

MarLinn commented 3 years ago

The ODT reader is based on the Text.XML.Light library. Which, from a few quick tests, seems to misunderstand processing instructions like this one. So in turn the reader does as well. It never occurred to me to test this before because I naïvely assumed this to be basic functionality for an XML library. The easiest fix would probably be a pull request against Text.XML.Light. Or, seeing how ODT files rarely contain processing instructions, we could wait and see if the summer-of-code xml library idea turns into something good.

jgm commented 3 years ago

@MarLinn, I tried running xml-light on this input, and it parsed the processing instruction as a regular element whose name begins with ?; it wasn't clear why the ODT parser would get confused by this parsing, since it could just be skipped. But maybe you saw something different?

Elem (Element {elName = QName {qName = "?xtpipes", qURI = Nothing, qPrefix = Nothing}, elAttribs = [Attr {attrKey = QName {qName = "file", qURI = Nothing, qPrefix = Nothing}, attrVal = "oo-text.4xt"}], elContent = [], elLine = Just 2})

jgm commented 3 years ago

Of course, if there is a problem, we could strip out processing instructions before passing the input to xml-light. But I'd like to be convinced first that xml-light is really the problem in this case.

MarLinn commented 3 years ago

Here's a slightly larger example:

<div>A<?pi?>B</div>

gets parsed as (simplified)

Element
    { elName = QName "div"
    , elContent =
        [ Text "A"
        , Element
            { elName    = QName "?pi?"
            , elContent = [ Text "B"]
            }
        ]
    }

Notice where B ends up. It's not on the same level as A any more. Instead xml-light interprets the processing instruction as a starting tag with a missing end tag. So all the content after such an instruction up until the next closing tag is interpreted as a child of the processing instruction "element" instead of as a sibling. And that's the structure that gets delivered to the ODT extractor. Now the reader could identify those spurious elements and flatten them away in some preprocessing step or similar, but that's fixing the wrong problem.

jgm commented 3 years ago

That does look like a serious issue (and should be reported or fixed in xml-light). But can it be the issue that we're facing here? My test above suggests that, in this case at least, the element is being closed before the rest of the content in the content.xml we have here. Have you tried running parseXML directly on the contents of content.xml in this mwe?

MarLinn commented 3 years ago

I tried it now and it seems the issue gets even weirder when the processing instruction is at the top level like in this case.

xml-light has two parse functions, parseXML and parseXMLDoc. When I parse the mwe with parseXML, the list of content chunks looks correct with no bad nesting. Somehow the processing instruction is handled correctly in this case. But the function we actually want, parseXMLDoc, takes the output of parseXML and throws away everything except for what it thinks is the first element. In this case the "first element" is the processing instruction, that's why it throws most of the document away and returns the result @jgm got.

So there's at least two bugs in the library.

We may be able to fix the exact one from this issue by rolling our own modified parseXMLDoc, but that doesn't fix the second issue if processing instructions occur deeper in the file.

jgm commented 3 years ago

I'm going to try to create a quick little bridge module so we can use xml-conduit's parser to parse to an xml-light type. That's an intermediate solution; if it works, we could gradually transition to xml-conduit, which seems well tested and performant. (We already transitively depend on xml-conduit via citeproc.)

jgm commented 3 years ago

@MarLinn I've pushed some code to the xmlbridge branch if you want to take a look. With this change, the odt reader tests all fail -- I think it's because of changes in the URI (namespace) part of qualified names, but I haven't had a chance to investigate yet.

jgm commented 3 years ago

@MarLinn I see what is happening; my new parseXMLDoc based on xml-conduit differs in one important way from the old one. The old one parses all the xmlns: prefixed attributes on the root element as attributes. The new one leaves these off, since the information is used in interpreting the prefixes while the document is being parsed. Your code seems to presuppose that these attributes will be there -- I wonder if there's a simple way to change that; then I think this will work.

To clarify, attributes like this are present on the root note with xml-light's parser, but not with xml-conduit's:

Attr {attrKey = QName {qName = "style", qURI = Nothing, qPrefix = Just "xmlns"}, attrVal = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"},

MarLinn commented 3 years ago

Ah, good old namespaces. Odt has a lot of them, so they are deeply integrated into the reader, no element or attribute converter is defined without one. To facilitate that, the reader has its own namespace management. Another reason is that xml-light doesn't have one. The upside is that there's only a small hand full of places deep in the bowels where xml namespaces and reader managed namespaces are tied together and that would have to be changed.

What I'm not sure about yet is if xml-conduit keeps enough information to be integrated there. If it does, this should be doable.

jgm commented 3 years ago

Ah never mind, I've figured out a way to restore compatibility!

jgm commented 3 years ago

Got it working now, and in addition to parsing the mwe.odt correctly, we now get a big performance boost. I'm going to change some of the other parsers to use this first before merging with master branch.

jgm / pandoc

Can´t parse odt file #7091