jgm / pandoc-citeproc

Library and executable for using citeproc with pandoc

Looks like XML parse error, but based on location of CSL file #81

Closed: magthe closed this issue 10 years ago

magthe commented 10 years ago

After upgrading to pandoc-citeproc 0.5 I observed a very strange issue. Processing several of my files failed with an error like this:

pandoc -t latex --filter pandoc-citeproc --template template.latex --csl=style.csl -o lfs_system_utp.pdf lfs_system_utp_t.mkd
pandoc-citeproc: error while parsing the XML string
pandoc: Error running filter pandoc-citeproc

I simply stopped using the (slightly custom) CSL file I want to use and instead fell back on the default one that comes with pandoc-citeproc. That worked, and was all right for the moment.

After a few days I saw a message on a Haskell-related mailing list for the Arch Linux distro regarding this. That mail described a work-around: just replace the default CSL file with the one you want to use. Indeed, that works:

cp style.csl /usr/share/x86_64-linux-ghc-7.8.3/pandoc-citeproc-0.5/chicago-author-date.csl
pandoc -t latex --filter pandoc-citeproc --template template.latex -o lfs_system_utp.pdf lfs_system_utp_t.mkd

Clearly there is something going on here that is really surprising to a mere user.

jgm commented 10 years ago

Can you share your custom CSL file, so I can try to reproduce the problem?

magthe commented 10 years ago

It's here: https://gist.github.com/magthe/4c45ed79f245f6712755

Just to be clear though, copying the default CSL to the local directory, and then using the --csl argument to pandoc also results in the error message from above. So I'd be very surprised if it really is an XML parsing problem.
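
One way to test that suspicion independently of pandoc-citeproc is to run the style through a different XML parser. The sketch below uses Python's standard xml.etree.ElementTree. The inline MINIMAL_CSL bytes are a hypothetical stand-in for a real style file, and is_well_formed is an illustrative name, not anything from pandoc-citeproc. If a file passes a check like this and still triggers the error, the fault more plausibly lies in the parser backend than in the file.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the contents of a CSL file; a real check would
# read the bytes of style.csl from disk instead.
MINIMAL_CSL = b"""<?xml version="1.0" encoding="utf-8"?>
<style xmlns="http://purl.org/net/xbiblio/csl" class="in-text" version="1.0">
  <info>
    <title>Example style</title>
    <id>http://example.org/example-style</id>
  </info>
  <citation>
    <layout>
      <text variable="title"/>
    </layout>
  </citation>
</style>
"""


def is_well_formed(xml_bytes: bytes) -> bool:
    """Return True if xml_bytes parses as XML, False otherwise.

    This only checks well-formedness; it does not validate against the
    CSL schema.
    """
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(MINIMAL_CSL))
print(is_well_formed(b"<style><info></style>"))  # mismatched tags
```

To check a real file, pass its bytes, for example is_well_formed(open("style.csl", "rb").read()).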

jgm commented 10 years ago

Oh, thanks. That's a good clue.

jgm commented 10 years ago

I can't reproduce this. Did you install using cabal, or in some other way? If via cabal, can you send the output of ghc-pkg list?

magthe commented 10 years ago

I install using the distro package manager. Since I also maintain the packages involved I know that the output of ghc-pkg list reflects the build environment used.

Pandoc and pandoc-citeproc are built with the following flags:

pandoc  1.13.0.1-3 (-make-pandoc-man-pages https -trypandoc -embed_data_files)
pandoc-citeproc  0.4.0.1-3 (-test_citeproc -unicode_collation -embed_data_files -hexpat bibutils small_base)

This is the output of ghc-pkg list after installing pandoc-citeproc on a clean system:

/usr/lib/ghc-7.8.3/package.conf.d:
    Cabal-1.18.1.3
    HTTP-4000.2.18
    JuicyPixels-3.1.7.1
    SHA-1.6.4.1
    aeson-0.7.0.6
    aeson-pretty-0.7.1
    array-0.5.0.0
    asn1-encoding-0.8.1.3
    asn1-parse-0.8.1
    asn1-types-0.2.3
    attoparsec-0.11.3.4
    base-4.7.0.1
    base64-bytestring-1.0.0.1
    bin-package-db-0.0.0.0
    binary-0.7.1.0
    blaze-builder-0.3.3.2
    blaze-html-0.7.0.2
    blaze-markup-0.6.1.0
    rts-1.0
    byteable-0.1.1
    bytestring-0.10.4.0
    case-insensitive-1.2.0.0
    cereal-0.4.0.1
    cipher-aes-0.2.8
    cipher-des-0.0.6
    cipher-rc4-0.1.4
    cmdargs-0.10.9
    conduit-1.2.0.2
    connection-0.2.3
    containers-0.5.5.1
    cookie-0.4.1.3
    cprng-aes-0.5.2
    crypto-cipher-types-0.0.9
    crypto-numbers-0.2.3
    crypto-pubkey-0.2.4
    crypto-pubkey-types-0.4.2.2
    crypto-random-0.0.8
    cryptohash-0.11.6
    data-default-0.5.3
    data-default-class-0.0.1
    data-default-instances-base-0.0.1
    data-default-instances-containers-0.0.1
    data-default-instances-dlist-0.0.1
    data-default-instances-old-locale-0.0.1
    deepseq-1.3.0.2
    deepseq-generics-0.1.1.1
    digest-0.0.1.2
    directory-1.2.1.0
    dlist-0.7.1
    exceptions-0.6.1
    extensible-exceptions-0.1.1.4
    filepath-1.3.0.2
    (ghc-7.8.3)
    ghc-prim-0.3.1.0
    haddock-library-1.1.1
    hashable-1.2.2.0
    haskeline-0.7.1.2
    (haskell2010-1.1.2.0)
    (haskell98-2.0.0.3)
    highlighting-kate-0.5.9
    hoopl-3.10.0.1
    hpc-0.6.0.1
    hs-bibutils-5.0
    hslua-0.3.13
    http-client-0.3.8.2
    http-client-tls-0.2.2
    http-types-0.8.5
    integer-gmp-0.5.1.0
    lifted-base-0.2.3.0
    mime-types-0.1.0.4
    mmap-0.5.9
    mmorph-1.0.4
    monad-control-0.3.3.0
    mtl-2.1.3.1
    nats-0.2
    network-2.5.0.0
    old-locale-1.0.0.6
    old-time-1.1.0.2
    pandoc-1.13.1
    pandoc-citeproc-0.5
    pandoc-types-1.12.4.1
    parsec-3.1.5
    pem-0.2.2
    pretty-1.1.1.1
    primitive-0.5.3.0
    process-1.2.0.0
    publicsuffixlist-0.1
    random-1.0.1.3
    regex-base-0.93.2
    regex-pcre-builtin-0.94.4.8.8.35
    resourcet-1.1.2.3
    rfc5051-0.1.0.3
    scientific-0.3.3.0
    securemem-0.1.3
    semigroups-0.15.2
    socks-0.5.4
    split-0.2.2
    stm-2.4.3
    streaming-commons-0.1.4.2
    syb-0.4.2
    tagsoup-0.13.2
    template-haskell-2.9.0.0
    temporary-1.2.0.3
    terminfo-0.4.0.0
    texmath-0.8
    text-1.1.1.3
    time-1.4.2
    tls-1.2.9
    transformers-0.3.0.0
    transformers-base-0.4.3
    unix-2.7.0.1
    unordered-containers-0.2.5.0
    utf8-string-0.3.8
    vector-0.10.11.0
    void-0.6.1
    x509-1.4.12
    x509-store-1.4.4
    x509-system-1.4.5
    x509-validation-1.5.0
    xhtml-3000.2.1
    xml-1.3.13
    yaml-0.8.9.1
    zip-archive-0.2.3.4
    zlib-0.5.4.1

jgm commented 10 years ago

The -hexpat stands out as a non-default flag that would be different from my setup. Is there a reason you don't use hexpat? It is much faster. It may be that the non-hexpat configuration is now broken.
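
For readers unfamiliar with the notation: in the flag listings quoted earlier in the thread, a leading minus appears to mark a flag that is switched off, which is why -hexpat would mean the hexpat-based parser was not compiled in. Below is a minimal Python sketch of that reading, assuming the minus-means-disabled convention. split_flags is a hypothetical helper name, not part of any packaging tool.

```python
def split_flags(flag_text: str) -> tuple:
    """Split a flag listing into (enabled, disabled) flag names.

    Assumption: a leading '-' marks a disabled flag, and a bare name
    marks an enabled one.
    """
    enabled, disabled = [], []
    # Drop the surrounding parentheses, then split on whitespace.
    for token in flag_text.strip("() ").split():
        if token.startswith("-"):
            disabled.append(token[1:])
        else:
            enabled.append(token)
    return enabled, disabled


# Flag listing for pandoc-citeproc as quoted in the thread.
CITEPROC_FLAGS = (
    "(-test_citeproc -unicode_collation -embed_data_files "
    "-hexpat bibutils small_base)"
)

enabled, disabled = split_flags(CITEPROC_FLAGS)
print("enabled: ", enabled)
print("disabled:", disabled)
```

Under that assumption, hexpat lands in the disabled list, matching the observation that this build differs from the default configuration.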

magthe commented 10 years ago

Well, hexpat isn't in our repo, and since the dependencies can be satisfied without it, that's what happens. Anyway, I modified the flag and pulled in hexpat, and now it works fine. So indeed, it seems the non-hexpat XML parsing is broken.

jgm commented 10 years ago

I've just replaced the old xml-light and hexpat based CSL parsers with a new, xml-conduit-based one (pure Haskell). It is about twice as fast as the old hexpat based parser in my tests, and will be much easier to maintain and extend. This should solve this issue once it is released.

nylki commented 9 years ago

I still have this issue with pandoc-citeproc 0.5 on Fedora 22. Is there a fix for this situation? I suppose I'd have to build pandoc-citeproc myself to get the most recent version, or wait until Fedora puts it into their repository?

The workaround to replace the default .csl works, but it's obviously not a very practical solution.

ousia commented 9 years ago

@nylki, there is a copr repository with pandoc statically linked from Jens Petersen (https://copr.fedoraproject.org/coprs/petersen/pandoc/).

I have just asked him whether he could add the latest version of pandoc-citeproc.

nylki commented 9 years ago

@ousia Thanks! Have you got a response from Jens Petersen?

ousia commented 9 years ago

@nylki, you have a subpackage at https://copr.fedoraproject.org/coprs/petersen/pandoc/ (only for Fedora 22 or newer).