openpreserve / jhove

File validation and characterisation.
http://jhove.openpreservation.org

Broken XML-Result #880

Open trebunski opened 1 year ago

trebunski commented 1 year ago

Hello dear developers, I have found a PDF that JHOVE processes incorrectly. The generated result XML contains characters that are not valid in an XML document, so the XML can no longer be used by downstream systems.

Tested version: release="1.26.1" date="2022-07-14"

Example PDF: https://epflicht.ulb.uni-bonn.de/download/pdf/363239?originalFilename=true

Reproduction command: /bin/sh jhove -c conf/jhove.conf -h XML -m PDF-hul Energy_Environment_390.pdf -o Energy_Environment_390-out.xml

Result:


<?xml version="1.0" encoding="UTF-8"?>
<jhove xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schema.openpreservation.org/ois/xml/ns/jhove" xsi:schemaLocation="http://schema.openpreservation.org/ois/xml/ns/jhove https://schema.openpreservation.org/ois/xml/xsd/jhove/1.8/jhove.xsd" name="Jhove" release="1.26.1" date="2022-07-14">
…
<property>
         <name>Item</name>
         <values arity="List" type="Property">
         <property>
          <name>Title</name>
          <values arity="Scalar" type="String">
           <value>Figure 4.  13C- and 15N-CPMAS-NMR spectra of the different organic materials (cyanobacterium, clover, watermilfoil, peat 洀漀猀猀 甀猀攀搀 椀渀 琀栀攀 攀砀瀀攀爀椀洀攀渀琀⸩㸾൥湤潢樍㜴㠶‰扪഼㰯䍛〮〠〮〠〮そ⽃潵湴‰⽄敳瑛㐲㌹‰⁒⽘奚‱㔵⸰′㌶⸰″㠰㔮そ⽆‰⽎數琠㜴㠵‰⁒⽐慲敮琠㜴㈰‰⁒⽐牥瘠㜴㠷‰⁒⽔楴汥⣾￾＀䘀椀最甀爀攀 ㏾￾＀⻾￾@˾＀ 䴀攀愀渀 一䠀㈀伀䠀ⴀ琀漀ⴀ一㈀伀 挀漀渀瘀攀爀猀椀漀渀 爀愀琀椀漀猀 刀一䠀㈀伀䠀ⴀ琀漀ⴀ一㈀伀 椀渀 愀爀琀椀昀椀挀椀愀氀 猀漀椀氀猀 愀琀 搀椀昀昀攀爀攀渀琀 瀀䠀 愀渀搀 䴀渀伀㈀ 挀漀渀琀攀渀琀Ⰰ 愀渀 for organic matter of different origins at a fixed content of 2.</value>
          </values>
         </property>
         <property>
…
</jhove>


Parsing this result XML then fails with the following error:
error on line 1887 at column 244: Char 0xFFFE out of allowed range
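
For reference, the error refers to the XML 1.0 `Char` production, which allows U+0009, U+000A, U+000D, U+0020 to U+D7FF, U+E000 to U+FFFD and U+10000 to U+10FFFF, so U+FFFE is excluded. A minimal Python sketch for locating such characters in an extracted value could look like this. The function names are illustrative only and are not part of JHOVE, and the sample string is a shortened stand-in for the real Title value.

```python
from typing import List, Tuple


def is_xml10_char(cp: int) -> bool:
    """Return True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_invalid_xml_chars(text: str) -> List[Tuple[int, str]]:
    """List (index, code point) for every character XML 1.0 does not allow."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if not is_xml10_char(ord(ch))
    ]


# Shortened, made-up stand-in for the garbled Title value.
sample = "Title(\ufffe\uff00Figure"
print(find_invalid_xml_chars(sample))  # [(6, 'U+FFFE')]
```

Note that U+FF00 passes the check: it is a legal XML character even though it is garbage here, so filtering out illegal characters alone would not restore the original text.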

karenhanson commented 1 year ago

I believe this is the same issue as https://github.com/openpreserve/jhove/issues/877 - I found it too and submitted a pull request that is awaiting review: https://github.com/openpreserve/jhove/pull/878 Great that we have a file to test, as I couldn't share the one that I had. I ran your file through my local version that includes the fix I submitted, and the output no longer has the invalid character.

trebunski commented 1 year ago

I'm not sure, but it seems like there's a lot more broken in the output than these 0xFFFE characters, because the result file contains some hieroglyphs that are not in the source PDF.

Example: the source PDF contains the following figure descriptions:

Figure 4. 13C- and 15N-CPMAS-NMR spectra of the different organic materials (cyanobacterium, clover, watermilfoil, peat moss) used in the experiment.

Figure 3. Mean NH2OH-to-N2O conversion ratios (RNH2OH-to-N2O) in artificial soils at different pH and MnO2 content, and for organic matter of different origins at a fixed content of 2.5% (w/w). The total amount of NH2OH added was 5nmol. Different symbols represent RNH2OH-to-N2O for the artificial soil mixtures with the different organic materials (n=3, SD<5%, not shown).

The result XML (see the excerpt in my opening comment) contains a value that mixes parts of these two strings (WHY?) with hieroglyphs (WHY?) in between:

Figure 4.  13C- and 15N-CPMAS-NMR spectra of the different organic materials (cyanobacterium, clover, watermilfoil, peat 洀漀猀猀 甀猀攀搀 椀渀 琀栀攀 攀砀瀀攀爀椀洀攀渀琀⸩㸾൥湤潢樍㜴㠶‰扪഼㰯䍛〮〠〮〠〮そ⽃潵湴‰⽄敳瑛㐲㌹‰⁒⽘奚‱㔵⸰′㌶⸰″㠰㔮そ⽆‰⽎數琠㜴㠵‰⁒⽐慲敮琠㜴㈰‰⁒⽐牥瘠㜴㠷‰⁒⽔楴汥⣾�＀䘀椀最甀爀攀 ㏾�＀⻾�@˾＀ 䴀攀愀渀 一䠀㈀伀䠀ⴀ琀漀ⴀ一㈀伀 挀漀渀瘀攀爀猀椀漀渀 爀愀琀椀漀猀 刀一䠀㈀伀䠀ⴀ琀漀ⴀ一㈀伀 椀渀 愀爀琀椀昀椀挀椀愀氀 猀漀椀氀猀 愀琀 搀椀昀昀攀爀攀渀琀 瀀䠀 愀渀搀 䴀渀伀㈀ 挀漀渀琀攀渀琀Ⰰ 愀渀 for organic matter of different origins at a fixed content of 2.

karenhanson commented 1 year ago

Ah, good point, I see what you mean. So the pull request repairs the symptom but not the cause of this problem.

I looked at your PDF in a text editor. Lines 189418 to 189425 contain the objs with the beginning and end of the text you see in the JHOVE output. It looks to me like it is reading (probably 16-bit) characters one at a time on line 189419, but then fails to handle the end of the line correctly. This garbles things, so it misses the endobj and doesn't correct itself until 5 lines later (possibly by reversing what happened when it reaches the end of line 189423). It then picks up from "for organic matter..." at the start of line 189424.
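
A small Python sketch illustrates this kind of misalignment. It is not JHOVE's parsing code, and both helper names are hypothetical. It assumes the title is stored as big-endian UTF-16 and that the reader slips by one byte, then keeps consuming byte pairs across the plain-ASCII PDF syntax that follows.

```python
from typing import List


def shift_utf16be(text: str) -> str:
    """Encode as UTF-16BE, drop the first byte, and decode the rest as UTF-16BE."""
    raw = text.encode("utf-16-be")[1:]
    raw = raw[: len(raw) - len(raw) % 2]  # drop a dangling odd byte
    return raw.decode("utf-16-be", errors="replace")


def read_ascii_as_utf16be(data: bytes) -> str:
    """Interpret plain single-byte data as if it were UTF-16BE code units."""
    data = data[: len(data) - len(data) % 2]
    return data.decode("utf-16-be", errors="replace")


def code_points(text: str) -> List[str]:
    """Format each character as U+XXXX for easy comparison."""
    return [f"U+{ord(ch):04X}" for ch in text]


# "moss" read one byte off: each ASCII byte becomes the HIGH byte of a code unit.
print(code_points(shift_utf16be("moss")))
# ['U+6D00', 'U+6F00', 'U+7300']

# The ASCII keyword "endobj" with surrounding carriage returns, read as byte pairs.
print(code_points(read_ascii_as_utf16be(b"\rendobj\r")))
# ['U+0D65', 'U+6E64', 'U+6F62', 'U+6A0D']
```

Both sequences appear in the reported output: U+6D00 U+6F00 U+7300 directly after "peat", and U+0D65 U+6E64 U+6F62 U+6A0D a little later, where the `endobj` keyword would be. That is consistent with a one-byte slip followed by reading past the end of the string, although it does not show where in the parser the slip happens.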

In that case, this could well be related to this legacy issue: https://github.com/openpreserve/jhove/issues/277 It used to cause a fatal NullPointerException or infinite loops. There have been various fixes (e.g. https://github.com/openpreserve/jhove/pull/652) that have helped resolve this in some instances... but it appears there is (unfortunately) still an issue with some scenarios. :(

carlwilson commented 1 year ago

Hi both, I'm not back from the summer vacation yet, but we'll take a look at this issue and review the PR for this year's release candidate.