Open trebunski opened 1 year ago
I believe this is the same issue as https://github.com/openpreserve/jhove/issues/877 - I found it too and submitted a pull request that is awaiting review: https://github.com/openpreserve/jhove/pull/878 It's great that we now have a file to test with, as I couldn't share the one I had. I ran your file through my local version, which includes the fix I submitted, and the output no longer contains the invalid character.
I'm not sure, but it seems like a lot more is broken in the output than just these 0xFFFE characters: the result file also contains some hieroglyphs that are not in the source PDF.
Example: The source PDF contains the following figure captions: Figure 4. 13C- and 15N-CPMAS-NMR spectra of the different organic materials (cyanobacterium, clover, watermilfoil, peat moss) used in the experiment.
Figure 3. Mean NH2OH-to-N2O conversion ratios (RNH2OH-to-N2O) in artificial soils at different pH and MnO2 content, and for organic matter of different origins at a fixed content of 2.5% (w/w). The total amount of NH2OH added was 5nmol. Different symbols represent RNH2OH-to-N2O for the artificial soil mixtures with the different organic materials (n=3, SD<5%, not shown).
The result XML (see the result XML excerpt in my opening post) contains a value that mixes parts of these two strings (why?) with hieroglyphs in between (why?):
Ah, good point, I see what you mean. So the pull request repairs the symptom but not the cause of this problem.
I looked at your PDF in a text editor. Lines 189418 to 189425 contain the objects holding the beginning and end of the text you see in the JHOVE output. It looks to me like JHOVE is reading the text character by character (probably as 16-bit characters) on line 189419, but then fails to handle the end of the line correctly. This garbles things so that it misses the endobj and doesn't correct itself until 5 lines later (possibly by reversing what happened, when it reaches the end of line 189423). It then picks up from "for organic matter..." at the start of line 189424.
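To illustrate the kind of garbling I mean, here is a minimal sketch. It is not JHOVE's code, and I am only assuming the text is UTF-16BE and that the reader slips by one byte; the class and method names are made up for the example. Decoding the same bytes from an odd offset turns plain Latin text into CJK-looking characters, which would match the "hieroglyphs" in the output.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of JHOVE.
final class Utf16ShiftDemo {

    // Decode the bytes as UTF-16BE starting at the given offset,
    // using only whole 2-byte units (a trailing odd byte is dropped).
    static String decodeFrom(byte[] data, int offset) {
        int len = (data.length - offset) & ~1;
        return new String(data, offset, len, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        byte[] bytes = "Figure".getBytes(StandardCharsets.UTF_16BE);

        // Aligned read: the original text comes back.
        System.out.println(decodeFrom(bytes, 0));

        // Assumed failure mode: the reader is off by one byte, so the
        // high and low bytes of every character are swapped with their neighbours.
        String shifted = decodeFrom(bytes, 1);
        for (int i = 0; i < shifted.length(); i++) {
            System.out.printf("U+%04X%n", (int) shifted.charAt(i));
        }
    }
}
```

If something like this is happening, a second one-byte slip further on would bring the reader back into alignment, which would fit the output recovering a few lines later.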
In that case, this could well be related to this legacy issue: https://github.com/openpreserve/jhove/issues/277 It used to cause a fatal NullPointerException or infinite loops. There have been various fixes (e.g. https://github.com/openpreserve/jhove/pull/652) that have helped resolve this in some instances... but it appears there is (unfortunately) still an issue with some scenarios. :(
Hi both, I'm back from summer vacation, and we'll take a look at this issue and review the PR for this year's release candidate.
Hello dear developers, I have found a PDF that JHOVE does not process correctly. The generated result XML contains characters that are not valid in an XML document, so the XML can no longer be used by downstream systems.
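For anyone who wants to check their own output, here is a small sketch of how the invalid characters can be found. It follows the Char production of the XML 1.0 specification; the class and method names are my own and have nothing to do with JHOVE's internals. 0xFFFE falls outside the allowed ranges, which is why parsers reject the file.

```java
// Hypothetical helper, not part of JHOVE.
final class XmlCharCheck {

    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Index of the first code point that XML 1.0 does not allow, or -1 if none.
    static int firstInvalidIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isValidXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    // Replace every disallowed code point with U+FFFD (the replacement character).
    static String replaceInvalid(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.appendCodePoint(isValidXmlChar(cp) ? cp : 0xFFFD);
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "Figure 3\uFFFE Mean";
        System.out.println(firstInvalidIndex(sample));
        System.out.println(firstInvalidIndex(replaceInvalid(sample)));
    }
}
```

Replacing the characters only makes the file parseable again as a workaround; it does not recover the original text.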
Tested version: release="1.26.1" date="2022-07-14"
Example PDF: https://epflicht.ulb.uni-bonn.de/download/pdf/363239?originalFilename=true
Reproduction command: /bin/sh jhove -c conf/jhove.conf -h XML -m PDF-hul Energy_Environment_390.pdf -o Energy_Environment_390-out.xml
Result: