pkp / ots

PKP XML Parsing Service
GNU General Public License v3.0

References are being replaced into XML with wrong ref id #91

Open axfelix opened 7 years ago

axfelix commented 7 years ago

I still need to track down where this is happening, but I've noticed that some Word documents have ref IDs that look like id="ID83ddb41b-5d29-4b6f-b862-74e019db4ec7" after being processed by meTypeset, but end up as "R56" after being output by our stack. Something is replacing the original IDs...

axfelix commented 7 years ago

Probably being caused by https://github.com/pkp/xmlps/blob/master/module/ReferencesConversion/src/ReferencesConversion/Model/Converter/References.php#L311, which was done to avoid Pandoc breaking on non-numeric ref IDs. Need to think of a way to fix this without breaking inline ref IDs...

axfelix commented 7 years ago

https://github.com/MartinPaulEve/meTypeset/issues/104

axfelix commented 7 years ago

Actually, @kaschioudi, I'm noticing that we also seem to be writing the wrong ref IDs into XML documents processed from PDF via Cermine -- this might be a wider problem in our implementation...

Vitaliy-1 commented 7 years ago

Actually, there are many issues with generating references. For example, meTypeset treats every string between ( ) as a citation, which is really inconvenient. Placing citations in square brackets does not solve the problem, because occasionally numbers that are not in square brackets are also parsed as references.

For now I have written Java code that parses all references in square brackets and assigns them the needed IDs. I also think it may be better to write a Java app with the JAXB library that parses the JATS after the DOCX XSLT transformation and gives well-formed JATS as output.
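As an illustration of the bracket-only approach described above, here is a minimal Python sketch. It is not the Java code mentioned in this comment; the function names, the `R` prefix, and the supported citation shapes (`[3]`, `[1, 4]`, `[2-5]`) are all assumptions made for the example.

```python
import re

# Hypothetical pattern: bracketed numeric citations such as [3], [1, 4] or [2-5].
# Parenthesised text like (2019) and bare numbers are deliberately not matched.
BRACKET_CITATION = re.compile(
    r"\[(\d+(?:\s*-\s*\d+)?(?:\s*,\s*\d+(?:\s*-\s*\d+)?)*)\]"
)


def expand_citation(body):
    """Expand a citation body such as '1, 3-5' into [1, 3, 4, 5]."""
    numbers = []
    for part in body.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(p) for p in part.split("-"))
            numbers.extend(range(start, end + 1))
        else:
            numbers.append(int(part))
    return numbers


def find_bracket_citations(text, id_prefix="R"):
    """Return (matched_text, [ref ids]) for each bracketed citation in text."""
    return [
        (m.group(0), [f"{id_prefix}{n}" for n in expand_citation(m.group(1))])
        for m in BRACKET_CITATION.finditer(text)
    ]
```

For example, `find_bracket_citations("See [2] and [1, 3-5], not (2019).")` picks up the two bracketed citations and ignores the parenthesised year. Hyphen ranges only are handled here; en-dash ranges would need an extra case.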

axfelix commented 7 years ago

We're aware that meTypeset overdetects parentheses as references -- I actually thought that this must be due to some recent changes we made to it as I've been noticing the problem more and more lately, but I tried reverting to an older version and the problem is still there, so it turns out it just never came up to this extent in our earlier testing. It's flagged as an issue.

As for whether we're investing more effort into parsing pre-JATS transformation or post-JATS transformation, it's a balance to strike between "cleanness" and lossiness.

axfelix commented 7 years ago

@kaschioudi, I think I probably misspoke when I was asking you to fix https://github.com/pkp/xmlps/issues/50 -- we can't arbitrarily increment ref IDs as in https://github.com/pkp/xmlps/blob/master/module/ReferencesConversion/src/ReferencesConversion/Model/Converter/References.php#L315. Whenever we change a ref ID, we need to update the inline xref rid that points to it so the two still match.

Heidelberg's MPT script has a component which I believe is designed to recurse through meTypeset output and change UUIDs to integer ref IDs when needed, so I'm going to test that first: https://github.com/withanage/mpt/blob/master/static/tools/archive/postProcess.py#L502
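To make the requirement concrete, here is a rough Python sketch of renumbering that keeps inline xrefs in sync. It is neither the ReferencesConversion PHP code nor MPT's postProcess.py; it assumes un-namespaced JATS-style `<ref id="...">` and `<xref rid="...">` elements, and the function name and `R` prefix are hypothetical.

```python
import xml.etree.ElementTree as ET


def renumber_refs(root, prefix="R"):
    """Give each <ref> a sequential id and rewrite matching xref rids.

    Returns the old-id -> new-id mapping that was applied.
    """
    id_map = {}
    for n, ref in enumerate(root.iter("ref"), start=1):
        old_id = ref.get("id")
        new_id = f"{prefix}{n}"
        if old_id is not None:
            id_map[old_id] = new_id
        ref.set("id", new_id)

    for xref in root.iter("xref"):
        rid = xref.get("rid")
        if not rid:
            continue
        # rid may hold several space-separated ids; ids that are not
        # in the map (figures, tables, ...) are left untouched.
        xref.set("rid", " ".join(id_map.get(r, r) for r in rid.split()))
    return id_map
```

The point of building `id_map` first is that the new number comes from the position of the `<ref>` in the reference list, while every inline `rid` is translated through the same map rather than being incremented independently.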

axfelix commented 7 years ago

Actually, I'm afraid MPT might be too convoluted to add to the workflow just for this -- let's see if we can handle it directly in ReferencesConversion.

axfelix commented 7 years ago

OK, this is working and merged into master!

There still seem to be a few issues -- the attached doc has a few unmatched rid="ref5" attributes, but all of the xrefs following the pattern rid="R20" are now matched. Not sure what's causing the difference between "R#" and "ref#" but will look into it. Have removed the branch for now because it was mixed in with #92 when I did the merge, but leaving the issue open.
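One way to list leftovers like the unmatched rid="ref5" attributes is a small check along these lines. This is only a sketch, assuming un-namespaced JATS-style markup; `unmatched_rids` is a hypothetical helper, not part of the xmlps stack.

```python
import xml.etree.ElementTree as ET


def unmatched_rids(root):
    """Return every xref rid value that has no element with that id."""
    known_ids = {el.get("id") for el in root.iter() if el.get("id")}
    missing = []
    for xref in root.iter("xref"):
        for rid in xref.get("rid", "").split():
            if rid not in known_ids:
                missing.append(rid)
    return missing
```

Running `unmatched_rids(ET.parse(path).getroot())` on an output file would return the dangling rid values in document order, which makes it easier to see whether they all share the "ref#" pattern.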

document.xml.txt 33345-106540-1-PB.pdf