pkp / ots

PKP XML Parsing Service
GNU General Public License v3.0

A merge module is needed to grab the best of each parser #53

Open jalperin opened 8 years ago

jalperin commented 8 years ago

A module that "matches" one parser's results with another, and can then selectively choose pieces from each would help make the most of what each parsing engine is good at.

The most obvious and immediate case is the front matter (described in issue 35) but there may be all kinds of other cases (some of which will become more obvious when we can compare the two approaches). For example, one parser might be better at picking out Table/Figure Captions than the other.

Ideas on how to best structure this module, how to do the XML comparisons, or even examples of XML snippets that might be problematic for merging would be welcome additions to this entry.

jalperin commented 8 years ago

Sorry for the assignee mess. I was actually just trying to tag you both: @axfelix @kaschioudi.

axfelix commented 8 years ago

Based on a brief conversation I just had with @kaschioudi to clarify some things in advance of our call, I wanted to mention that a merge module may use our test results directly, but does not necessarily have to, and probably won't over the long term. Directly using the test results (whether or not this involves programmatically parsing the robot output, which is certainly doable but may be a distraction) should be useful when initially building out the module, simply to provide concrete rules for which tags we prefer from which inputs while we code up the DOM manipulation in a relatively elegant and scalable way. Once this is working, we can then focus on dynamic comparisons in real time between more parsing engines, as long as they all output JATS.
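To make the "concrete rules for which tags we prefer from which inputs" idea tangible, here is a minimal sketch in Python using only the standard library. The rule table, the function name merge_by_rules, and the sample documents are all hypothetical; the actual preferences would come from the test results, and this is not how the PKP stack does it.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical preference table: which parser "wins" for each top-level
# JATS section. Real rules would be derived from the test results.
PREFERRED_SOURCE = {"front": "cermine", "body": "metypeset", "back": "metypeset"}


def merge_by_rules(outputs, rules=PREFERRED_SOURCE):
    """Build one <article> by taking each section from its preferred parser.

    `outputs` maps a parser name to the root <article> element it produced.
    """
    merged = ET.Element("article")
    for tag, parser in rules.items():
        section = outputs[parser].find(tag)
        if section is None:
            # Fall back to any parser that did produce this section.
            for other in outputs.values():
                section = other.find(tag)
                if section is not None:
                    break
        if section is not None:
            merged.append(copy.deepcopy(section))
    return merged


# Tiny made-up inputs, only to show the mechanics.
cermine = ET.fromstring(
    "<article><front><article-meta><title-group>"
    "<article-title>Good title</article-title>"
    "</title-group></article-meta></front>"
    "<body><p>Sparse body</p></body></article>"
)
metypeset = ET.fromstring(
    "<article><front/><body><sec><title>Intro</title><p>Rich body</p></sec></body>"
    "<back><ref-list/></back></article>"
)
merged = merge_by_rules({"cermine": cermine, "metypeset": metypeset})
print(ET.tostring(merged, encoding="unicode"))
```

Keeping the rules in a plain table means they can start out hand-written from the test results and later be replaced by something computed dynamically, without touching the DOM code.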

axfelix commented 8 years ago

Doing some research into this today. There are plenty of open source CLI XML diff tools, at least: http://www.mangrove.cz/diffmark/

axfelix commented 8 years ago

The most naive way of doing this that I can see would be to use a simple XML "merge" tool like http://www2.informatik.hu-berlin.de/~obecker/XSLT/#merge, which would at least (theoretically) merge everything together into the same DOM hierarchy, and then remove all of the redundant elements that doing so creates. We'd probably have to do a lot of comparing of element values at that point to make sure that no redundant text is preserved, though, and we'd have to work out the best way to discard only one of two or more roughly "equal" strings or elements. Cases where a given string winds up in a <caption> tag in one document but is only free text in another would be fairly straightforward -- we drop the free text -- but most cases won't be that simple.
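As a rough illustration of the "drop the free text when the same string is already in a <caption>" case, here is a sketch in Python. It assumes the two documents have already been merged into one tree, that free text shows up as <p> elements, and that a difflib similarity ratio of 0.9 is a reasonable stand-in for "roughly equal". The function names and the threshold are made up for the example.

```python
import difflib
import xml.etree.ElementTree as ET


def normalize(text):
    """Collapse whitespace and case so trivial differences don't matter."""
    return " ".join((text or "").split()).lower()


def roughly_equal(a, b, threshold=0.9):
    """Hypothetical similarity test; the threshold is an assumption."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def drop_free_text_duplicates(body, specific_tag="caption", generic_tag="p"):
    """Remove generic elements whose text duplicates a more specific element.

    Returns the number of elements removed. Note that ElementTree drops an
    element's tail text along with the element.
    """
    specific = list(body.iter(specific_tag))
    specific_texts = ["".join(el.itertext()) for el in specific]
    # Never remove paragraphs that live inside the specific element itself.
    protected = {id(d) for el in specific for d in el.iter()}
    removed = 0
    for parent in list(body.iter()):
        for child in list(parent):
            if child.tag != generic_tag or id(child) in protected:
                continue
            text = "".join(child.itertext())
            if normalize(text) and any(roughly_equal(text, s) for s in specific_texts):
                parent.remove(child)
                removed += 1
    return removed


body = ET.fromstring(
    "<body>"
    "<fig><caption><p>Figure 1. Growth of submissions over time.</p></caption></fig>"
    "<p>Figure 1. Growth of submissions over  time.</p>"
    "<p>Unrelated paragraph.</p>"
    "</body>"
)
print(drop_free_text_duplicates(body), "duplicate(s) removed")
```

The harder cases mentioned above (two differently tagged elements with near-identical text, or partial overlaps) would need a policy for which copy to keep; this sketch only covers the easy one.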

axfelix commented 8 years ago

We may want to scope this down for the time being to just whether we can get agreement on where the front matter ends and the body starts, to improve body text parsing, as that's a much narrower use case and one better served by the current combination of parsing libraries.

axfelix commented 8 years ago

Any progress on this? As mentioned, it's possible to get meTypeset to use its "native" (not great) front matter output by commenting out https://github.com/MartinPaulEve/meTypeset/blob/master/bin/meTypeset.py#L218, which we can safely do in our stack if we're always planning on merging in CERMINE front matter. It's probably a good idea to do so, so that we can use meTypeset's existing front matter output for comparison while working on the merge.

kaschioudi commented 8 years ago

I have been looking into the diffmark XML diff library. There's a PHP extension (http://php.net/manual/en/class.xmldiff-file.php), which I had to install through PECL.

I have created two sample XML files (file1.xml.txt, file2.xml.txt) and a PHP test script (diff.php.txt), and was able to successfully create a diff file (difference.xml.txt).

If we decide to go ahead with that library, we will need a class to interpret the XML format of the diff output.

kaschioudi commented 8 years ago

I also came across this article which explains how to compare two different XML documents using standard tools such as diff: https://www.safaribooksonline.com/library/view/python-cookbook/0596001673/ch12s09.html
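For reference, the general "normalize, then use a line-based diff" idea can be sketched in a few lines of Python. This is not the recipe from the linked article; it is an approximation using only the standard library (xml.etree.ElementTree plus difflib, with ET.indent requiring Python 3.9+), and the function name normalized_lines is made up for the example.

```python
import difflib
import xml.etree.ElementTree as ET


def normalized_lines(xml_text):
    """Return the document as one-element-per-line text with whitespace collapsed."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        # Collapse runs of whitespace; drop whitespace-only text and tails.
        el.text = " ".join(el.text.split()) if el.text and el.text.strip() else None
        el.tail = " ".join(el.tail.split()) if el.tail and el.tail.strip() else None
    ET.indent(root, space="  ")  # Python 3.9+
    return ET.tostring(root, encoding="unicode").splitlines()


# Made-up stand-ins for two parser outputs of the same article.
doc_a = (
    "<article><front><article-title>A  title</article-title></front>"
    "<body><p>Text</p></body></article>"
)
doc_b = (
    "<article>\n <front>\n<article-title>A title</article-title></front>\n"
    "<body><sec><p>Text</p></sec></body></article>"
)

diff = list(
    difflib.unified_diff(
        normalized_lines(doc_a),
        normalized_lines(doc_b),
        fromfile="a.xml",
        tofile="b.xml",
        lineterm="",
    )
)
print("\n".join(diff))
```

After normalization, formatting-only differences disappear and the diff shows only structural ones (here, the extra <sec> wrapper).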

axfelix commented 8 years ago

OK, thanks. "Normalizing" an XML document is familiar -- I've used Beautiful Soup to do that in Python, and there's probably an equivalent PHP library that doesn't require you to manually list the characters that need escaping.

I think diffing an entire document like this is going to be very difficult to work with -- should we try diffing "end of front matter plus some arbitrary amount of the beginning of the body text" across two documents to work on the front-matter-boundary issue? Maybe cut a document after the first or second </sec> and append </body></html> so it still validates? We could try running a subset of 100-200 documents from the coaction corpus like that and compare the diffs of the Cermine and meTypeset outputs (i.e. for both docx and pdf input), and see if there are consistent disagreements in front matter boundaries that way. @jalperin @jnicolls ?
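One way to do the "cut after the first or second </sec>" step without string surgery is to parse the document and prune the tree, so the result stays well-formed. A rough sketch in Python follows; it assumes un-namespaced JATS with <front>, <body> and <back> directly under the root, and it also drops <back>, which is my own assumption about what we'd want for a front-boundary comparison. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def truncate_after_sections(xml_text, keep_secs=2):
    """Keep the front matter plus the body up to and including the Nth <sec>."""
    root = ET.fromstring(xml_text)
    body = root.find("body")
    if body is not None:
        seen = 0
        for child in list(body):
            if seen >= keep_secs:
                # Everything after the last kept <sec> is discarded.
                body.remove(child)
            elif child.tag == "sec":
                seen += 1
    back = root.find("back")
    if back is not None:
        root.remove(back)  # assumption: back matter is irrelevant here
    return ET.tostring(root, encoding="unicode")


# Made-up sample document.
sample = (
    "<article><front><article-meta/></front>"
    "<body><p>Stray paragraph</p><sec><title>One</title></sec>"
    "<sec><title>Two</title></sec><p>Later text</p><sec><title>Three</title></sec></body>"
    "<back><ref-list/></back></article>"
)
print(truncate_after_sections(sample, keep_secs=2))
```

Anything that precedes the first <sec> in the body is kept on purpose, since stray paragraphs there are exactly where front matter boundary disagreements would show up.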

axfelix commented 8 years ago

Also, even if we don't wind up comparing individual body elements this way because Cermine tends to under-tag relative to meTypeset, we should also investigate using this same method to better determine the boundary between <body> and <back>, as with front and body.

kaschioudi commented 8 years ago

I just pushed a test module to compare differences between cermine and metypeset. To keep execution fast, the new module has not been integrated into the current workflow, since we are just testing at this point. There's a small readme.txt (https://github.com/pkp/xmlps/blob/8938f6affff5daedf83687164c56134e5d6f527c/module/MergePlayGround/src/MergePlayGround/readme.txt).

Essentially, to run the module, create the folder structure and copy the docx and pdf files into /corpus. Then run php module/MergePlayGround/src/MergePlayGround/merge.php and the diff files will be in /diffs

axfelix commented 8 years ago

Excellent! I might not have time to look at this until Friday but very excited to try it out

axfelix commented 8 years ago

We sort of left this hanging, but right now we need to rearrange the merge so it gets called before the reference parsing code:

https://github.com/pkp/xmlps/commit/65c2685bebab0263d1202565cc344b8aed927450

Need to double-check which document the reference parsing modules were expecting up to now and change that; I believe it was the meTypeset output rather than the merged XML.