pkp / ots

PKP XML Parsing Service
GNU General Public License v3.0

Integrate CERMINE #15

Closed crism closed 9 years ago

crism commented 9 years ago

Integrate https://github.com/CeON/CERMINE as a module.

axfelix commented 9 years ago

This requires:

- Another LibreOffice conversion (DocX → PDF, as input to CERMINE)
- A CERMINE call
- Stitching the XML outputs together: meTypeset body + back with CERMINE front matter

crism commented 9 years ago

Looks like we should be using the standalone CERMINE jar via command-line? Or do we want to try to make a crossover call from PHP into Java? (Is that even possible?)

axfelix commented 9 years ago

I don't think there's gonna be anything more sensible than exec()
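A minimal sketch of the exec() approach (the command line matches the standalone-jar invocation shown later in this thread; file and jar paths are placeholders):

```php
<?php
// Sketch only: shell out to the standalone CERMINE jar from PHP.
// Jar location and file paths are placeholders.
$input  = escapeshellarg('/path/to/document.pdf');
$output = escapeshellarg('/path/to/cermine.xml');

exec("java -cp cermine.jar pl.edu.icm.cermine.PdfNLMContentExtractor"
    . " -path $input > $output 2>&1", $outputLines, $status);

if ($status !== 0) {
    // conversion failed; inspect $outputLines for the Java stack trace
}
```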

crism commented 9 years ago

Integrate DocX→PDF into DocxConversion module? Or make a new one? (Would need one new converter and one new queue job. Do we want to keep one of each in each module, or is it OK to double up?)

axfelix commented 9 years ago

Good question. I think I'd probably lean toward making it part of the DocxConversion module so that we aren't potentially duplicating any LibreOffice env scaffolding in two different places in our stack? But I can't immediately think of whether calling one module two different ways at different points in the conversion will pose problems for the way we've built this (it shouldn't).

crism commented 9 years ago

Let’s do that, then. DocxConversion\Model\Converter\Pdf and …\Queue\PdfJob, with associated test classes?

axfelix commented 9 years ago

Perfect.

crism commented 9 years ago

Dur. The existing unoconv path will do this, if we set the output filter type to pdf.
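For reference, unoconv selects the output filter with its -f flag, so the existing call can emit PDF directly (the file name here is a placeholder):

```shell
# Same unoconv path as the existing DocX conversion, with the output
# filter switched to pdf; writes document.pdf alongside the input.
unoconv -f pdf document.docx
```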

axfelix commented 9 years ago

Yup. Sorry I didn't think to say that explicitly when I was talking about making it part of the same "env scaffolding" (needlessly indirect language on my part) earlier.

crism commented 9 years ago

We do need a new job class in the queue. Which now makes me feel a little guilty about shoehorning this into DocxConversion after all; this module was intended for making DocX, not for converting from it. We’re kind of hijacking it… I think we might be better off cloning it, now that I understand how it works better.

axfelix commented 9 years ago

I'm equally leery of creating three different modules to support Cermine, but I think that might be more of a documentation issue than anything else.

crism commented 9 years ago

It seems the approach so far has been to have very small, tightly focused modules. CERMINE support will require word-processor-to-PDF conversion, but that seems like it could be a generally useful module in its own right, would it not? Or do we want to focus users on the existing PdfConversion, which converts to HTML and thence to PDF?

axfelix commented 9 years ago

The existing PdfConversion is definitely further removed in scope from the existing DocxConversion, though, due to the difference between calling unoconv/libreoffice and calling wkhtmltopdf/any-future-xsl-fo-or-cassius-regions implementation. Maybe rename the existing PdfConversion module to finalPDFRender and make the new one librePDFConvert or something, and you can duplicate the existing DocxConversion almost 1:1 with a different parameter passed to uno?

crism commented 9 years ago

That’s where I was heading… we need a better naming convention, I think, incorporating both input and output. I don’t want to change everything (at least not in the middle of a sprint!), but we can just call the new one as you suggest: WpPdfConversion, maybe?

axfelix commented 9 years ago

Deeply sensible :)

crism commented 9 years ago

I’ll do that, then.

crism commented 9 years ago

@axfelix: Got a good test input PDF for CERMINE testing?

crism commented 9 years ago

Current plan is to use CERMINE to extract the article content from PDF, then pass that along to the reintegration queue.

Below is the (abridged) result of running CERMINE on module/XmpConversion/test/assets/document.pdf. Is this what we want?

<article>
  <front>
    <journal-meta />
    <article-meta>
      <abstract>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris ultri…</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="1">
      <title>-</title>
      <p>Mauris ipsum diam, iaculis blandit turpis eget, volutpat adipiscing diam…</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>

axfelix commented 9 years ago

Define "abridged" -- the front matter should be populated (and pretty well)?

axfelix commented 9 years ago

and w/r/t running it on text strings rather than PDFs -- that shouldn't be nearly as good, as it's trained to use typesetting cues (layout, word distance, etc.) to parse out different metadata elements.

crism commented 9 years ago

@axfelix: “Abridged”: I chopped out most of the paragraphs, indicated by ellipses. Otherwise, that code block is exactly what I got.

And yeah, I changed my mind on the strings. I have it installed via Composer, and ready to go, as soon as I figure out what we want to do with it…

axfelix commented 9 years ago

So you're saying you didn't get anything meaningful in article-meta, eh? That's not consistent with my results... let me dig up the way I was running it.

crism commented 9 years ago

Using the example PDF files at the CERMINE home page gives better results for the article metadata. The example in XmpConversion doesn’t have anything to extract, I guess.

axfelix commented 9 years ago

Okay, I just built from Cermine's github and it's working for me:

java -cp cermine.jar pl.edu.icm.cermine.PdfNLMContentExtractor -path ~/Desktop/test.pdf > ~/Desktop/cermineout.txt

input and output: http://pkp-udev.lib.sfu.ca/parsingdev/cermine/

crism commented 9 years ago

Extraction works now, but finfo_file() reports that the MIME type of the result is text/x_c++, inexplicably. I don’t want to nerf the test, in case garbage ends up in there, but I haven’t found a way around this yet.

axfelix commented 9 years ago

That's quite bizarre! Are you just catting the output to filename.XML?

crism commented 9 years ago

Hrm; example1.pdf yields text/x_c++ as its extracted MIME type, but example[23].pdf give text/html. PHP has a messed-up magic number file, methinks. And yes, @axfelix, the output file is called .xml.
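The sniffing behaviour can be reproduced with a small finfo check; a sketch (the fallback to inspecting the XML root element is a suggestion, not what the module currently does):

```php
<?php
// Sketch: check what libmagic reports for an extracted file, and fall
// back to inspecting the XML root element when the MIME type is odd.
$file = tempnam(sys_get_temp_dir(), 'cermine');
file_put_contents($file, '<?xml version="1.0"?><article><front/></article>');

$finfo = new finfo(FILEINFO_MIME_TYPE);
$mime  = $finfo->file($file);

// libmagic sometimes misreads XML with unusual leading content as
// text/x-c++ or text/html; checking the root element is more reliable.
$xml = simplexml_load_file($file);
$looksLikeJats = ($xml !== false && $xml->getName() === 'article');
```

A root-element check like this would let the test stay enabled even when PHP's magic database misidentifies the file.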

crism commented 9 years ago

I guess I’d better just disable that test. Sad.

axfelix commented 9 years ago

Yeah.

crism commented 9 years ago

We have a converter and a queue for content extraction. @axfelix, is your piece ready to wire up to the end of that pipe? (Hello, mixed metaphor!)

axfelix commented 9 years ago

I'm about to give a talk here but I'll finish it afterward. You may want to look at my most recent commit on master if you have more time before then and you'd rather finish it up ... it's pretty half-done. I don't think I want to use the XSL approach after all; given the stitching, it's probably just easier to cat a couple of files together.

axfelix commented 9 years ago

I'm pretty tempted to chuck everything I wrote yesterday and just put in the equivalent of:

<?php
$cermine_output = file_get_contents('cermine.xml');
$metypeset_output = file_get_contents('metypeset.xml');
// keep everything before <body> from CERMINE (the front matter)
$cermine_front = preg_replace('/<body>.*/s', '', $cermine_output);
// keep everything after </front> from meTypeset (body + back matter)
$metypeset_body_and_back = preg_replace('/.*?<\/front>/s', '', $metypeset_output);
$munged_xml = $cermine_front . $metypeset_body_and_back;

axfelix commented 9 years ago

but this is why I have a poor track record for high-level php engineering.

crism commented 9 years ago

How do we drive that? (See my question about how these bits wire together—where are the prerequisites and dependencies given?)

axfelix commented 9 years ago

The most sensible way is to work from the commits I started yesterday, minus the XSL, since (given that we're concatenating two documents anyway) I honestly think it'll be less trouble to permit the sin of parsing XML with regex. but right now that's just the clumsy beginnings of a fork of the existing PDF conversion module. I'll have a crack at finishing it ...

axfelix commented 9 years ago

aaaaaaaaaaand just getting back to this.

axfelix commented 9 years ago

OK, just looked over your stuff and it seems great (added a few minor comments) -- can you confirm for me that you've at this point added two basically-complete, standalone queues + modules, and all that's needed is the third one that I started, then wiring them all in (as opposed to writing some overarching structure for the three CERMINE-related calls)?

crism commented 9 years ago

I believe that’s correct: there are two new queues, to which your third can be added.

I didn’t think the stage constant values were meaningful! With this tool, we have a fork in the toolchain, and options to offer users. With my local version (which is now working, sort of), I don’t see any UI for controlling the workflow—is that intended?

axfelix commented 9 years ago

At this point, I think we're OK having the two forking paths "join up" (i.e., by merging CERMINE's front matter output with meTypeset's body and back, and using that XML for the HTML/PDF/XMP/ePub transformations afterward). Eventually, we still want to make use of CERMINE's body output in a more meaningful way (runtime comparison), but that should still be a non-interactive process until one of our dev partners steps up with a preliminary WYSIWYG interface.

However, there's a definite rationale for API calls that only utilize part of the workflow, for mid-stream input (e.g. if a user already has their own manually typeset XML and wants HTML/PDF/XMP/ePub) or returning mid-stream output (if a user has a doc or pdf and wants just the BibTeX containing the citations). This came up a few times last week when we mentioned our nascent OJS plugin (https://github.com/pkp/ojs-markup) and some folks expressed wanting to incorporate different workflows. This is of definite interest once we have baseline CERMINE and ePub modules working, probably second only to upstream meTypeset contributions to improve the validity of output (as well as the accuracy, once our test suite is fully operational). It does mean getting up to speed on OJS plugin code, though, and I'm not in too much of a hurry to pull you off of the stack work now that you have a good handle on it.

The stage constant values are addressed in the API documentation at the end of the current Github readme, btw.

crism commented 9 years ago

I merged master in, so we can make sure #11 and #15 can coexist (and they seem to). All the unit tests pass. What else needs done?

axfelix commented 9 years ago

My commits are still a little incomplete and I haven't quite found the headspace to polish them off properly yet (not enough PHP brain in the past few days to figure out how to properly take two different files as input to a module). If you want to have a look I certainly won't be offended but otherwise I'll try to finish it myself this week.

crism commented 9 years ago

As per #11, I added manager logic to start the WP-to-PDF and CERMINE jobs.

The tough part is how to trigger the recombination job—since that depends on two jobs being complete. My best idea:

Pick one predecessor—whichever is likely to be slower?—and kick off the combination job. That job would monitor the database (gently) until both predecessors are confirmed complete, then actually start its work.

It would kind of need to do that, anyway, to find both of its inputs, since only a single input document is normally provided.

Does that make sense?
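The polling scheme above might look roughly like this (a sketch only; runMergeJob, the callable parameters, and the timeout value are all hypothetical, not names from the codebase):

```php
<?php
// Sketch: a merge job that waits for both predecessor jobs
// (WP-to-PDF and CERMINE) to complete before doing its real work.
// Both callables are hypothetical stand-ins for database checks.
function runMergeJob($jobId, callable $bothPredecessorsComplete, callable $startMerge)
{
    $timeout = 600; // give up after ten minutes
    $waited = 0;
    while (!$bothPredecessorsComplete($jobId)) {
        if ($waited >= $timeout) {
            throw new RuntimeException("Predecessors of job $jobId never completed");
        }
        sleep(5); // poll the database gently
        $waited += 5;
    }
    $startMerge($jobId);
}
```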

axfelix commented 9 years ago

Thanks! That's about as far as my thinking had gotten, too ... I'll finally have a chance to finish my module tomorrow and I'll see if that way works in practice. I can't immediately see why it shouldn't!

axfelix commented 9 years ago

Just took another pass at this, but handling the two different input file paths could use some work -- would you mind taking a look at https://github.com/pkp/xmlps/commit/78616aa8e095fb153229fe0b795acc2928d48a7d when you get a chance? Appreciate it.

axfelix commented 9 years ago

Specifically, I tried to crib some logic around outputTmpPath from the existing PDF conversion module, and I think I'm overloading setInputFile, but I'm not immediately clear on the best way to paste the existing metypeset.xml and cermine.xml outputs to a working directory based on the output of the prior modules (which I haven't touched). If you grep for every instance of those two calls, you should see everything that's currently ambiguous.

crism commented 9 years ago

I’ll meditate on this. I’ll be out this evening, but may spend more time with it when I get back.

crism commented 9 years ago

Commented in Slack, but not here. This worked, on the server, for a simple test, but stalled out on a more complicated one (eeg_comicsans). Curiously, the job status went back to “Pending,” instead of “Processing”—something is returning a 0 somewhere that shouldn’t. Still, all the individual pieces are working, and sometimes working when wired together.

crism commented 9 years ago

I think the status was a canard. Still seeing the same results, locally, though. Trivial lorem ipsum (document.odt) gives:

| 2015-09-23 16:55:08      |        6 | Job 305 is completed                 |
| 2015-09-23 16:55:07      |        6 | Queued job (305) in queue zip        |
| 2015-09-23 16:55:06      |        6 | Queued job (305) in queue merge      |
| 2015-09-23 16:54:50      |        6 | Queued job (305) in queue cermine    |

while a more complicated one, with references (eeg_comicsans.doc), gives:

| 2015-09-23 16:58:34      |        6 | Queued job (306) in queue merge            |
| 2015-09-23 16:57:47      |        6 | Queued job (306) in queue cermine          |

… it’s now been in the merge queue for five minutes, and is clearly going nowhere. Why?

axfelix commented 9 years ago

Hey, this is definite progress! For the queue issue, though... merge shouldn't be running until after cermine does ... do we have it firing too early?

crism commented 9 years ago

The database listing there is in reverse time order. The merge starts after cermine, then zip, then done.