pkp / ots

PKP XML Parsing Service
GNU General Public License v3.0
32 stars 19 forks source link

Inconsistent result in the API while retrieving a parsing result. #109

Closed fabiobatalha closed 6 years ago

fabiobatalha commented 6 years ago

When attempting to retrieve the parsing results of a job which were successfully parsed the API retrieve an error (Couldn't find document).

ex: Parsing Status request

http://pkp-udev.lib.sfu.ca/api/job/status?email=fabio.batalha@erudit.org&access_token=mytoken&id=2020

Response

{"status":"success","jobStatus":2,"jobStatusDescription":"Completed","flashMessages":[]}

status

Attempt to retrieve parsed document

http://pkp-udev.lib.sfu.ca/api/job/retrieve?email=fabio.batalha@erudit.org&access_token=mytoken&id=2020&conversionStage=14

Response

{"status":"error","error":"Couldn\u0027t find document","flashMessages":[]}

retrieve

axfelix commented 6 years ago

I can only reproduce this on a few document stages -- I can confirm that 14 isn't working, but e.g. 15 is. It looks like we might've broken a few document stage registrations for the API; I'll look into this.

axfelix commented 6 years ago

@kaschioudi

axfelix commented 6 years ago

Only guess is that I might've somehow broken this with the last line of this commit? https://github.com/pkp/ots/commit/4b651951d4adc45fa67a0dc74db86d7b6a038b92

I stopped passing job to the merge function because we were no longer testing for different input file formats in the merge itself, just in the job wrapper, but it's possible this could somehow be preventing the document stage registration hook to work properly (that code has always been confusing to me).

kaschioudi commented 6 years ago

I am unable to reproduce this.

Please note that you will get that error message in case that conversion stage did not run or there was no document produced at that stage. Please make sure that the targeted stage was executed during the conversion and also that a document was generated during that stage.

For a given job id, By running the query below off the database you can obtain the list of available documents and their associated stages.

select * from document where jobId=## order by conversionStage

kaschioudi commented 6 years ago

@axfelix : Could you please run select * from document where jobId=2020 order by conversionStage on http://pkp-udev.lib.sfu.ca mysql server and let me know the output?

axfelix commented 6 years ago

mysql> select * from document where jobId=2020 order by conversionStage; +-------+----------------------------------------------+-----------------+----------------------+--------+-------+ | id | path | conversionStage | mimeType | size | jobId | +-------+----------------------------------------------+-----------------+----------------------+--------+-------+ | 19492 | var/documents/422/2020/document.bib | 3 | text/plain | 24237 | 2020 | | 19491 | var/documents/422/2020/document.xml | 5 | application/xml | 102053 | 2020 | | 19494 | var/documents/422/2020/html.zip | 7 | application/zip | 164943 | 2020 | | 19495 | var/documents/422/2020/document.pdf | 9 | application/pdf | 107274 | 2020 | | 19496 | var/documents/422/2020/documents.zip | 10 | application/zip | 548640 | 2020 | | 19493 | var/documents/422/2020/document.epub | 11 | application/epub+zip | 65524 | 2020 | | 19489 | var/documents/422/2020/document_from_pdf.xml | 13 | application/xml | 104565 | 2020 | | 19488 | var/documents/422/2020/upload/1040393ar.pdf | 16 | application/pdf | 226238 | 2020 | | 19490 | var/documents/422/2020/grobid.xml | 19 | application/xml | 9043 | 2020 | +-------+----------------------------------------------+-----------------+----------------------+--------+-------+ 9 rows in set (0.00 sec)

Looks like several stages failed to register, which is consistent with the API behaviour we're seeing, I'm just not sure how...

kaschioudi commented 6 years ago

stages did not fail to register. looks like documents are associated to the last stage that interacted with them. I ran a conversion of eeg_comicsans.docx and watched the logs to follow the whole conversion step by step.

The conversion happened through the following stages: nlmxml => wppdf => cermine => grobid => merge => references => bibtexreferences => epub => html => citationstyle => pdf => xmp => zip

Then I ran the following queries in mysql

mysql> select * from document where jobId=3442 order by id ; +-------+------------------------------------------------+-----------------+-------------------------------------------------------------------------+---------+-------+ | id | path | conversionStage | mimeType | size | jobId | +-------+------------------------------------------------+-----------------+-------------------------------------------------------------------------+---------+-------+ | 23436 | var/documents/1/3442/upload/eeg_comicsans.docx | 15 | application/zip | 719136 | 3442 | | 23437 | var/documents/1/3442/document.docx | 1 | application/vnd.openxmlformats-officedocument.wordprocessingml.document | 719571 | 3442 | | 23438 | var/documents/1/3442/document_metypeset.xml | 2 | text/html | 73333 | 3442 | | 23439 | var/documents/1/3442/document_from_wp.pdf | 12 | application/pdf | 352890 | 3442 | | 23440 | var/documents/1/3442/document_from_pdf.xml | 13 | application/xml | 59412 | 3442 | | 23441 | var/documents/1/3442/grobid.xml | 19 | application/xml | 2548 | 3442 | | 23442 | var/documents/1/3442/document.xml | 5 | application/xml | 76783 | 3442 | | 23443 | var/documents/1/3442/document.bib | 3 | text/plain | 6795 | 3442 | | 23444 | var/documents/1/3442/document.epub | 11 | application/epub+zip | 1364994 | 3442 | | 23445 | var/documents/1/3442/html.zip | 7 | application/zip | 1469145 | 3442 | | 23446 | var/documents/1/3442/document.pdf | 9 | application/pdf | 496363 | 3442 | | 23447 | var/documents/1/3442/documents.zip | 10 | application/zip | 4003112 | 3442 | +-------+------------------------------------------------+-----------------+-------------------------------------------------------------------------+---------+-------+

mysql> select argument from mysql.general_log where argument like '%INSERT INTO document%'; +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | argument | +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | INSERT INTO document (path, conversionStage, mimeType, size, jobId) VALUES ('var/documents/1/3442/upload/eeg_comicsans.docx', 0, 'application/zip', 719136, 3442) | | INSERT INTO document (path, conversionStage, mimeType, size, jobId) VALUES ('var/documents/1/3442/document.docx', 1, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 719571, 3442) | | INSERT INTO document (path, conversionStage, mimeType, size, jobId) VALUES ('var/documents/1/3442/document_metypeset.xml', 2, 'text/html', 73333, 3442) | | INSERT INTO document (path, conversionStage, mimeType, size, jobId) VALUES ('var/documents/1/3442/document_from_wp.pdf', 12, 'application/pdf', 352890, 3442) | | INSERT INTO document (path, conversionStage, mimeType, size, jobId) VALUES ('var/documents/1/3442/document_from_pdf.xml', 13, 'application/xml', 59412, 3442) | | INSERT INTO document (path, conversionStage, mimeType, size, jobId) VALUES ('var/documents/1/3442/grobid.xml', 19, 'application/xml', 2548, 3442) | | INSERT INTO document (path, conversionStage, mimeType, size, jobId) VALUES ('var/documents/1/3442/document.xml', 14, 'application/xml', 72956, 3442) | | INSERT INTO document (path, conversionStage, mimeType, size, jobId) VALUES ('var/documents/1/3442/document.bib', 3, 'text/plain', 6795, 3442) | | INSERT INTO document (path, conversionStage, mimeType, size, jobId) VALUES ('var/documents/1/3442/document.epub', 11, 'application/epub+zip', 1364994, 3442) | | INSERT INTO document (path, conversionStage, mimeType, size, jobId) VALUES ('var/documents/1/3442/html.zip', 6, 'application/zip', 1468751, 3442) | | INSERT INTO document (path, conversionStage, mimeType, size, jobId) VALUES ('var/documents/1/3442/document.pdf', 8, 'application/pdf', 487366, 3442) | | INSERT INTO document (path, conversionStage, mimeType, size, jobId) VALUES ('var/documents/1/3442/documents.zip', 10, 'application/zip', 4003112, 3442) | +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

mysql> select argument from mysql.general_log where argument like '%UPDATE document%'; +--------------------------------------------------------------------------+ | argument | +--------------------------------------------------------------------------+ | UPDATE document SET conversionStage = 15 WHERE id = 23436 | | UPDATE document SET conversionStage = 5, size = 76783 WHERE id = 23442 | | UPDATE document SET conversionStage = 7, size = 1469145 WHERE id = 23445 | | UPDATE document SET conversionStage = 9, size = 496363 WHERE id = 23446 | +--------------------------------------------------------------------------+

Long story short, we can see that even though document.xml (id 23442) was associated to stage 14 (JOB_CONVERSION_STAGE_XML_MERGE) initially. Later the document is associated to stage 5 (JOB_CONVERSION_STAGE_BIBTEXREFERENCES).

axfelix commented 6 years ago

That explains why this code has always confused me! I think the original approach taken by our previous developer is actually quite brittle and makes implementation difficult ...

@kaschioudi what do you think it would take to rewrite this to no longer de-register documents from previous stages?

kaschioudi commented 6 years ago

@axfelix : I need to look deeper into DocumentDAO implementation. But a side affect I see with implementing such change is that doing so would also mean keeping multiple versions of some documents. For instance, we will end with multiple versions of document.xml. Since the documents are all stored in the job root folder we need to think about which strategy to adopt to avoid name collisions. Any suggestions?

axfelix commented 6 years ago

We can use descriptive names derived from the job stages -- as long as they don't collide we can always revise them later.

kaschioudi commented 6 years ago

alright. I will look into this.

kaschioudi commented 6 years ago

I have identified 4 modules which de-register documents from previous stages: JOB_CONVERSION_STAGE_BIBTEXREFERENCES, JOB_CONVERSION_STAGE_CITATIONSTYLE, JOB_CONVERSION_STAGE_XMP and JOB_CONVERSION_STAGE_BIBTEX. The fix for the first three are ready to go. However, I am not sure how to proceed with the last module.

JOB_CONVERSION_STAGE_BIBTEX checks to see if document.bib had been previously created (cf https://github.com/pkp/ots/blob/master/module/BibtexConversion/src/BibtexConversion/Model/Queue/Job/BibtexJob.php#L44). In case the document did not exist, it's created and associated to JOB_CONVERSION_STAGE_BIBTEX stage. Otherwise, only the file size is update and the document remains associated to JOB_CONVERSION_STAGE_REFERENCES.

I could make the change so that both documents are registered. However my concern is that citation style module later tries to fetch document.bib by looking first for a document associated to JOB_CONVERSION_STAGE_REFERENCES and then JOB_CONVERSION_STAGE_BIBTEX (cf https://github.com/pkp/ots/blob/master/module/CitationstyleConversion/src/CitationstyleConversion/Model/Queue/Job/CitationstyleJob.php#L33-L37).

@axfelix : Will the citation style approach for loading document.bib still be valid if multiple document.bib are registered?

kaschioudi commented 6 years ago

@axfelix : Here's the PR https://github.com/pkp/ots/pull/112

When I initially made the change, the conversion was silently failing and I couldn't figure out why. I believe it was due to the rules on the Document entity, more specifically the unique rule on the path property (cf. https://github.com/pkp/ots/blob/master/module/Manager/src/Manager/Entity/Document.php#L24-L27).

So the workaround was to

axfelix commented 6 years ago

Thanks for the quick work on this! When we first took over this codebase, this always seemed like the kind of thing that should be self-enforcing and it was very confusing to discover that it was not.

If I'm following you correctly -- now that both documents can be registered, would it make sense to just invert the behaviour in https://github.com/pkp/ots/blob/master/module/CitationstyleConversion/src/CitationstyleConversion/Model/Queue/Job/CitationstyleJob.php#L33? So that JOB_CONVERSION_STAGE_BIBTEX is checked before JOB_CONVERSION_STAGE_REFERENCES? I think the intent here was to avoid failing in cases where BibTeX parsing was never called (e.g. no references detected, or all references successfully looked up via CrossRef so no local parsing needed), but if the JOB_CONVERSION_STAGE_BIBTEX is now always newer as long as it exists... shouldn't it be checked first, with JOB_CONVERSION_STAGE_REFERENCES as the fallback?

axfelix commented 6 years ago

Also, please keep in mind I'm speculating here and you probably know better than I do -- this month is going to be a little frustrating as there's a lot of attention on OTS right now but I'm pretty busy with other projects, so I really appreciate your thoughtful work on this.

kaschioudi commented 6 years ago

You are correct! if we go with registering both documents, the logic should be inverted. Look for JOB_CONVERSION_STAGE_BIBTEX and fallback to JOB_CONVERSION_STAGE_REFERENCES.

So okay to register both document.bib documents?

axfelix commented 6 years ago

It should be! It's worth implementing that and testing, yeah.

kaschioudi commented 6 years ago

@axfelix : I made the change to JOB_CONVERSION_STAGE_BIBTEX module as well. This is now good for testing. Please give it a try and let me know if there's anything else.

axfelix commented 6 years ago

@fabiobatalha this is fixed! thanks @kaschioudi ! I've also added a note in our readme to clarify that the final XML is document stage 5.

fabiobatalha commented 6 years ago

thanks @axfelix

I will try it asap.

fabiobatalha commented 6 years ago

@axfelix

The same issue again with a different document.

It is available through the interface, but the API retrieve a not found error.

Job: 2631

screenshot from 2018-02-22 12-37-31

axfelix commented 6 years ago

I think this is happening because no references were detected, so you'd have to request stage 14 instead of stage 5 to get the final XML, but that's really unintuitive behaviour...

@kaschioudi -- what can we do about this? I think this explains the behaviour I was confused by earlier in the week.

fabiobatalha commented 6 years ago

@axfelix

Yes it is really unintuitive, for the user perspective it would be transparent even using the stage 5.

axfelix commented 6 years ago

Yeah, sorry, we'll fix this.

fabiobatalha commented 6 years ago

no problem!

maybe the convertionStage parameter could receive a list of desired formats, eg: conversionStage=5,14

if the first is not available the API try to retrieve the next expected format.

axfelix commented 6 years ago

I think the cleanest solution here (keeping the solution to a single API call while not violating our existing logic around modules) would be to add an additional module that always contains the final JATS XML output -- the behaviour would be to copy the document from stage 5 if it exists, or from stage 14 if it doesn't.

@kaschioudi , can you take this?

kaschioudi commented 6 years ago

@axfelix : here's the PR https://github.com/pkp/ots/pull/119

ps: xmlfinal module will run after epub because bibtexreferences conversion does not always run.

axfelix commented 6 years ago

Final stage is now always 20. @fabiobatalha

Thanks, @kaschioudi

fabiobatalha commented 6 years ago

@axfelix @kaschioudi Tested and working!

Thanks