pkp / ots

PKP XML Parsing Service
GNU General Public License v3.0

Errors on tests after install #88

Closed Vitaliy-1 closed 7 years ago

Vitaliy-1 commented 7 years ago

After a full install (without a single error, really :)), there is an error when testing with ./start_queues.sh:

There was 1 error:
1) BibtexreferencesConversionTest\Model\Queue\BibtexreferencesJobTest::testConversion
Exception: Couldn't find the stage document
/var/www/html/module/BibtexreferencesConversion/src/BibtexreferencesConversion/Model/Queue/Job/BibtexreferencesJob.php:29
/var/www/html/module/BibtexreferencesConversion/test/BibtexreferencesConversionTest/Model/Queue/BibtexreferencesJobTest.php:47
FAILURES!
Tests: 31, Assertions: 95, Errors: 1.

The first ingestion fails completely, but after two or more attempts I see some parsing progress in the documents folder. The output stops at document_metypeset.xml, so one module is not working.

axfelix commented 7 years ago

OK -- can you test whether you can successfully call ParsCit directly?

You should be able to navigate to [xmlps-root]/vendor/PKP/ParsCit/.../citeExtract.pl and call it like so:

./citeExtract.pl -i xml -m extract_all /path/to/document_metypeset.xml

It's possible that perl isn't getting configured correctly, or the ParsCit install is breaking -- sorry you're having trouble with this, we've had to switch where we're pulling in upstream sources several times but it was working the last time I tested.

Vitaliy-1 commented 7 years ago

I see this after executing command:

Die in SectLabel::PreProcess::findHeaderText: start id 0 >= num lines 0

not well-formed (invalid token) at line 3, column 2, byte 31 at /usr/lib/perl5/XML/Parser.pm line 187.
 at /var/www/html/vendor/pkp/ParsCit/bin/../lib/Omni/Omnidoc.pm line 69.

axfelix commented 7 years ago

Hm, that's unusual -- suggests a special character failure, but that wouldn't be happening with our test document, and ParsCit works fine on regular UTF-8 documents in every environment I've tried. I'll look into it, but I've never seen that error before, and our ability to troubleshoot Perl is fairly limited.
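For anyone hitting the same message: Perl's XML::Parser is built on expat, and Python's standard library exposes the same parser, so a short script can confirm whether a file is malformed and report where. This is only a sketch. The function name `find_xml_error` and the two sample inputs are made up for illustration and are not part of ParsCit or the PKP stack.

```python
import xml.parsers.expat


def find_xml_error(data: bytes):
    """Return None if data is well-formed XML, else (line, column offset, message)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk of input
    except xml.parsers.expat.ExpatError as err:
        message = xml.parsers.expat.ErrorString(err.code)
        return err.lineno, err.offset, message
    return None


# Illustrative inputs only: one well-formed document, one with a bare ampersand.
good = find_xml_error(b"<article><p>ok</p></article>")
bad = find_xml_error(b"<article>\n<p>\nx & y</p></article>")
print(good)
print(bad)
```

To check a real file such as document_metypeset.xml, read it in binary mode and pass the bytes to `find_xml_error`.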

Vitaliy-1 commented 7 years ago

That's strange. When I parse this file (and all the others I tried) here: http://pkp-udev.lib.sfu.ca/ everything goes fine, so ParsCit obviously works there. But when I try this on my Ubuntu machine with ParsCit (from pkp or knmnyn), I get the same error.

Vitaliy-1 commented 7 years ago

When I delete line 1 of document_metypeset.xml (<!DOCTYPE article PUBLIC ...>), the message from ParsCit changes to:

Die in SectLabel::PreProcess::findHeaderText: start id 0 >= num lines 0
Die in SectLabel::PreProcess::findHeaderText: start id 0 >= num lines 0
Can't return outside a subroutine at ./citeExtract.pl line 255

axfelix commented 7 years ago

Hi Vitaliy,

Can you test if you can run just citeExtract.pl -m extract_citations on this file? https://raw.githubusercontent.com/pkp/xmlps/master/module/ParsCitConversion/test/assets/references.txt

Vitaliy-1 commented 7 years ago

Yep, all seems to go fine.

Citation text longer than article body: ignoring
Could not open .cite file for writing: Permission denied
Could not open .body file for writing: Permission denied
<?xml version="1.0" encoding="UTF-8"?>
<algorithms version="110505">
<algorithm name="ParsCit" version="110505">
<citationList>
<citation valid="true">
<authors>
<author>E Niedermeyer</author>
<author>F H Lopes da Silva</author>
</authors>
<title>Electroencephalography: Basic principles, clinical applications and related fields, 3rd edition,</title>
<date>1993</date>
<publisher>Wilkins,</publisher>
<location>Lippincott, Williams</location>
<marker>Niedermeyer, Silva, 1993</marker>
<rawString>E. Niedermeyer, F. H. Lopes da Silva. 1993. Electroencephalography: Basic principles, clinical applications and related fields, 3rd edition, Lippincott, Williams &amp; Wilkins, Philadelphia.</rawString>
</citation>
<citation valid="true">
<authors>
<author>H L Atwood</author>
<author>W A MacKay</author>
</authors>
<date>1989</date>
<booktitle>Essentials of neurophysiology, B.C.</booktitle>
<publisher>Decker,</publisher>
<location>Hamilton, Canada.</location>
<marker>Atwood, MacKay, 1989</marker>
<rawString>H. L. Atwood, W. A. MacKay. 1989. Essentials of neurophysiology, B.C. Decker, Hamilton, Canada.</rawString>
</citation>
<citation valid="true">
<authors>
<author>F S Tyner</author>
<author>J R Knott</author>
</authors>
<title>Fundamentals of EEG technology, Volume 1: Basic concepts and methods, Raven press,</title>
<date>1989</date>
<location>New York.</location>
<marker>Tyner, Knott, 1989</marker>
<rawString>F. S. Tyner, J. R.Knott. 1989. Fundamentals of EEG technology, Volume 1: Basic concepts and methods, Raven press, New York.</rawString>
</citation>

and so on...

axfelix commented 7 years ago

OK, then I don't think you're having a problem with ParsCit. It may be the merge module that's causing problems for you... can you increase your log level in config/autoload/local.php and see which module is failing?

Vitaliy-1 commented 7 years ago

Hmm, where should errors be displayed? I do not see anything in the web interface or the Apache error log. In queue_debug.out the last line is: ...cermine for JOB... My local.php looks like this:

<?php

return array(
    'modules' => array(
        'ZendDeveloperTools',
    ),
    'conversion' => array(
        'docx' => array(
            'unoconv' => array(
                'command' => 'unoconv',
            ),
        ),
    ),
    'doctrine' => array(
        'connection' => array(
            'orm_default' => array(
                'params' => array(
                    'user' => 'xxxxxxxx',
                    'password' => 'xxxxxxxx',
                ),
            ),
        ),
    ),
    'log' => array(
        'level' => 4,
    ),
    'view_manager' => array(
        'display_not_found_reason' => true,
        'display_exceptions' => true,
    ),
);

axfelix commented 7 years ago

Turn that log level at the bottom of the file up to 6 or 7, log into the stack as an admin user, and then check the "Log" in the top corner for any failed jobs.

Vitaliy-1 commented 7 years ago

I am sorry, but there is no guide on how to log in as an admin user. Do I need to change something in the database, in the user table for example? For now my 'role' is 'member' and my level is '0'. Setting the role to 'admin' blocks login. Or is there a link for admin login?

axfelix commented 7 years ago

Ah, sorry. Try setting level = 0 and role = "administrator" for the relevant user in the DB.
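The change described above amounts to a single UPDATE statement. The sketch below runs it against an in-memory SQLite database purely so the example is self-contained. The table name `user` and the columns `id`, `role` and `level` are assumptions based on this thread, not a confirmed schema, and the real stack uses its own database connection.

```python
import sqlite3


def promote_to_admin(conn, user_id):
    """Apply the role/level values suggested in the thread; return rows changed."""
    cursor = conn.execute(
        "UPDATE user SET role = ?, level = ? WHERE id = ?",
        ("administrator", 0, user_id),
    )
    conn.commit()
    return cursor.rowcount


# Stand-in schema and row (assumed, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, role TEXT, level INTEGER)")
conn.execute("INSERT INTO user (id, role, level) VALUES (1, 'member', 0)")

changed = promote_to_admin(conn, 1)
row = conn.execute("SELECT role, level FROM user WHERE id = 1").fetchone()
print(changed, row)
```

Checking the returned row count is a cheap guard against updating the wrong user id without noticing.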

Vitaliy-1 commented 7 years ago

All logs from one upload via the web interface:

2017/01/11 20:51:25 127.0.0.1   INFO    Queued job (18) in queue pathfinder
2017/01/11 20:51:25 127.0.0.1   INFO    Processing queue received job (18)
2017/01/11 20:51:25 127.0.0.1   INFO    A new job has been created (18)

from 'job' table

# id, creationDate, status, conversionStage, referenceParsingSuccess, inputFileFormat, citationStyleFile, config, userId
'18', '1484160685', '3', '12', '0', '0', 'vendor/citation-style-language/styles/elsevier-harvard2.csl', 'a:1:{s:7:\"outputs\";a:14:{i:0;i:15;i:1;i:16;i:2;i:14;i:3;i:3;i:4;i:4;i:5;i:6;i:6;i:7;i:7;i:11;i:8;i:8;i:9;i:9;i:10;i:10;i:11;i:17;i:12;i:18;i:13;i:5;}}', '1'

and from 'document' table

# id, path, conversionStage, mimeType, size, jobId
42, var/documents/1/18/upload/mucharska.docx, 15, application/vnd.openxmlformats-officedocument.wordprocessingml.document, 35642, 18
43, var/documents/1/18/document.docx, 1, application/vnd.openxmlformats-officedocument.wordprocessingml.document, 25695, 18
44, var/documents/1/18/document_metypeset.xml, 2, text/html, 114940, 18
45, var/documents/1/18/document_from_wp.pdf, 12, application/pdf, 142297, 18

axfelix commented 7 years ago

Hi Vitaliy,

That still looks like output from log level 4 rather than 7. You should restart the queues (using the start_queues.sh script -- this is part of what the cronjob does) for any changes to the config file to take effect. It's very unlikely that it'd be dying right after pathfinder...
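A quick way to tell whether the higher log level has taken effect is to check whether any DEBUG lines show up in the log at all. The sketch below assumes the line layout seen in the log excerpts in this thread (timestamp, optional client IP, level, message). The regular expression and the function name `summarize_log` are illustrative guesses, not part of the stack.

```python
import re

# Timestamp, optional IPv4 address, level name, then the message text.
LOG_LINE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?:(\d{1,3}(?:\.\d{1,3}){3})\s+)?"
    r"([A-Z]+)\s+(.*)$"
)


def summarize_log(log_text):
    """Return (set of level names seen, list of job ids reported as failed)."""
    levels = set()
    failed_jobs = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match is None:
            continue  # continuation lines from tool output carry no timestamp
        level, message = match.group(3), match.group(4)
        levels.add(level)
        failed = re.match(r"Job (\d+) failed", message)
        if failed:
            failed_jobs.append(int(failed.group(1)))
    return levels, failed_jobs


sample = """2017/01/11 21:12:23     INFO    Job 19 failed.
2017/01/11 21:12:23     DEBUG   Unoconf output:
Verbosity set to level 3
2017/01/11 21:12:19 127.0.0.1   INFO    Queued job (19) in queue pathfinder
"""
levels, failed_jobs = summarize_log(sample)
print(sorted(levels), failed_jobs)
```

If the set of levels contains only INFO after the queues have been restarted, the new level has probably not been picked up.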

Vitaliy-1 commented 7 years ago

That's it :)

Additional info from logs:

2017/01/11 21:12:23     INFO    Job 19 failed.
2017/01/11 21:12:23     DEBUG   Unoconf output:
Verbosity set to level 3
DEBUG: Connection type: socket,host=127.0.0.1,port=2002;urp;StarOffice.ComponentContext
DEBUG: Existing listener not found.
DEBUG: Launching our own listener using /usr/lib/libreoffice/program/soffice.bin.
DEBUG: Process /usr/lib/libreoffice/program/soffice.bin (pid=5318) exited with 81.
Error: Unable to connect or start own listener. Aborting.
Using office base path: /usr/lib/libreoffice
Using office binary path: /usr/lib/libreoffice/program
LibreOffice listener successfully started. (pid=5318)
2017/01/11 21:12:21     DEBUG   Unoconv is executing:
unoconv -vvv -f 'docx7' -o 'var/documents/1/19/document.docx' 'var/documents/1/19/upload/mucharska.docx' 2>&1
2017/01/11 21:12:20     INFO    Queued job (19) in queue docx
2017/01/11 21:12:19 127.0.0.1   INFO    Queued job (19) in queue pathfinder
2017/01/11 21:12:19 127.0.0.1   INFO    Processing queue received job (19)
2017/01/11 21:12:19 127.0.0.1   INFO    A new job has been created (19)

axfelix commented 7 years ago

Aha, much better -- OK, looks like you're having unoconv fail rather than ParsCit.

Make sure you've restarted the queues as the webserver user -- you've had LibreOffice die in the background, but that could just be the first-run issue if that's the first time it was running as your own user. Also, make sure you can successfully run that unoconv command on the same document on its own, and that you've installed all the requirements from the vagrant shell script.
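To rerun the conversion outside the queue, the command from the debug log can be rebuilt by hand. This is a minimal Python sketch, assuming unoconv is on the PATH and using only the flags and paths that appear in the log above. `build_unoconv_command` and `run_unoconv` are hypothetical helper names.

```python
import shlex
import subprocess


def build_unoconv_command(source, target, fmt="docx7"):
    """Build the argument list that the debug log shows for the docx stage."""
    return ["unoconv", "-vvv", "-f", fmt, "-o", target, source]


def run_unoconv(source, target, fmt="docx7"):
    """Run unoconv and return (exit code, combined stdout and stderr)."""
    result = subprocess.run(
        build_unoconv_command(source, target, fmt),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # mirrors the 2>&1 in the logged command
        text=True,
    )
    return result.returncode, result.stdout


# Only build and print the command here; nothing is executed.
command = build_unoconv_command(
    "var/documents/1/19/upload/mucharska.docx",
    "var/documents/1/19/document.docx",
)
print(shlex.join(command))
```

Calling `run_unoconv(...)` once as your own user and once as the webserver user should show whether the failure depends on which account starts LibreOffice.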

Vitaliy-1 commented 7 years ago

Hmm, additional run:

2017/01/11 21:27:14     INFO    Job 20 failed.
2017/01/11 21:27:14     DEBUG   Couldn't load command output xml. LIBXML error:
Document is empty
2017/01/11 21:27:14     DEBUG   CERMINE output:
Error: Could not find or load main class pl.edu.icm.cermine.PdfNLMContentExtractor
2017/01/11 21:27:13     DEBUG   CERMINE is executing:
java -cp 'vendor/CeON/CERMINE/cermine-impl-1.8-jar-with-dependencies.jar' 'pl.edu.icm.cermine.PdfNLMContentExtractor' -path 'var/documents/1/20/document_from_wp.pdf' 2>&1 >var/documents/1/20/document_from_pdf.xml
2017/01/11 21:27:13     INFO    Starting CERMINE-based extraction.
2017/01/11 21:27:13     INFO    Queued job (20) in queue cermine
2017/01/11 21:27:13     DEBUG   Unoconf output:
Verbosity set to level 3
DEBUG: Connection type: socket,host=127.0.0.1,port=2002;urp;StarOffice.ComponentContext
DEBUG: Existing listener not found.
DEBUG: Launching our own listener using /usr/lib/libreoffice/program/soffice.bin.
Input file: var/documents/1/20/upload/mucharska.docx
DEBUG: Terminating LibreOffice instance.
DEBUG: Waiting for LibreOffice instance to exit.
Using office base path: /usr/lib/libreoffice
Using office binary path: /usr/lib/libreoffice/program
LibreOffice listener successfully started. (pid=5474)
Selected output format: Portable Document Format [.pdf]
Selected office filter: writer_pdf_Export
Used doctype: document
Output file: var/documents/1/20/document_from_wp.pdf
LibreOffice instance unsuccessfully closed, sending TERM signal.
2017/01/11 21:27:11     DEBUG   Unoconv is executing:
unoconv -vvv -f 'pdf' -o 'var/documents/1/20/document_from_wp.pdf' 'var/documents/1/20/upload/mucharska.docx' 2>&1
2017/01/11 21:27:10     INFO    Queued job (20) in queue wppdf
2017/01/11 21:27:10     DEBUG   meTypeset output:
2017/01/11 21:26:56     DEBUG   meTypeset is executing:
export PYTHONIOENCODING=UTF-8; HOME=/tmp vendor/MartinPaulEve/meTypeset/bin/meTypeset.py -d --nogit 'docx' 'var/documents/1/20/document.docx' 'var/documents/1/20/metypeset' 2>&1 >/dev/null
2017/01/11 21:26:55     INFO    Queued job (20) in queue nlmxml
2017/01/11 21:26:55     DEBUG   Unoconf output:
Verbosity set to level 3
DEBUG: Connection type: socket,host=127.0.0.1,port=2002;urp;StarOffice.ComponentContext
DEBUG: Existing listener not found.
DEBUG: Launching our own listener using /usr/lib/libreoffice/program/soffice.bin.
Input file: var/documents/1/20/upload/mucharska.docx
DEBUG: Terminating LibreOffice instance.
DEBUG: Waiting for LibreOffice instance to exit.
Using office base path: /usr/lib/libreoffice
Using office binary path: /usr/lib/libreoffice/program
LibreOffice listener successfully started. (pid=5412)
Selected output format: Microsoft Office Open XML [.docx]
Selected office filter: MS Word 2007 XML
Used doctype: document
Output file: var/documents/1/20/document.docx
LibreOffice instance unsuccessfully closed, sending TERM signal.
2017/01/11 21:26:53     DEBUG   Unoconv is executing:
unoconv -vvv -f 'docx7' -o 'var/documents/1/20/document.docx' 'var/documents/1/20/upload/mucharska.docx' 2>&1
2017/01/11 21:26:52     INFO    Queued job (20) in queue docx
2017/01/11 21:26:52 127.0.0.1   INFO    Queued job (20) in queue pathfinder
2017/01/11 21:26:52 127.0.0.1   INFO    Processing queue received job (20)
2017/01/11 21:26:52 127.0.0.1   INFO    A new job has been created (20)

axfelix commented 7 years ago

Hm. It looks like your updates to the composer file may have broken something, leading to Cermine not being installed correctly. I just verified that the link to the build at http://maven.ceon.pl/artifactory/simple/kdd-releases/pl/edu/icm/cermine/cermine-impl/1.8/cermine-impl-1.8-jar-with-dependencies.jar still works...

Vitaliy-1 commented 7 years ago

I am not an expert in Java, but I have this jar file in the /var/html/vendor/CeON/CERMINE directory. I also downloaded it from the link and ran this command on a PDF file in the same directory:

java -cp cermine-impl-1.8-jar-with-dependencies.jar pl.edu.icm.cermine.ContentExtractor -path 2.pdf

It returned an error:

Error: Could not find or load main class pl.edu.icm.cermine.ContentExtractor
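A "Could not find or load main class" error can mean either that the jar on the classpath is unreadable or that it does not contain the named class. A jar is a zip archive, so Python's zipfile module can tell the two cases apart. This is a sketch only: `jar_has_class` and `check_jar` are hypothetical helper names, and nothing here asserts which classes a given CERMINE release actually ships.

```python
import zipfile


def jar_has_class(jar_path, class_name):
    """Return True if the jar holds the compiled class for a dotted class name."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


def check_jar(jar_path, class_name):
    """Classify the likely cause of a 'could not find or load main class' error."""
    if not zipfile.is_zipfile(jar_path):
        return "not a readable jar (missing, truncated, or not a zip archive)"
    if jar_has_class(jar_path, class_name):
        return "class found"
    return "class not in this jar"
```

For example, `check_jar("cermine-impl-1.8-jar-with-dependencies.jar", "pl.edu.icm.cermine.ContentExtractor")` reports which of the three cases applies to the downloaded file.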

Vitaliy-1 commented 7 years ago

Aha. The good news is that it requires a path to a directory, not to a file. The bad news is that version 1.8 still does not work, but 1.11 does. Should I replace it in the web directory?

axfelix commented 7 years ago

Yeah, I need to update the version of the Cermine used by the stack anyhow -- it's on my list of things to do the next time I get a chance to work on this project, hopefully next week.

Vitaliy-1 commented 7 years ago

Thanks!

After upgrading CERMINE (and updating global.php accordingly), the document was finally parsed :)

axfelix commented 7 years ago

Great! Thanks for being so patient about troubleshooting. I'll make sure I make the necessary changes to the composer file upstream asap.

akm479 commented 4 years ago

I am not expert in Java, but I have this jar file in /var/html/vendor/CeON/CERMINE directory. I have also downloaded it from the link and used this command upon pdf file in same directory: java -cp cermine-impl-1.8-jar-with-dependencies.jar pl.edu.icm.cermine.ContentExtractor -path 2.pdf Which returned an error: Error: Could not find or load main class pl.edu.icm.cermine.ContentExtractor

Hey, I am seeing the same issue. Can you please tell me how you resolved it?