pangaea-data-publisher / fuji

FAIRsFAIR Research Data Object Assessment Service
MIT License

Failing to run fuji in a container #125

Closed livenson closed 3 years ago

livenson commented 3 years ago

Hi,

After upgrading to the latest master, I started seeing errors when running in a pure Python Docker container:

2021-01-19 13:26:42,341 - werkzeug - INFO - 192.168.144.4 - - [19/Jan/2021 13:26:42] "POST /fuji/api/v1/evaluate HTTP/1.1" 500 -
2021-01-19 13:26:56,732 - tika.tika - ERROR - Unable to run java; is it installed?
/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'dataverse.no'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
 warnings.warn(
2021-01-19 13:26:56,732 [Thread-85  ] [ERROR] Unable to run java; is it installed?
2021-01-19 13:26:56,734 - tika.tika - ERROR - Failed to receive startup confirmation from startServer.
2021-01-19 13:26:56,734 [Thread-85  ] [ERROR] Failed to receive startup confirmation from startServer.
2021-01-19 13:26:56,755 - werkzeug - INFO - 192.168.144.4 - - [19/Jan/2021 13:26:56] "POST /fuji/api/v1/evaluate HTTP/1.1" 500 -

It seems that, for some reason, it started triggering a download of tika-server!

I've added a JRE environment to the Docker image, but I cannot get tika-server to start properly, and the log messages are scarce:

/tmp # cat tika-server.log 
Jan 19, 2021 2:50:35 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jan 19, 2021 2:50:35 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
INFO  Starting Apache Tika 1.24 server
INFO  Setting the server's publish address to be http://localhost:9998/
INFO  Logging initialized @2072ms to org.eclipse.jetty.util.log.Slf4jLog
INFO  jetty-9.4.24.v20191120; built: 2019-11-20T21:37:49.771Z; git: 363d5f2df3a8a28de40604320230664b9c793c16; jvm 1.8.0_252-b09
INFO  Started ServerConnector@4ddbbdf8{HTTP/1.1,[http/1.1]}{localhost:9998}
INFO  Started @2238ms
WARN  Empty contextPath
INFO  Started o.e.j.s.h.ContextHandler@5f354bcf{/,null,AVAILABLE}
INFO  Started Apache Tika server at http://localhost:9998/
INFO  JVM Runtime does not support Modules
INFO  rmeta/text (autodetecting type)
INFO  rmeta/text (autodetecting type)
INFO  rmeta/text (autodetecting type)

/tmp # cat tika.log 
2021-01-19 14:50:28,581 [Thread-10   ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /tmp/tika-server.jar.
2021-01-19 14:50:33,641 [Thread-10   ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to /tmp/tika-server.jar.md5.
2021-01-19 14:50:34,700 [Thread-10   ] [WARNI]  Failed to see startup log message; retrying...

What am I doing wrong? Ideally, I would prefer the server to fetch all dependencies while the container is being built, not at runtime.
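A minimal sketch of fetching the jar at build time rather than at runtime. The URL is the one shown in the tika.log excerpt above. The function names and the `/tmp` target directory are hypothetical, and the idea that tika-python reuses an already cached jar whose checksum matches the `.md5` file is an assumption, not something confirmed in this thread.

```python
import hashlib
import urllib.request
from pathlib import Path

# URL copied from the tika.log excerpt; everything else here is illustrative.
TIKA_JAR_URL = (
    "http://search.maven.org/remotecontent"
    "?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar"
)


def md5_of(path):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def prefetch_tika(target_dir="/tmp"):
    """Download the jar and its .md5 file, then compare the two checksums."""
    target = Path(target_dir)
    jar = target / "tika-server.jar"
    md5_file = target / "tika-server.jar.md5"
    urllib.request.urlretrieve(TIKA_JAR_URL, str(jar))
    urllib.request.urlretrieve(TIKA_JAR_URL + ".md5", str(md5_file))
    expected = md5_file.read_text().split()[0].lower()
    if md5_of(jar) != expected:
        raise RuntimeError("tika-server.jar checksum mismatch")
    return jar
```

Calling `prefetch_tika()` from a script executed in a `RUN` step of the Dockerfile would place both files in the image, so nothing would need to be downloaded when the container starts.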

huberrob commented 3 years ago

From these log messages, it seems that your TIKA environment variables are not set correctly. As far as I know, you need to set TIKA_SERVER_JAR, which should point to your local Tika jar file.
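A minimal sketch of that suggestion in Python. TIKA_SERVER_JAR is the variable named above. TIKA_PATH and TIKA_LOG_PATH are further variables that, as far as I recall, the tika-python README documents, so treat them as assumptions. The helper names and the jar location are hypothetical.

```python
import os
from pathlib import PurePosixPath


def tika_env(jar_path, work_dir="/tmp"):
    """Build environment settings that point tika-python at a local jar."""
    jar = PurePosixPath(jar_path)
    if not jar.is_absolute():
        raise ValueError("jar_path must be an absolute path")
    return {
        # A file:// URL for the local jar (assumed to be an accepted form).
        "TIKA_SERVER_JAR": "file://" + str(jar),
        # Assumed variables: cache directory and log directory.
        "TIKA_PATH": work_dir,
        "TIKA_LOG_PATH": work_dir,
    }


def apply_tika_env(jar_path, work_dir="/tmp"):
    """Export the settings into the current process environment."""
    os.environ.update(tika_env(jar_path, work_dir))
```

The settings probably need to be exported before `tika` is first imported, because the package appears to read them at import time. Setting the same variables with `ENV` in the Dockerfile would avoid that ordering concern.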


livenson commented 3 years ago

2021-01-19 14:50:28,581 [Thread-10 ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /tmp/tika-server.jar.
2021-01-19 14:50:33,641 [Thread-10 ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to /tmp/tika-server.jar.md5.

It seems that this file is downloaded in the background. Should I simply set TIKA_SERVER_JAR to point to that location, then? I will try it out.

huberrob commented 3 years ago

Any news on this? Could you solve the issue by setting the environment variable?

livenson commented 3 years ago

Sorry, I forgot to update: no, it is still the same issue. That variable is an option if an external Tika server location is used, but the actual issue is that the server doesn't launch (any more?) for some reason.

The Dockerfile I'm trying is below. It used to work, also with openjdk8.
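One way to tell "the server never launched" apart from "the server is up but the request failed" is to probe the port directly. This is only a sketch: the host and port come from the tika-server.log excerpt earlier in the thread (localhost:9998), and the function name is hypothetical.

```python
import socket


def port_open(host="localhost", port=9998, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

Running `port_open()` inside the container after a failed request would show whether anything is listening on the Tika port.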

FROM python:3.8-alpine

RUN apk add --update \
    g++ \
    gcc \
    libffi-dev \
    openssl-dev \
    python3-dev \
    libxslt-dev \
    libc-dev \
    libxml2-dev \
    build-base \
    openjdk8-jre \
  && rm -rf /var/cache/apk/*

# set the working directory in the container
WORKDIR /code

# copy the dependencies file to the working directory
COPY requirements.txt .

# install dependencies
RUN pip install -r requirements.txt

# copy the content of the local src directory to the working directory
COPY fuji_server ./fuji_server

EXPOSE 1071

# command to run on container start
CMD [ "python3", "-m", "fuji_server", "-c", "fuji_server/config/server.ini" ]

livenson commented 3 years ago

OK, that might have been a misleading symptom. I noticed that a 500 is also returned for incorrect(?) user input.

When querying via Swagger, I also get a 500, but with a better error message:

{
  "detail": "True is not of type 'string'\n\nFailed validating 'type' in schema['properties']['request']['additionalProperties']:\n    {'type': 'string'}\n\nOn instance['request']['use_datacite']:\n    True",
  "status": 500,
  "title": "Response body does not conform to specification",
  "type": "about:blank"
}

Does it look familiar?
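To make the message easier to read: the response body apparently echoes the request, and the response schema seems to require every undeclared ("additional") property of `request` to be a string, so the boolean `use_datacite` fails validation. Below is a plain-Python sketch of that rule, not the validator the service actually uses. The set of declared property names is purely an assumption for illustration.

```python
def non_string_extras(payload, declared):
    """Return undeclared properties whose values are not strings."""
    return {
        key: value
        for key, value in payload.items()
        if key not in declared and not isinstance(value, str)
    }


# Hypothetical: properties the response schema is assumed to declare explicitly.
DECLARED = {"object_identifier", "test_debug", "oaipmh_endpoint"}

echoed_request = {
    "oaipmh_endpoint": "",
    "object_identifier": "https://doi.org/10.1594/PANGAEA.908011",
    "test_debug": True,
    "use_datacite": True,
}

print(non_string_extras(echoed_request, DECLARED))  # {'use_datacite': True}
```

Under that assumption, only `use_datacite` is reported, which matches the field named in the error detail.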

Input:

curl -X POST "https://fair.etais.ee/fuji/api/v1/evaluate" -H "accept: application/json" -H "Authorization: Basic XXX" -H "Content-Type: application/json" -d "{\"oaipmh_endpoint\":\"\",\"object_identifier\":\"https://doi.org/10.1594/PANGAEA.908011\",\"test_debug\":true,\"use_datacite\":true}"
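For reference, a rough Python equivalent of the curl call above, using only the standard library. The endpoint path, headers, and payload are taken from that command. The function name and the credentials are placeholders.

```python
import base64
import json
import urllib.request


def build_evaluate_request(base_url, identifier, user, password):
    """Build (but do not send) the POST request for /fuji/api/v1/evaluate."""
    payload = {
        "oaipmh_endpoint": "",
        "object_identifier": identifier,
        "test_debug": True,
        "use_datacite": True,
    }
    # HTTP Basic auth: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        base_url.rstrip("/") + "/fuji/api/v1/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/json",
            "Authorization": "Basic " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. A 500 response surfaces there as `urllib.error.HTTPError`, whose body can be read to see the same JSON error detail.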

livenson commented 3 years ago

OK, it seems that the issue was fixed by the latest commits; upgrading to the latest version fixed it.