ipfs-search / ipfs-tika

Java web application taking IPFS hashes, extracting (textual) content and metadata through Apache's Tika.
GNU Affero General Public License v3.0

Fix language detection #6

Closed: dokterbob closed this issue 5 years ago

dokterbob commented 5 years ago

https://github.com/ipfs-search/ipfs-tika/blob/master/src/main/java/com/ipfssearch/ipfstika/App.java#L142

LastExile16 commented 5 years ago

What is the problem with language detection?

dokterbob commented 5 years ago

I don’t know exactly; ipfs-tika crashes.

You should be able to replicate it by uncommenting the related code, compiling it and then having it analyse some file.

Any feedback you can produce (including tracebacks) is much appreciated!

LastExile16 commented 5 years ago

With ipfs-tika listening on localhost port 8081, the server receives automatic requests from the ipfs daemon with path=/, i.e. requests to localhost:8081/, and each of them results in IOException: internal server error. I changed the port to an arbitrary other one, 8090, and those requests stopped.

I used a Python script to query ipfs-search and push the returned hashes into ipfs-tika. In addition, I added a 30-second timeout to the Python requests, because some requests take forever.
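
A minimal sketch of what such a requester could look like, not the actual autorequestTika.py. The search URL is the one shown in the requester output in this comment; the ipfs-tika endpoint layout (`/ipfs/<hash>` on port 8090) and the `hits`/`hash` fields of the search response are assumptions. The `fetch` parameter is only there so the loop can be exercised without a network.

```python
import json
import socket
import urllib.error
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.ipfs-search.com/v1/search"  # taken from the requester output
TIKA_URL = "http://localhost:8090/ipfs/"              # assumed ipfs-tika endpoint layout
TIMEOUT_SECONDS = 30                                  # the 30-second timeout mentioned above


def http_get_json(url, timeout):
    """Fetch a URL and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def analyse_page(page, query="this", fetch=http_get_json):
    """Push every hash from one search result page into ipfs-tika.

    Returns a list of (hash, outcome) pairs. The outcome is the decoded
    JSON from ipfs-tika, or a short string when the request failed.
    """
    params = urllib.parse.urlencode({"q": query, "page": page, "_type": "file"})
    search = fetch(SEARCH_URL + "?" + params, TIMEOUT_SECONDS)
    results = []
    for hit in search.get("hits", []):   # assumed response shape
        ipfs_hash = hit["hash"]          # assumed field name
        try:
            outcome = fetch(TIKA_URL + ipfs_hash, TIMEOUT_SECONDS)
        except socket.timeout:
            # The read timed out; ipfs-tika may still be working on it.
            outcome = "request timeout..."
        except urllib.error.URLError as error:
            # Connect timeouts arrive wrapped in URLError; other failures
            # (refused connection, HTTP 500) are reported as errors.
            timed_out = isinstance(error.reason, socket.timeout)
            outcome = "request timeout..." if timed_out else "error: %s" % error.reason
        results.append((ipfs_hash, outcome))
    return results
```

Calling `analyse_page(0)` would then walk the first page of results; note that a timeout here only means the client gave up, not that ipfs-tika stopped processing the hash.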

I tested ipfs-tika with tika-1.19 on 20 pages of hashes without experiencing any crashes.

One thing to note:

Even though SOCKET_READ_TIMEOUT is set, NanoHTTPD doesn't interrupt the request; it keeps processing it. After some time the response is ready and is sent back to the Python client, which has already closed the connection. This results in:

fi.iki.elonen.NanoHTTPD$Response send
SEVERE: Could not send response to the client
...

For example, QmT4f6M5mHkMvEGhKW824hkLkJ25XTwQ3ZasWQbuKHocXB was requested on page 1, but by the time the response came back, Python was already requesting page 9.

So all the timed-out requests you see below did eventually return in ipfs-tika, either with the proper result or with a FileNotFoundException.

Internal server error:
java.io.FileNotFoundException: http://localhost:8080/ipfs/Qma9uWJNVfQ18TGTLNdaVNtmMomVJwSEatGZQvrEPDDu77
Dec 19, 2018 6:45:56 PM fi.iki.elonen.NanoHTTPD$Response send
SEVERE: Could not send response to the client
java.net.SocketException: Broken pipe (Write failed)
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:143)
    at fi.iki.elonen.NanoHTTPD$Response$ChunkedOutputStream.write(NanoHTTPD.java:1259)
    at fi.iki.elonen.NanoHTTPD$Response$ChunkedOutputStream.write(NanoHTTPD.java:1252)
    at java.util.zip.GZIPOutputStream.writeHeader(GZIPOutputStream.java:182)
    at java.util.zip.GZIPOutputStream.<init>(GZIPOutputStream.java:94)
    at java.util.zip.GZIPOutputStream.<init>(GZIPOutputStream.java:109)
    at fi.iki.elonen.NanoHTTPD$Response.sendBodyWithCorrectEncoding(NanoHTTPD.java:1449)
    at fi.iki.elonen.NanoHTTPD$Response.sendBodyWithCorrectTransferAndEncoding(NanoHTTPD.java:1440)
    at fi.iki.elonen.NanoHTTPD$Response.send(NanoHTTPD.java:1429)
    at fi.iki.elonen.NanoHTTPD$HTTPSession.execute(NanoHTTPD.java:852)
    at fi.iki.elonen.NanoHTTPD$ClientHandler.run(NanoHTTPD.java:189)
    at java.lang.Thread.run(Thread.java:748)
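
The behaviour behind this trace can be reproduced without NanoHTTPD. The following is a small self-contained sketch using Python's standard `http.server` as a stand-in for ipfs-tika (all names here are hypothetical): the client gives up after its timeout, but the handler keeps working and only notices the closed connection, if at all, when it tries to write the reply.

```python
import http.server
import threading
import time
import urllib.request


class SlowHandler(http.server.BaseHTTPRequestHandler):
    """Stand-in for a slow extraction: works for a while, then replies."""

    work_seconds = 0.5
    finished = threading.Event()     # set once the work itself is done
    send_failed = threading.Event()  # set if writing the reply raised

    def do_GET(self):
        time.sleep(self.work_seconds)  # the client's timeout does not interrupt this
        self.finished.set()
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"late result\n")
        except OSError:
            # For example BrokenPipeError when the client has already gone.
            self.send_failed.set()

    def log_message(self, *args):
        pass  # keep the demo quiet


def run_demo(client_timeout=0.1):
    """Return (client_timed_out, server_finished_work)."""
    SlowHandler.finished.clear()
    SlowHandler.send_failed.clear()
    server = http.server.HTTPServer(("127.0.0.1", 0), SlowHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    url = "http://127.0.0.1:%d/" % server.server_address[1]
    client_timed_out = False
    try:
        urllib.request.urlopen(url, timeout=client_timeout).read()
    except OSError:  # socket.timeout and URLError are both OSError subclasses
        client_timed_out = True
    # The client has given up; the server is still busy with the request.
    server_finished = SlowHandler.finished.wait(timeout=5)
    server.shutdown()
    server.server_close()
    return client_timed_out, server_finished


if __name__ == "__main__":
    print(run_demo())
```

Whether the late write actually fails with a broken pipe depends on the operating system and on how much data is written, so the sketch only records it in `send_failed` rather than relying on it.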

Result:

  1. Java file console:
    
    ipfs-tika accepting requests at: http://localhost:8090/ 

    Fetching: http://localhost:8080/ipfs/QmNsYTnm132vXQ4FDZAH9qcdg9hB7sKHVpubvAVQjreeBN
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/sehome/home/nawras/.m2/repository/org/apache/tika/tika-app/1.19.1/tika-app-1.19.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/sehome/home/nawras/.m2/repository/org/slf4j/slf4j-log4j12/1.7.25/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    Dec 19, 2018 4:50:45 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
    WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed. See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
    Dec 19, 2018 4:50:45 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
    WARNING: org.xerial's sqlite-jdbc is not loaded. Please provide the jar on your classpath to parse sqlite files. See tika-parsers/pom.xml for the correct version.
    Parsing: http://localhost:8080/ipfs/QmNsYTnm132vXQ4FDZAH9qcdg9hB7sKHVpubvAVQjreeBN (QmNsYTnm132vXQ4FDZAH9qcdg9hB7sKHVpubvAVQjreeBN)
    Language name :: NONE (0.000000)

    Fetching: http://localhost:8080/ipfs/QmUgEm6MJx5XN98qnYz4RaSgzaNYnqPd1dDXRiNvY1G7m6
    Parsing: http://localhost:8080/ipfs/QmUgEm6MJx5XN98qnYz4RaSgzaNYnqPd1dDXRiNvY1G7m6 (QmUgEm6MJx5XN98qnYz4RaSgzaNYnqPd1dDXRiNvY1G7m6)
    Language name :en: HIGH (0.999995)

    Fetching: http://localhost:8080/ipfs/QmdmRUBm5KaC5ijJmWQi9BmbUCCuBapeaqQdDPgw93s39V
    Parsing: http://localhost:8080/ipfs/QmdmRUBm5KaC5ijJmWQi9BmbUCCuBapeaqQdDPgw93s39V (QmdmRUBm5KaC5ijJmWQi9BmbUCCuBapeaqQdDPgw93s39V)
    Language name :en: HIGH (0.999995)

    Fetching: http://localhost:8080/ipfs/QmWQFDGTojPsr3hKnKSRE5VwJkcqAUaQZPPGpnmhBHgGg9
    Parsing: http://localhost:8080/ipfs/QmWQFDGTojPsr3hKnKSRE5VwJkcqAUaQZPPGpnmhBHgGg9 (QmWQFDGTojPsr3hKnKSRE5VwJkcqAUaQZPPGpnmhBHgGg9)
    Language name :en: HIGH (0.999994)

    Fetching: http://localhost:8080/ipfs/QmRHoxNNRVeRpcTvezmqnifEvz2YtnaZDv77x5HJxnwocT
    Parsing: http://localhost:8080/ipfs/QmRHoxNNRVeRpcTvezmqnifEvz2YtnaZDv77x5HJxnwocT (QmRHoxNNRVeRpcTvezmqnifEvz2YtnaZDv77x5HJxnwocT)
    Language name :en: HIGH (0.999995)

    Fetching: http://localhost:8080/ipfs/QmYmrzYVU1yirGiiceJ1DZYQgz1Q67w245Pj2vY1cuRhra
    Parsing: http://localhost:8080/ipfs/QmYmrzYVU1yirGiiceJ1DZYQgz1Q67w245Pj2vY1cuRhra (QmYmrzYVU1yirGiiceJ1DZYQgz1Q67w245Pj2vY1cuRhra)
    Language name :: NONE (0.000000)

    Fetching: http://localhost:8080/ipfs/Qmcoq8W48UBiEbEyCwyxg3s8a2uakWcAGzcR4wsxP1J2pD
    Parsing: http://localhost:8080/ipfs/Qmcoq8W48UBiEbEyCwyxg3s8a2uakWcAGzcR4wsxP1J2pD (Qmcoq8W48UBiEbEyCwyxg3s8a2uakWcAGzcR4wsxP1J2pD)
    Language name :en: HIGH (0.999995)

    Fetching: http://localhost:8080/ipfs/QmYrkVJ2gJnzTnxEXhMtwnT6ozc58ViSFeyPbqojt2X6xE
    Parsing: http://localhost:8080/ipfs/QmYrkVJ2gJnzTnxEXhMtwnT6ozc58ViSFeyPbqojt2X6xE (QmYrkVJ2gJnzTnxEXhMtwnT6ozc58ViSFeyPbqojt2X6xE)
    Language name :en: HIGH (0.999995)

    Fetching: http://localhost:8080/ipfs/QmYCC4Y3rbuWGAayu7wap3JsgrX6puZvfCLT16JTmCWshB
    Parsing: http://localhost:8080/ipfs/QmYCC4Y3rbuWGAayu7wap3JsgrX6puZvfCLT16JTmCWshB (QmYCC4Y3rbuWGAayu7wap3JsgrX6puZvfCLT16JTmCWshB)
    Language name :en: HIGH (0.999996)

    Fetching: http://localhost:8080/ipfs/QmQhfDiZwvuLf7pJ4psJMBDwTenY2dpNHmMF3jURnRZo1C
    Parsing: http://localhost:8080/ipfs/QmQhfDiZwvuLf7pJ4psJMBDwTenY2dpNHmMF3jURnRZo1C (QmQhfDiZwvuLf7pJ4psJMBDwTenY2dpNHmMF3jURnRZo1C)
    Language name :en: MEDIUM (0.857144)

    Fetching: http://localhost:8080/ipfs/QmUZsxQgRtftgBVzsVgwwzbqxKpzc1BWWrdUTocL8HutTG
    Parsing: http://localhost:8080/ipfs/QmUZsxQgRtftgBVzsVgwwzbqxKpzc1BWWrdUTocL8HutTG (QmUZsxQgRtftgBVzsVgwwzbqxKpzc1BWWrdUTocL8HutTG)
    Language name :: NONE (0.000000)

    Fetching: http://localhost:8080/ipfs/QmYX11HsNdFiQwHs4ZDHDE8xYijXig4vvEufnzAw2adD5W
    Parsing: http://localhost:8080/ipfs/QmYX11HsNdFiQwHs4ZDHDE8xYijXig4vvEufnzAw2adD5W (QmYX11HsNdFiQwHs4ZDHDE8xYijXig4vvEufnzAw2adD5W)
    Language name :en: HIGH (0.999996)

    Fetching: http://localhost:8080/ipfs/Qmcw9mWH8YQJVoAbeU9uVMnpbWwDJwyUiRjcgZP7pNr53M

    Fetching: http://localhost:8080/ipfs/QmVFnUPr8M49AHqprnK5ca5LdE2tJFmStJ1id9BBxHdAof
    Parsing: http://localhost:8080/ipfs/QmVFnUPr8M49AHqprnK5ca5LdE2tJFmStJ1id9BBxHdAof (QmVFnUPr8M49AHqprnK5ca5LdE2tJFmStJ1id9BBxHdAof)
    Language name :: NONE (0.000000)
    . . .


2. Requester output:

$ python3 autorequestTika.py

requesting: https://api.ipfs-search.com/v1/search?q=this&page=0&_type=file

QmNsYTnm132vXQ4FDZAH9qcdg9hB7sKHVpubvAVQjreeBN 200 OK Lang from JSON Result: ": NONE (0.000000)"

QmUgEm6MJx5XN98qnYz4RaSgzaNYnqPd1dDXRiNvY1G7m6 200 OK Lang from JSON Result: "en: HIGH (0.999995)"

QmdmRUBm5KaC5ijJmWQi9BmbUCCuBapeaqQdDPgw93s39V 200 OK Lang from JSON Result: "en: HIGH (0.999995)"

QmWQFDGTojPsr3hKnKSRE5VwJkcqAUaQZPPGpnmhBHgGg9 200 OK Lang from JSON Result: "en: HIGH (0.999994)"

QmRHoxNNRVeRpcTvezmqnifEvz2YtnaZDv77x5HJxnwocT 200 OK Lang from JSON Result: "en: HIGH (0.999995)"

QmYmrzYVU1yirGiiceJ1DZYQgz1Q67w245Pj2vY1cuRhra 200 OK Lang from JSON Result: ": NONE (0.000000)"

Qmcoq8W48UBiEbEyCwyxg3s8a2uakWcAGzcR4wsxP1J2pD 200 OK Lang from JSON Result: "en: HIGH (0.999995)"

QmYrkVJ2gJnzTnxEXhMtwnT6ozc58ViSFeyPbqojt2X6xE 200 OK Lang from JSON Result: "en: HIGH (0.999995)"

QmYCC4Y3rbuWGAayu7wap3JsgrX6puZvfCLT16JTmCWshB 200 OK Lang from JSON Result: "en: HIGH (0.999996)"

QmQhfDiZwvuLf7pJ4psJMBDwTenY2dpNHmMF3jURnRZo1C 200 OK Lang from JSON Result: "en: MEDIUM (0.857144)"

QmUZsxQgRtftgBVzsVgwwzbqxKpzc1BWWrdUTocL8HutTG 200 OK Lang from JSON Result: ": NONE (0.000000)"

QmYX11HsNdFiQwHs4ZDHDE8xYijXig4vvEufnzAw2adD5W 200 OK Lang from JSON Result: "en: HIGH (0.999996)"

Qmcw9mWH8YQJVoAbeU9uVMnpbWwDJwyUiRjcgZP7pNr53M request timeout...

QmVFnUPr8M49AHqprnK5ca5LdE2tJFmStJ1id9BBxHdAof 200 OK Lang from JSON Result: ": NONE (0.000000)" . . .
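
The language strings in the requester output all have the shape `code: CONFIDENCE (score)`, with an empty code when nothing was detected. A small sketch of how a client could split them apart; `parse_language` is a hypothetical helper, not part of ipfs-tika or the requester script, and it assumes the format stays exactly as printed above.

```python
import re

# Matches strings such as "en: HIGH (0.999995)" and ": NONE (0.000000)".
LANGUAGE_RESULT = re.compile(
    r"^(?P<code>[^:]*): (?P<confidence>[A-Z]+) \((?P<score>[0-9.]+)\)$"
)


def parse_language(result):
    """Split a language result string into (code, confidence, score).

    The code is None when the detector reported no language.
    """
    match = LANGUAGE_RESULT.match(result)
    if match is None:
        raise ValueError("unexpected language result: %r" % result)
    code = match.group("code") or None
    return code, match.group("confidence"), float(match.group("score"))
```

For instance, `parse_language("en: MEDIUM (0.857144)")` yields the language code, the confidence label and the score as a float, which makes it easy to filter out the `NONE` results.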

dokterbob commented 5 years ago

Great work! Will take this into account on the next update.

I already have a patch here for configuring the port number through an environment variable; it will be part of the next update!