psf / requests

A simple, yet elegant, HTTP library.
https://requests.readthedocs.io/en/latest/
Apache License 2.0
52.17k stars 9.33k forks source link

Requests.put blocking on Windows (Tika-Python) #2725

Closed chrismattmann closed 9 years ago

chrismattmann commented 9 years ago

Over in the Apache Tika Python Port I'm noticing in tika-python#44 and in tika-python#58 some odd behavior with requests on Python 2.7.9. For whatever reason, when using a file handle and putting it with requests.put it blocks and blocks until finally it gets (correctly) a BadStatusLine back after a timeout. Anyone else seen this?

Lukasa commented 9 years ago

Any requests call that uploads a request body will send the entire body before attempting to read the response. That means that, if the remote end does not close that connection abruptly (throwing an Exception on our end), we'll block until the response has been entirely sent.

Sadly, we don't support the 100-continue flow at this time (because httplib has no way of letting us see what's going on there), so it's difficult for us to do anything else.

chrismattmann commented 9 years ago

Hi @Lukasa - what's weird though is that this works fine on *Nix and Mac. Tika Python is a Python library that uses (at its lowest level) requests to talk to the Tika JAX RS Server. It posts to the /rmeta endpoint. On Linux, calls like parser.from_file() work fine - on Windows, they block and block and then finally timeout. I'm not sure how the answer above has to do with that behavior but perhaps I didn't explain it well enough. Any ideas?

chrismattmann commented 9 years ago

Here's what I'm seeing on the Tika server end:

Aug 14, 2015 7:32:38 PM org.apache.tika.server.resource.TikaResource parse
WARNING: rmeta: Text extraction failed
org.apache.tika.io.TaggedIOException: timeout
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
    at org.apache.tika.io.TikaInputStream.peek(TikaInputStream.java:489)
    at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:83)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
    at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:159)
    at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:244)
    at org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:86)
    at org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:68)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
    at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
    at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
    at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
    at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:370)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
    at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.eclipse.jetty.io.EofException: timeout
    at org.eclipse.jetty.http.HttpParser.blockForContent(HttpParser.java:1200)
    at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:61)
    at java.io.FilterInputStream.read(Unknown Source)
    at java.io.BufferedInputStream.fill(Unknown Source)
    at java.io.BufferedInputStream.read1(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    ... 40 more

Aug 14, 2015 7:32:38 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
org.apache.cxf.interceptor.Fault: Could not send Message.
    at org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:64)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
    at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
    at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
    at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
    at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:370)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
    at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.eclipse.jetty.io.EofException
    at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:914)
    at org.eclipse.jetty.http.HttpGenerator.complete(HttpGenerator.java:798)
    at org.eclipse.jetty.server.AbstractHttpConnection.commitResponse(AbstractHttpConnection.java:650)
    at org.eclipse.jetty.server.AbstractHttpConnection$Output.close(AbstractHttpConnection.java:1106)
    at org.apache.cxf.transport.http.AbstractHTTPDestination.flushHeaders(AbstractHTTPDestination.java:626)
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.flushHeaders(JettyHTTPDestination.java:286)
    at org.apache.cxf.transport.http.AbstractHTTPDestination$WrappedOutputStream.close(AbstractHTTPDestination.java:784)
    at org.apache.cxf.transport.AbstractConduit.close(AbstractConduit.java:56)
    at org.apache.cxf.transport.http.AbstractHTTPDestination$BackChannelConduit.close(AbstractHTTPDestination.java:720)
    at org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:62)
    ... 24 more
Caused by: java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(Unknown Source)
    at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
    at org.eclipse.jetty.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:293)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:404)
    at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:844)
    ... 33 more

Aug 14, 2015 7:32:38 PM org.apache.cxf.phase.PhaseInterceptorChain doDefaultLogging
WARNING: Interceptor for {http://resource.server.tika.apache.org/}MetadataResource has thrown exception, unwinding now
org.apache.cxf.interceptor.Fault: XML_WRITE_EXC
    at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:102)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
    at org.apache.cxf.interceptor.AbstractFaultChainInitiatorObserver.onMessage(AbstractFaultChainInitiatorObserver.java:113)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:371)
    at org.apache.cxf.interceptor.OutgoingChainInterceptor.handleMessage(OutgoingChainInterceptor.java:83)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
    at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
    at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
    at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:370)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
    at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Unknown Source)
Caused by: com.ctc.wstx.exc.WstxIOException: null
    at com.ctc.wstx.sw.BaseStreamWriter.flush(BaseStreamWriter.java:255)
    at org.apache.cxf.jaxrs.interceptor.JAXRSDefaultFaultOutInterceptor.handleMessage(JAXRSDefaultFaultOutInterceptor.java:100)
    ... 26 more
Caused by: org.eclipse.jetty.io.EofException
    at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:142)
    at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:107)
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination$JettyOutputStream.write(JettyHTTPDestination.java:325)
    at org.apache.cxf.io.AbstractWrappedOutputStream.write(AbstractWrappedOutputStream.java:51)
    at com.ctc.wstx.sw.EncodingXmlWriter.flushBuffer(EncodingXmlWriter.java:705)
    at com.ctc.wstx.sw.EncodingXmlWriter.flush(EncodingXmlWriter.java:174)
    at com.ctc.wstx.sw.BaseStreamWriter.flush(BaseStreamWriter.java:253)
    ... 27 more

And then on the requests end:

screen shot 2015-08-14 at 8 09 23 pm
chrismattmann commented 9 years ago

note this is all occurring using localhost. System env info:

Tika Server - Tried with JDK 1.8, and also with JDK 1.7. Tika Python 1.9.10 Requests 2.7.0 Python version 2.7.10

chrismattmann commented 9 years ago

Another interesting note. requests.put works fine on windows as long as I read the whole file into memory first, e.g., if I pass open(filename, 'r').read() into the data parameter for the PUT request. If I pass in just open(filename, 'r') as the data parameter, it hangs forever until the BadStatusLine.

sigmavirus24 commented 9 years ago

@chrismattmann so there's (as you could guess) a significant difference in behaviour between those two test cases.

When you just give us the raw data as a string, we write it all at once. When you give us an open file descriptor, we pass it down to httplib. What httplib then does with it is read 8192 bytes (yes 8 KB) at a time and write it to the socket to stream it. Note this is not the same as a chunked upload.

I suspect that Tika server doesn't like getting such small amounts over a period of time.

sigmavirus24 commented 9 years ago

If you want to test sending larger chunks (by forcing httplib to send what it reads) you could tinker with https://github.com/sigmavirus24/requests-toolbelt/pull/84/files to wrap your file to see if uploading more at once will aid this.

chrismattmann commented 9 years ago

OK, here is some more info. If I use something like:

requests.put(serverUrl, files={'filename' : open(filename, 'r')}, headers=headers)

Works as expected (however it makes Tika server get a 415 b/c for whatever reason Content-Type isn't passed in the HTTP request). However:

requests.put(serverUrl, data=open(filename, 'r'), headers=headers)

still fails.

chrismattmann commented 9 years ago

@sigmavirus24 thanks for the insight. I'll take a look at https://github.com/sigmavirus24/requests-toolbelt/pull/84

chrismattmann commented 9 years ago

One thing too @sigmavirus24 that is kind of odd though - this same behavior with e.g., data=open(filename, 'r') works fine on Linux, e.g., with Tika-server running on Linux and Tika Python running on Linux. It only seems to fail on Windows.

chrismattmann commented 9 years ago

Ahh, quick insight too, the 415 error request on Tika server is caused by the following content type passed: (multipart/form-data;boundary=d764925698424d719281a56941edb9b7). So it seems that files={'filename', open(filename, 'r')} causes a multi-part request using PUT; however, data=open(filename, 'r') doesn't.

Full Tika Server stack trace:

Aug 14, 2015 7:55:44 PM org.apache.tika.server.resource.TikaResource logRequest
INFO: rmeta (multipart/form-data;boundary=d764925698424d719281a56941edb9b7)
Aug 14, 2015 7:55:44 PM org.apache.tika.server.resource.TikaResource parse
WARNING: rmeta: Text extraction failed
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.server.resource.TikaResource$1@8a9107e
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:159)
    at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:244)
    at org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:86)
    at org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:68)
    at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:181)
    at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:97)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:200)
    at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:99)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
    at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
    at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:307)
    at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
    at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:251)
    at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:261)
    at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:70)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:370)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
    at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Unknown Source)
Caused by: javax.ws.rs.WebApplicationException: HTTP 415 Unsupported Media Type
    at org.apache.tika.server.resource.TikaResource$1.parse(TikaResource.java:111)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    ... 36 more
sigmavirus24 commented 9 years ago

@chrismattmann please be aware that while you debug this in the comments on a closed issue, you're emailing potentially over 700 people subscribed to issues (creation, comments, etc.)

chrismattmann commented 9 years ago

sorry about that. I just thought providing more information would help. I'll stop spamming now while I continue to try and debug this.

chrismattmann commented 9 years ago

As an FYI, I don't consider this issue fixed/closed. It's unclear to me where the precise error here is - seems to be either in the way that requests generates PUT HTTP requests when passing data=open('filename', 'r') (which doesn't work on Windows when contacting Jetty from Tika Server on localhost) compared to the way it generates PUT HTTP requests when doing files={'filename' : open('filename', 'r')} (which works on Windows when contacting Jetty from Tika Server on localhost). I've seen things debugging around that Jetty by default uses Memory mapped IO which locks files on Windows. Not sure if that's the root cause of the issue here or not. Anyways wanted to share that.

Lukasa commented 9 years ago

To be clear, the root cause of the issue is likely to be in httplib, or possibly even lower (e.g. WinSock). Requests simply doesn't have a separate code path here.

sigmavirus24 commented 9 years ago

So I've lost my touch with reading Java stacktraces to be honest but I see

org.apache.tika.io.TaggedIOException: timeout

And I wonder if Tika has a timeout such that if it doesn't receive the full amount of data in that period, it times out the request and closes the socket. It could be attempting to protect against bad actors who do slow uploads to DDoS the server.

chrismattmann commented 9 years ago

thanks @Lukasa and @sigmavirus24 . The odd thing is this is an extremely small file (default win.ini) and the behavior is that i make the post - then it waits for like 2-3 minutes, then times out. Just odd.

Lukasa commented 9 years ago

@chrismattmann That was not clear to me previously, and makes the whole thing substantially more interesting.

I think I've seen something like this on hyper, and couldn't work out what was going on: I was getting situations where I was failing to read data I knew was in the socket buffer. See also: lukasa/hyper#142.

sigmavirus24 commented 9 years ago

@chrismattmann that would have been helpful to know at the beginning of this thread. That description seems a bit more consistent with

[snip]
Caused by: org.eclipse.jetty.io.EofException: timeout
    at org.eclipse.jetty.http.HttpParser.blockForContent(HttpParser.java:1200)
[snip]
Caused by: org.eclipse.jetty.io.EofException
    at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:914)
[snip]
Caused by: org.eclipse.jetty.io.EofException
    at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:142)

Note that those are from what I believe are three separate tracebacks, but my Java debugging is several years old and quite rusty.

Could you share exactly the file that you used? Further, what happens if you open the file as

with open(file, 'rb') as fd:
   ...

(or the equivalent in your code, the important part is opening the file with the b flag).

Let's also try some other permutations:

  1. Tika Server running on a non-Windows operating system, with Tika-Python running on Windows
  2. Tika Server running on Windows, with Tika-Python running on a non-Windows OS.

I'm fairly confident that this is a bug very far down the stack from us (which is why I'm not reopening this), but I'm interested in helping you find out enough information to report it to the appropriate place.

chrismattmann commented 9 years ago

thanks @Lukasa and @sigmavirus24 . Here is the file I was using (C:\Windows\win.ini)

; for 16-bit app support
[fonts]
[extensions]
[mci extensions]
[files]
[Mail]
MAPI=1

I tried with the 'b' flag as well, and there was no difference.

As for:

  1. Tika Server running on a non-Windows operating system, with Tika-Python running on Windows
  2. Tika Server running on Windows, with Tika-Python running on a non-Windows OS.

Got it. I will try both and report back. Thanks for helping.

chrismattmann commented 9 years ago

OK have an answer for this one:

  1. Tika Server running on a non-Windows operating system, with Tika-Python running on Windows

Steps to reproduce:

  1. CentOS centos7-x86_64-generic-cloud image, Openstack, Python 2.7.10 (default, Aug 12 2015, 04:31:43) , JDK "1.7.0_85" OpenJDK Runtime Environment (rhel-2.6.1.2.el7_1-x86_64 u85-b01) OpenJDK 64-Bit Server VM (build 24.85-b03, mixed mode) - running Tika-Server
  2. Windows 7 Ultimate - JDK 1.8, Python 2.7.10 - running Tika-Python (client)
  3. Following code on Windows 7 Ultimate (with a modified version of Tika-Python, diff provided last, note localhost:55000 is a proxy to the remote host):
   from tika import parser
   parser.from_file('/Windows/win.ini', 'http://localhost:55000')

produces (Windows side):

 Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    parser.from_file('/Windows/win.ini', 'http://localhost:55000')
  File "C:\Python27\lib\site-packages\tika\parser.py", line 24, in from_file
    jsonOutput = parse1('all', filename, serverEndpoint)
  File "C:\Python27\lib\site-packages\tika\tika.py", line 151, in parse1
    verbose, tikaServerJar)
  File "C:\Python27\lib\site-packages\tika\tika.py", line 256, in callServer
    resp = verbFn(serviceUrl, encodedData, headers=headers)
  File "C:\Python27\lib\site-packages\requests\api.py", line 122, in put
    return request('put', url, data=data, **kwargs)
  File "C:\Python27\lib\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "C:\Python27\lib\site-packages\requests\adapters.py", line 415, in send
    raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', BadStatusLine("''",))

And on Linux/CentOS server side produces no log output on the tika-server.log file.

I verified that Windows 7 Ultimate could see the Linux machine (via Telnet, e.g., telnet linux-machine 9998).

Here's the diff:

diff --git a/tika/tika.py b/tika/tika.py
index da659ad..6a04f8a 100644
--- a/tika/tika.py
+++ b/tika/tika.py
@@ -247,8 +247,8 @@ def callServer(verb, serverEndpoint, service, data, headers, verbose=Verbose, ti
         die('Tika Server call must be one of %s' % str(httpVerbs.keys()))
     verbFn = httpVerbs[verb]

-    if Windows and type(data) is file:
-        data = data.read()    
+    #if Windows and type(data) is file:
+    #    data = data.read()    

     encodedData = data
     if type(data) is unicode:

So in case it wasn't clear, the first scenario produces, the exact same result. I'll try the second one now.

chrismattmann commented 9 years ago

OK here is part 2.

  1. Tika Server running on Windows, with Tika-Python running on a non-Windows OS.

Steps to reproduce:

  1. MacOS X 10.9.5, Python 2.7.8 (default, Sep 27 2014, 11:46:04)
  2. Windows 7 Ultimate - JDK 1.8, Python 2.7.10 - running Tika-Python (server), Java version "1.7.0_60" Java(TM) SE Runtime Environment (build 1.7.0_60-b19)

Following code on Mac OS X (with a modified version of Tika-Python, diff provided last note 192.X is the VMWare guest OS ip that I am accessing from my Mac):

from tika import parser
parser.from_file('/Users/mattmann/.bashrc', 'http://192.168.151.143:9998')

which produces:

{'content': u'\n\n\n\n\n\n\n\n\nalias tika="java -jar /usr/local/tika/tika-app-1.8.jar"\nalias ls="ls -FHG"\n\n', 'metadata': {u'resourceName': u'.bashrc', u'X-Parsed-By': [u'org.apache.tika.parser.DefaultParser', u'org.apache.tika.parser.txt.TXTParser'], u'Content-Type': u'text/plain; charset=ISO-8859-1', u'X-TIKA:parse_time_millis': u'4', u'Content-Encoding': u'ISO-8859-1'}}

No error logs from Windows as request was successful. Note on the Windows side, I had to restart the tika-server.jar file with the argument --host 0.0.0.0 in order to get it to bind to all local interfaces, so that I could communicate with it outside of the Windows box.

Here's the diff:

diff --git a/tika/tika.py b/tika/tika.py
index da659ad..6a04f8a 100644
--- a/tika/tika.py
+++ b/tika/tika.py
@@ -247,8 +247,8 @@ def callServer(verb, serverEndpoint, service, data, headers, verbose=Verbose, ti
         die('Tika Server call must be one of %s' % str(httpVerbs.keys()))
     verbFn = httpVerbs[verb]

-    if Windows and type(data) is file:
-        data = data.read()    
+    #if Windows and type(data) is file:
+    #    data = data.read()    

     encodedData = data
     if type(data) is unicode:
chrismattmann commented 9 years ago

any comments here @sigmavirus24 @Lukasa ?

Lukasa commented 9 years ago

Right now it might be helpful if I could also see some packet capture of the failing scenario. Are you familiar with tcpdump?

chrismattmann commented 9 years ago

hi @Lukasa nope I haven't used it. I did try and find some programs that would do this after I found out Wireshark won't capture localhost on Windows (or it will if you hack some networking routes, etc., but I tried and none of the approaches worked). Any advice here would be appreciated.

Lukasa commented 9 years ago

Ah, yeah, capturing localhost on Windows is tricky. I've used RawCap in the past to some effect.

chrismattmann commented 9 years ago

hey @Lukasa yep I tried to use RawCap unfortunately the frickin' page download redirects to an non existent URL and I can't download the tool :( I also couldn't find it on the internet, sadly.

Lukasa commented 9 years ago

=( That's unhelpful. Is it going to be easier to temporarily bring a non-Windows box onto your LAN for the packet capture?

chrismattmann commented 9 years ago

Hi @Lukasa well see that's the thing - this error only manifests on Windows. There are not any issues from Linux to Linux; from Linux to Windows; but there are errors Windows to Windows.

Lukasa commented 9 years ago

Oh, both the server and the client have to be Windows? Can we spin this up on multiple Windows boxes in AWS?

sigmavirus24 commented 9 years ago

@chrismattmann did I misunderstand your earlier comments? I thought we only needed the client to be on Windows to reproduce this.

chrismattmann commented 9 years ago

sorry guys, misspoke. Per https://github.com/kennethreitz/requests/issues/2725#issuecomment-132040020 (the Tika server running on Linux and Tika-Python [client] on Windows) which doesn't work and https://github.com/kennethreitz/requests/issues/2725#issuecomment-132092182 (the Tika server running on Windows, and Tika-Python [client] running on Linux) which does work, there are not just Windows issues, correct.

chrismattmann commented 9 years ago

have either of you @sigmavirus24 and @Lukasa been able to replicate https://github.com/kennethreitz/requests/issues/2725#issuecomment-132040020 and https://github.com/kennethreitz/requests/issues/2725#issuecomment-132092182 on your own machines? That would help I think and I did do the work there hoping you guys would try to replicate.

Lukasa commented 9 years ago

I haven't had a chance to run this locally because I need to set up some Windows VMs to do that. I'll aim to do so sometime this week.

sigmavirus24 commented 9 years ago

@chrismattmann I'm trying to get a hold of a friend who can reproduce this since I don't have access to Windows VMs at the moment. I've been traveling a lot lately so finding time to borrow a friend's laptop hasn't worked out well.

chrismattmann commented 9 years ago

thanks guys

chrismattmann commented 9 years ago

hey guys @Lukasa @sigmavirus24 either of you get a chance to try this or replicate?

Lukasa commented 9 years ago

Still not yet. =( I'm crazy busy at the minute and probably won't find much time until next month.

sigmavirus24 commented 9 years ago

Since we're investigating this (or at least planning to when either of us find time) let's reopen it for now

chrismattmann commented 9 years ago

guys any progress @Lukasa @sigmavirus24 ?

Lukasa commented 9 years ago

Nope, but I also haven't had time to. I'm pushing requests 2.8.1 today, after which time I should be able to grab some Windows VMs and investigate a bit.

Lukasa commented 9 years ago

Ok, 2.8.1 pushed. Let's see what I can see.

Lukasa commented 9 years ago

So, @chrismattmann: right now I can't reproduce because tika-python on my Windows box falls at the first hurdle:

>>> from tika import parser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-packages\tika\parser.py", line 19, in <module>
    from tika import parse1, callServer, ServerEndpoint
ImportError: cannot import name 'parse1'
Lukasa commented 9 years ago

Oh, nevermind, I see that this is chrismattmann/tika-python#68. I'll try again with Python 2.7.

Lukasa commented 9 years ago

Alright, without your diff tika-python works just fine. With it, we get a hang. This is a good starting point, I can reproduce the bug now.

Lukasa commented 9 years ago

Cool, I have pcap files of successful and failed transactions. Here's the successful one, stripped to just the HTTP:

PUT /rmeta HTTP/1.1
Host: 10.0.0.4:9998
Content-Length: 376
Accept-Encoding: gzip, deflate
Content-Disposition: attachment; filename=win.ini
Accept: application/json
User-Agent: python-requests/2.8.1
Connection: keep-alive

; for 16-bit app support
[fonts]
[extensions]
[mci extensions]
[files]
[Mail]
MAPI=1
[MCI Extensions.BAK]
3g2=MPEGVideo
3gp=MPEGVideo
3gp2=MPEGVideo
3gpp=MPEGVideo
aac=MPEGVideo
adt=MPEGVideo
adts=MPEGVideo
m2t=MPEGVideo
m2ts=MPEGVideo
m2v=MPEGVideo
m4a=MPEGVideo
m4v=MPEGVideo
mod=MPEGVideo
mov=MPEGVideo
mp4=MPEGVideo
mp4v=MPEGVideo
mts=MPEGVideo
ts=MPEGVideo
tts=MPEGVideo
HTTP/1.1 200 OK
Content-Type: application/json
Date: Wed, 14 Oct 2015 11:42:51 GMT
Transfer-Encoding: chunked
Server: Jetty(8.y.z-SNAPSHOT)

30D
[{"Content-Encoding":"ISO-8859-1","Content-Type":"text/plain; charset\u003dISO-8859-1","X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.txt.TXTParser"],"X-TIKA:content":"\n\n\n\n\n\n\n\n\n; for 16-bit app support\n[fonts]\n[extensions]\n[mci extensions]\n[files]\n[Mail]\nMAPI\u003d1\n[MCI Extensions.BAK]\n3g2\u003dMPEGVideo\n3gp\u003dMPEGVideo\n3gp2\u003dMPEGVideo\n3gpp\u003dMPEGVideo\naac\u003dMPEGVideo\nadt\u003dMPEGVideo\nadts\u003dMPEGVideo\nm2t\u003dMPEGVideo\nm2ts\u003dMPEGVideo\nm2v\u003dMPEGVideo\nm4a\u003dMPEGVideo\nm4v\u003dMPEGVideo\nmod\u003dMPEGVideo\nmov\u003dMPEGVideo\nmp4\u003dMPEGVideo\nmp4v\u003dMPEGVideo\nmts\u003dMPEGVideo\nts\u003dMPEGVideo\ntts\u003dMPEGVideo\n\n","X-TIKA:parse_time_millis":"6","resourceName":"win.ini"}]
0

Here's the unsuccessful one:

PUT /rmeta HTTP/1.1
Host: 10.0.0.4:9998
Content-Length: 403
Accept-Encoding: gzip, deflate
Content-Disposition: attachment; filename=win.ini
Accept: application/json
User-Agent: python-requests/2.8.1
Connection: keep-alive

; for 16-bit app support
[fonts]
[extensions]
[mci extensions]
[files]
[Mail]
MAPI=1
[MCI Extensions.BAK]
3g2=MPEGVideo
3gp=MPEGVideo
3gp2=MPEGVideo
3gpp=MPEGVideo
aac=MPEGVideo
adt=MPEGVideo
adts=MPEGVideo
m2t=MPEGVideo
m2ts=MPEGVideo
m2v=MPEGVideo
m4a=MPEGVideo
m4v=MPEGVideo
mod=MPEGVideo
mov=MPEGVideo
mp4=MPEGVideo
mp4v=MPEGVideo
mts=MPEGVideo
ts=MPEGVideo
tts=MPEGVideo

There's one key difference: the content-length header! In the successful attempt, we have Content-Length: 376. In the failed attempt, we have Content-Length: 403. The body in each case is the same, as are all other headers.

Lukasa commented 9 years ago

This points the cause of the problem at super_len on Windows.

Lukasa commented 9 years ago

Actually, no it doesn't: it points the problem at tika-python.

Observe the following in my Windows Python prompt:

>>> from requests.utils import super_len
>>> f = open('/Windows/win.ini', 'r')
>>> super_len(f)
403L
>>> data = f.read()
>>> super_len(data)
376
>>> f = open('/Windows/win.ini', 'rb')
>>> super_len(f)
403L
>>> data = f.read()
>>> super_len(data)
403

This is a thoroughly amusing problem! Anyone got a guess about what's going wrong here? I'm going to leave a nice big white block while you think about it, because the answer is unexpected.

It's universal newlines. Universal freaking newlines. Check it out:

>>> f = open('/Windows/win.ini', 'r')
>>> data1 = f.read()
>>> f = open('/Windows/win.ini', 'rb')
>>> data2 = f.read()
>>> data1[:26]
'; for 16-bit app support\n['
>>> data2[:26]
'; for 16-bit app support\r\n'

When you pass the file handle into us, we ask the filesystem how many bytes there are. It says: 403. But, when you pass it through Python, it kindly strips the \r from each line, causing us to lose a few bytes. When you passed us the string this didn't matter, because we only checked how long the string was, but when you pass us a file handle it all goes wrong.

The solution? Add one character to parse1 in tika.py to ensure that you open the file as rb, not r.

Lukasa commented 9 years ago

FWIW, I consider this to be an issue in tika-python, not in requests. However, we may want to update our documentation to suggest opening files in binary mode to avoid this kind of problem.

Lukasa commented 9 years ago

Interestingly, none of our documentation has a single example of open used without binary mode. I'll just provide a docs update that warns about this, then.

@sigmavirus24 files opened in text mode is a thing we can detect in code pretty easily: is it worth us lobbing out a warning about it?