Closed arisudesu closed 7 years ago
The timeout was added for urllib2
There are two imports Request
and urlopen
which are used from that module. So, I think it should be alright to make this change.
Something like this should work
diff --git a/src/boilerpipe/extract/__init__.py b/src/boilerpipe/extract/__init__.py
index 2225b58..c8cfd54 100644
--- a/src/boilerpipe/extract/__init__.py
+++ b/src/boilerpipe/extract/__init__.py
@@ -7,13 +7,13 @@ import socket
import chardet
import threading
-socket.setdefaulttimeout(15)
lock = threading.Lock()
InputSource = jpype.JClass('org.xml.sax.InputSource')
StringReader = jpype.JClass('java.io.StringReader')
HTMLHighlighter = jpype.JClass('de.l3s.boilerpipe.sax.HTMLHighlighter')
BoilerpipeSAXInput = jpype.JClass('de.l3s.boilerpipe.sax.BoilerpipeSAXInput')
+DEFAULT_TIMEOUT = 15
class Extractor(object):
"""
@@ -33,10 +33,10 @@ class Extractor(object):
data = None
headers = {'User-Agent': 'Mozilla/5.0'}
- def __init__(self, extractor='DefaultExtractor', **kwargs):
+ def __init__(self, extractor='DefaultExtractor', timeout=DEFAULT_TIMEOUT, **kwargs):
if 'url' in kwargs:
request = Request(kwargs['url'], headers=self.headers)
- connection = urlopen(request)
+ connection = urlopen(request, timeout=timeout)
self.data = connection.read()
encoding = connection.headers['content-type'].lower().split('charset=')[-1]
if encoding.lower() == 'text/html':
@arisudesu Please feel free to create a PR, and assign this issue to yourself.
That's what I was thinking about, exactly. Made the PR: #53
It seems to me that changing the global timeout is not a good idea. This can affect other code that uses networking. My question is, is it possible to replace the use of socket.setdefaulttimeout with the use of timeout param on urlopen? The parameter exists since python 2.6 / 3.0, and the JPype requires 2.6 anyway, so it is possible to use this parameter.