misja / python-boilerpipe

Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages
Other
539 stars 143 forks source link

Use of socket.setdefaulttimeout in import-level code #52

Closed arisudesu closed 7 years ago

arisudesu commented 7 years ago

It seems to me that changing the global timeout is not a good idea. This can affect other code that uses networking. My question is, is it possible to replace the use of socket.setdefaulttimeout with the use of timeout param on urlopen? The parameter exists since python 2.6 / 3.0, and the JPype requires 2.6 anyway, so it is possible to use this parameter.

tuxdna commented 7 years ago

The timeout was added for urllib2

There are two imports Request and urlopen which are used from that module. So, I think it should be alright to make this change.

tuxdna commented 7 years ago

Something like this should work

diff --git a/src/boilerpipe/extract/__init__.py b/src/boilerpipe/extract/__init__.py
index 2225b58..c8cfd54 100644
--- a/src/boilerpipe/extract/__init__.py
+++ b/src/boilerpipe/extract/__init__.py
@@ -7,13 +7,13 @@ import socket
 import chardet
 import threading

-socket.setdefaulttimeout(15)
 lock = threading.Lock()

 InputSource        = jpype.JClass('org.xml.sax.InputSource')
 StringReader       = jpype.JClass('java.io.StringReader')
 HTMLHighlighter    = jpype.JClass('de.l3s.boilerpipe.sax.HTMLHighlighter')
 BoilerpipeSAXInput = jpype.JClass('de.l3s.boilerpipe.sax.BoilerpipeSAXInput')
+DEFAULT_TIMEOUT = 15

 class Extractor(object):
     """
@@ -33,10 +33,10 @@ class Extractor(object):
     data      = None
     headers   = {'User-Agent': 'Mozilla/5.0'}

-    def __init__(self, extractor='DefaultExtractor', **kwargs):
+    def __init__(self, extractor='DefaultExtractor', timeout=DEFAULT_TIMEOUT, **kwargs):
         if 'url' in kwargs:
             request     = Request(kwargs['url'], headers=self.headers)
-            connection  = urlopen(request)
+            connection  = urlopen(request, timeout=timeout)
             self.data   = connection.read()
             encoding    = connection.headers['content-type'].lower().split('charset=')[-1]
             if encoding.lower() == 'text/html':
tuxdna commented 7 years ago

@arisudesu Please feel free to create a PR, and assign this issue to yourself.

arisudesu commented 7 years ago

That's what I was thinking about, exactly. Made the PR: #53

tuxdna commented 7 years ago

Closed via: https://github.com/misja/python-boilerpipe/pull/53