libo26 / feedparser

Automatically exported from code.google.com/p/feedparser
Other
0 stars 0 forks source link

add a timeout parameter on parse function #245

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
My feed crawler blocked long time on some bad url that takes very long time to 
respond, I think it shoud have a 'timeout' parameter on feedparser.parse() to 
prevent.

Original issue reported on code.google.com by flytwoki...@gmail.com on 2 Jan 2011 at 4:07

GoogleCodeExporter commented 9 years ago
@Adewale: This is a duplicate of issue 221.

@flytwokites: From the discussion in issue 221:

    import socket
    socket.setdefaulttimeout(<timeout in floating seconds>)

By default Python's timeout is `None`, so I recommend setting this to a more 
reasonable value for your software.

Original comment by kurtmckee on 3 Jan 2011 at 1:03

GoogleCodeExporter commented 9 years ago
But, I need a small timeout to parse feeds, e.g. 5 seconds, and I also use the 
socket lib to connect to other important service like remote ssh/smtp..., set a 
global timeout too small may cause other codes that use socket lib be unstable.

Original comment by flytwoki...@gmail.com on 3 Jan 2011 at 1:12

GoogleCodeExporter commented 9 years ago
That makes sense, assuming that your application is running a web crawler, SMTP 
client/server, and an SSH client/server all in the same interpreter instance. 
If you're running all of that in separate Python interpreter instances, setting 
the default timeout in one won't affect the others.

Meanwhile, I've grepped through the Python 2.4 and 2.5 standard library on my 
system, scoured search engines for hints, and have found absolutely nothing for 
supporting timeouts on systems other than Python 2.6 and up. In issue 221 I 
wrote why I strongly disagree with adding a `timeout` argument to `parse()`, 
but adding a module-level variable that doesn't actually work in 2.4 and 2.5 
without causing thread-safety issues isn't an option either.

There are some other options available:

1) You can manually add a timeout argument to feedparser, in function 
`_open_resource()`, at what is currently line 2878. Look for the following line 
and change it to:

  BEFORE: return opener.open(request)
   AFTER: return opener.open(request, timeout=<whatever>)

2) If you're using 2.4 or 2.5, you have to use the code I mentioned above. If 
you're not using a multi-threaded application, I expect you can set the default 
timeout as I mentioned above and then change it back after calling feedparser's 
`parse()` function. You could also use a separate URL-fetching library (such as 
Twisted) to get the headers and data, and then run the raw data through 
`parse()`.

3) Again, if you're not running a feed aggregator, SSH client, and SMTP client 
all in the same program, setting the default timeout won't have any affect from 
one program to the next.

Original comment by kurtmckee on 3 Jan 2011 at 5:49

GoogleCodeExporter commented 9 years ago
Thank you for your reply.

I using python 2.7, so I make a local copy of feedparser to add timeout 
parameter.

What about add timeout to feedparser, and note in document that it needs python 
>= 2.6 otherwise will raise an exception.

Original comment by flytwoki...@gmail.com on 3 Jan 2011 at 8:44

GoogleCodeExporter commented 9 years ago
I'm marking this as a duplicate of issue 221.
Feedparser's code already has too many version-specific parts. I don't believe 
that timeout is important enough to justify adding another one. Moreover it 
would be horribly mis-leading to people using earlier versions of Python to 
suggest (via the parser() signature that we support timeout when we don't.

However I'd change my mind if you came up with a design based on wrapping the 
existing code with something which explicitly said that it only supports Python 
>= 2.6 and is thread-safe.

Original comment by adewale on 3 Jan 2011 at 8:00

GoogleCodeExporter commented 9 years ago
> I don't believe that timeout is important enough

If you'd like the module to be considered a toy not suitable for serious work I 
agree.  As mentioned before, setting it globally is not an answer for many.

> Feedparser's code already has too many version-specific parts.

Unfortunately, that's the reality of supporting multiple versions of python.  
If it's becoming unbearable then perhaps it's time to drop < 2.6 instead.

> would be horribly mis-leading to people using earlier versions

So, the potential of misleading the minority (on < 2.6) is more important than 
fixing an important issue for the majority?  As long as they get a short 
explanation in the docs and a useful error that should be plenty.

They won't lose anything but a minute of potential disappointment however the 
masses using 2.6+ will gain an important capability.

I might understand your position if it broke things for earlier versions but 
since it doesn't find it hard to justify.

Original comment by geekad...@mgmiller.net on 22 Jan 2012 at 8:36

GoogleCodeExporter commented 9 years ago
Hi Mike, thank you for contributing to the discussion. There are a few points 
I'd like to address.

First, feedparser isn't an HTTP client library. I expect that you and the other 
four developers who've written comments in the two related bug reports will 
find that a true HTTP client library will be better suited for larger, more 
comprehensive projects.

Second, feedparser isn't an HTTP client library. The version-specific parts 
that Ade was referring to have to do primarily with changes in the XML 
libraries and sgmllib, and are there so that nobody experiences a loss of 
functionality. It's not unbearable at all!

Third, feedparser isn't an HTTP client library. Its `parse()` function will not 
include a `timeout` argument.

Original comment by kurtmckee on 22 Jan 2012 at 9:10

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Sincerely intended... if you won't do it right, why do it at all?  

A parse function shouldn't take an url as argument, yet it does.  To run 98% of 
the way to the finish line, yet refuse to cross only frustrates.  If it's not 
an http client library, better off removing that functionality than misleading 
with a toy implementation that increases complexity.

Original comment by geekad...@mgmiller.net on 23 Jan 2012 at 5:09