jsumners / feedparser

Automatically exported from code.google.com/p/feedparser
Other
0 stars 0 forks source link

unmodified redirected feed returns status 302 instead of 304 #390

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
If you try to fetch a feed that is redirected to another URL, and then the feed 
has not been modified such that the redirected request returns status 304, the 
result returned by parse has status set to 302 when it should be set to 304 to 
make it clear to the caller that the feed is not modified.

Otherwise, the caller can't distinguish this from a redirected request that 
produces an incorrect, empty response. Well, at least, not without doing 
pattern matching against result.debug_message, which I'm sure you don't intend.

The fix is trivial:

--- feedparser.py   2013/02/13 02:38:25 1.1
+++ feedparser.py   2013/02/13 02:38:45
@@ -3959,6 +3959,7 @@

     # Stop processing if the server sent HTTP 304 Not Modified.
     if getattr(f, 'code', 0) == 304:
+        result['status'] = 304
         result['version'] = u''
         result['debug_message'] = 'The feed has not changed since you last checked, ' + \
             'so the server sent no data.  This is a feature, not a bug!'

Original issue reported on code.google.com by jikam...@gmail.com on 13 Feb 2013 at 2:39

GoogleCodeExporter commented 9 years ago
I've just run into this with a URL that redirects to a 404 error -- you can't 
tell that it's an error from feedparser's result.

This is actually a regression: feedparser SVN r291 returned 'status': 404, and 
current Git HEAD returns 'status': 301 (with the results otherwise being 
identical). Having done a bisection, the behaviour changed in patch 
"[63b82303ae173c007426f8ae4d33e94cbe3b7411] Fix HTTP redirection in Python 3" 
-- before that change, the redirection handlers in _FeedURLHandler only set the 
status if it wasn't already set.

(Maybe it'd be useful to return the response codes for both the initial and 
final requests?)

Original comment by ats-goog...@offog.org on 21 Jun 2013 at 4:54

GoogleCodeExporter commented 9 years ago
Two patches attached. The first just corrects the regression; the second adds a 
redirect_status attribute that also returns the (first) redirect's status.

Original comment by ats-goog...@offog.org on 21 Jun 2013 at 5:54

Attachments:

GoogleCodeExporter commented 9 years ago
After further investigation, the original behaviour in the case of a redirect 
was actually to return the final request's status if it's an error code (i.e. 
not 2xx), otherwise to return the first redirect's status. My first patch 
above'll do that, but the description is wrong.

However, even that behaviour isn't sufficient, because it's also useful for an 
aggregator to be able to distinguish different non-error codes behind a 
temporary redirect -- e.g. 200 vs. 206 for RFC 3229 "A-IM: feed" support. So 
how about sticking with the current meaning for "status" (the initial 
response's status), and adding a "final_status" attribute giving the final 
response's status?

Any thoughts welcome...

Original comment by ats-goog...@offog.org on 21 Jun 2013 at 11:13

GoogleCodeExporter commented 9 years ago
Caller should not have to worry about intermediate statuses if it doesn't care. 
Status that gets returned in status field should be final status. If there are 
referrals and you want to return them, then have a separate variable containing 
the chain of statuses in order, with additional information about the referrals.

Original comment by jkam...@quantopian.com on 21 Jun 2013 at 11:20

GoogleCodeExporter commented 9 years ago
It turns out that it's quite straightforward to capture the complete chain of 
HTTP responses using a urllib2 handler, so I'm now doing it that way and not 
using 'status' at all. Example attached.

Original comment by ats-goog...@offog.org on 24 Jun 2013 at 9:57

Attachments: