pombreda / feedparser

Automatically exported from code.google.com/p/feedparser

`updated` can be a 9-tuple or a string, depending on context #250

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
feed['updated'] is documented to return an unparsed *string* corresponding to 
e.g. /rss/channel/pubDate. The parsed time.struct_time equivalent is documented 
to be available in feed['updated_parsed']. See:
http://www.feedparser.org/docs/reference-feed-updated.html

However, feed['updated'] frequently returns a time.struct_time object instead:
>>> feedparser.parse("http://www.njtransit.com/rss/RailAdvisories_feed.xml")['updated']
time.struct_time(tm_year=2011, tm_mon=1, tm_mday=24, tm_hour=20, tm_min=31, tm_sec=8, tm_wday=0, tm_yday=24, tm_isdst=0)

The problem is that feedparser looks up the HTTP Last-Modified header and 
stores the *parsed* result in 'modified', which maps (via FeedParserDict 
keymap) to 'updated', overwriting the previous value parsed from XML.

Expected behavior is for the HTTP lookup to store the parsed date in 
'modified_parsed', and the raw string in 'modified'.

Or better yet, store the HTTP value in a separate variable and fall back on the 
HTTP value if 'updated' isn't present in XML (but that's a decision I'll leave 
to someone better-versed in the feedparser API).

Original issue reported on code.google.com by adamjer...@gmail.com on 24 Jan 2011 at 8:37

GoogleCodeExporter commented 9 years ago
This is an inconsistency within feedparser, but it's not something that will be 
fixed in time for the next release due to existing applications' expectations. 
You do appear to be confusing two distinct sources of dates, but even if that's 
the case it only highlights why this needs to be changed:

`result.feed.updated` is always a string representing when the feed was last 
modified *according to the XML file itself*

`result.updated` is always a 9-tuple representing when the feed was last 
modified *according to the server*.

The string version of the HTTP Last-Modified header is available in 
`result.headers['last-modified']` (or 'Last-Modified' in Python 3). Regardless, 
for consistency feedparser needs to store the string version in 
`result.updated` and the parsed 9-tuple version in `result.updated_parsed`. 
Again, this will break existing applications, so the fix won't be in the next 
release.

Thank you for reporting this, Adam! I'll get it fixed sometime after the next 
release.

Original comment by kurtmckee on 24 Jan 2011 at 11:51
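
As a rough illustration of the layout proposed above (raw string in `updated`, 9-tuple in `updated_parsed`), the following standard-library sketch splits a Last-Modified header into both forms. The helper name `split_last_modified` and the dict it returns are hypothetical and are not part of feedparser; the header value is chosen to match the struct_time shown in the original report.

```python
import time
from email.utils import parsedate_to_datetime


def split_last_modified(headers):
    """Return the raw and parsed Last-Modified values side by side.

    Hypothetical helper for illustration only; not feedparser's API.
    """
    # Header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("last-modified")
    if raw is None:
        return {"updated": None, "updated_parsed": None}
    # Convert the RFC 2822 date string to a UTC time.struct_time (a 9-tuple).
    parsed = parsedate_to_datetime(raw).utctimetuple()
    return {"updated": raw, "updated_parsed": parsed}


example = split_last_modified({"Last-Modified": "Mon, 24 Jan 2011 20:31:08 GMT"})
print(example["updated"])                                     # the raw string
print(isinstance(example["updated_parsed"], time.struct_time))
print(example["updated_parsed"].tm_hour)
```

Keeping the two forms under separate keys means callers never have to check the type of `updated` before using it.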

GoogleCodeExporter commented 9 years ago
Thanks Kurt!

Original comment by adamjer...@gmail.com on 25 Jan 2011 at 3:01

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
did you happen to get a chance to look at this ?

i'm asking because i just tripped over something very similar which i believe 
is related:
as opposed to the sample usage shown at http://www.feedparser.org (cool page 
btw - zero guff, just hard facts :-),
>>> e.updated_parsed
does not return
(2005, 11, 9, 11, 56, 34, 2, 313, 0)
but
time.struct_time(tm_year=2005, tm_mon=11, tm_mday=9, tm_hour=11, tm_min=56, tm_sec=34, tm_wday=2, tm_yday=313, tm_isdst=0)
at least on my current mac osx machine.

so while strictly speaking this is a different attribute than what is mentioned 
in the bug report, i hazard a guess that the issues are related.

in the meantime, is there any recommendation for a workaround ? i thought about 
using the 'native' i.e. unparsed date/time string coming straight from the 
source, but it seems every provider uses a different format and i might have to 
try quite a few strptime formats, something i would really like to avoid...

Original comment by steven.s...@gmail.com on 26 Aug 2011 at 4:33
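
On the workaround question above: a parsed `time.struct_time` can be converted directly, so there is no need to guess strptime formats for the raw string. A minimal sketch using only the standard library, with the values copied from the output quoted in the comment. It assumes the parsed tuple is in UTC; verify that against the feedparser version in use.

```python
import calendar
import time
from datetime import datetime, timezone

# Values copied from the struct_time quoted in the comment above.
updated_parsed = time.struct_time((2005, 11, 9, 11, 56, 34, 2, 313, 0))

# The first six fields are year, month, day, hour, minute, second.
# Assumption: the tuple is expressed in UTC.
updated_dt = datetime(*updated_parsed[:6], tzinfo=timezone.utc)

# calendar.timegm() interprets the tuple as UTC and returns epoch seconds.
updated_ts = calendar.timegm(updated_parsed)

print(updated_dt.isoformat())  # 2005-11-09T11:56:34+00:00
print(updated_ts)
```

`time.mktime()` would interpret the same tuple as local time, so `calendar.timegm()` is the safer choice whenever the tuple is known to be UTC.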

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
@Steven: I'm going to look at this as soon as possible, and as time permits. 
I'm really sorry for the delay!

@jeftine: I'm tired of you spamming the feedparser issue tracker. Don't come 
back.

Original comment by kurtmckee on 28 Aug 2011 at 2:04

GoogleCodeExporter commented 9 years ago
Fixed in r593.

@Steven: Python's `time.struct_time` is basically a named tuple. You can 
iterate over it like a tuple and specify slices and so forth, but you can also 
pick out a particular value by name, such as `e.updated_parsed.tm_year`. You 
can also convert it to a tuple if you don't need the additional functionality, 
but it's not necessary.

Original comment by kurtmckee on 12 Sep 2011 at 2:58
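
The tuple-like behaviour described in the last comment can be checked with the standard library alone. This sketch builds a `time.struct_time` from the values quoted earlier in the thread instead of fetching a feed; the variable names are illustrative.

```python
import time

parsed = time.struct_time((2005, 11, 9, 11, 56, 34, 2, 313, 0))

# A struct_time compares equal to a plain tuple holding the same nine values.
print(parsed == (2005, 11, 9, 11, 56, 34, 2, 313, 0))  # True

# Fields are reachable by position or by name.
print(parsed[0] == parsed.tm_year)  # True

# Slicing yields an ordinary tuple.
date_part = parsed[:3]
print(date_part)  # (2005, 11, 9)

# Unpacking and explicit conversion also work.
year, month, day, *rest = parsed
as_tuple = tuple(parsed)
print(type(as_tuple).__name__)  # tuple
```

Code written against the older plain 9-tuple output therefore generally keeps working when handed a `struct_time`.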