kurtmckee / feedparser

Parse feeds in Python
https://feedparser.readthedocs.io

What is your recommended way to convert feedparser's date representation to a datetime object? #321

Closed slidenerd closed 9 months ago

slidenerd commented 2 years ago

I think this question belongs here rather than on Stack Overflow because, as the library author, you would be best placed to answer it.

Issues I referenced before asking: https://github.com/kurtmckee/feedparser/issues/212 https://github.com/kurtmckee/feedparser/issues/51

Problem

How to reproduce this problem


def md5(text):
    import hashlib
    return hashlib.md5(text.encode('utf-8')).hexdigest()

def fetch():
    import feedparser
    data = feedparser.parse('https://cointelegraph.com/rss')
    return data

async def insert(rows):
    import asyncpg
    async with asyncpg.create_pool(user='postgres', database='postgres') as pool:
        async with pool.acquire() as conn:
            results = await conn.executemany('INSERT INTO test (feed_item_id, pubdate) VALUES($1, $2)', rows)
            print(results)

async def main():
    data = fetch()
    first_entry = data.entries[0]
    await insert([(md5(first_entry.guid), first_entry.published)])
    await insert([(md5(first_entry.guid), first_entry.published_parsed)])

import asyncio
asyncio.run(main())

Both insert statements above fail: the first passes the date as a string and the second passes a time.struct_time.

What have I found so far?

I found three methods, but each seems to have a limitation.

Method 1

Convert it with strptime

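from datetime import datetime
from time import mktime

import feedparser

data = feedparser.parse('https://cointelegraph.com/rss')
pubdate = data.entries[0].published                # raw date string from the feed
pubdate_parsed = data.entries[0].published_parsed  # time.struct_time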

>>> pubdate
'Thu, 04 Aug 2022 06:53:42 +0100'

I could do this


>>> method1 = datetime.strptime(pubdate, '%a, %d %b %Y %H:%M:%S %z')
>>> method1
datetime.datetime(2022, 8, 4, 6, 53, 42, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600)))

I am guessing this would raise an error if a feed returns a date in a different format. I am also not sure whether it works when an extra leap second is added.
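For RFC 822-style dates like the one above, the standard library's email.utils.parsedate_to_datetime may be a more forgiving alternative to a hand-written strptime format. Below is a minimal sketch, not a definitive recipe. The helper name parse_pubdate is hypothetical, and returning None for unparseable input is an assumption about how bad feeds should be handled.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def parse_pubdate(raw):
    """Parse an RFC 822/2822 date string; return None if it cannot be parsed."""
    try:
        return parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        # Invalid input raises TypeError on older Python versions
        # and ValueError on newer ones, so both are caught here.
        return None


parsed = parse_pubdate('Thu, 04 Aug 2022 06:53:42 +0100')
print(parsed)                           # aware datetime carrying the feed's +01:00 offset
print(parsed.astimezone(timezone.utc))  # the same instant expressed in UTC
print(parse_pubdate('not a date'))      # unparseable input yields None in this sketch
```

This only covers feeds that use the RFC 822 family of formats. Feeds that use other date formats would still need feedparser's own parsing via published_parsed.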

Method 2


>>> datetime.fromtimestamp(mktime(pubdate_parsed))
datetime.datetime(2022, 8, 4, 5, 53, 42)

This seems to lose the timezone information completely, or am I wrong about that? What happens here if DST is in effect?
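One way to sidestep both the local-timezone and the DST concern is to skip the timestamp round trip and build the datetime from the tuple fields directly. This is a sketch under the assumption that published_parsed is already expressed in UTC. The helper name struct_time_to_utc is hypothetical, and the struct_time is hard-coded here so the example runs without fetching a feed.

```python
import time
from datetime import datetime, timezone


def struct_time_to_utc(st):
    """Build an aware datetime from the first six fields of a UTC struct_time."""
    # st[:6] is (year, month, day, hour, minute, second); no local-time
    # conversion happens, so the TZ setting and DST do not affect the result.
    return datetime(*st[:6], tzinfo=timezone.utc)


# Same values as the published_parsed of the entry shown above.
pubdate_parsed = time.struct_time((2022, 8, 4, 5, 53, 42, 3, 216, 0))
aware = struct_time_to_utc(pubdate_parsed)
print(aware.isoformat())  # 2022-08-04T05:53:42+00:00
```

Note that the datetime constructor only accepts seconds from 0 to 59, so a struct_time with tm_sec of 60 would raise ValueError with this approach.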

Method 3

Requires a third-party library called dateutil, as shown in https://stackoverflow.com/a/18726020/5371505

Question

Thank you for your time

mattzque commented 1 year ago

I'm not the developer, but they do document it here: https://feedparser.readthedocs.io/en/latest/date-parsing.html#advanced-date

Different feed types and versions use wildly different date formats. Universal Feed Parser will attempt to auto-detect the date format used in any date element, and parse it into a standard Python 9-tuple in UTC

So I believe to create a timezone aware datetime object, you would do something like:

from time import mktime
from datetime import datetime, timezone
datetime.fromtimestamp(mktime(pubdate_parsed), timezone.utc)

armhold commented 5 months ago

from time import mktime
from datetime import datetime, timezone
datetime.fromtimestamp(mktime(pubdate_parsed), timezone.utc)

I'm wondering if it's more correct to call calendar.timegm() rather than mktime.

The documentation has a chart that explains that mktime() is assuming your struct_time is in local time, whereas timegm() assumes UTC. Is it possible that mktime is working by accident if localtime is set to UTC for you?

I find this stuff genuinely confusing.

armhold commented 5 months ago

Indeed I'm getting different results with this when I change the TZ env var:

#!/usr/bin/python3

import time
from datetime import datetime, timezone
from calendar import timegm
from time import mktime, tzname

print(f"tzname: {tzname}, timezone: {time.timezone}")

pubdate_parsed = time.struct_time((2022, 8, 4, 5, 53, 42, 3, 216, 0))
pub_mktime = datetime.fromtimestamp(mktime(pubdate_parsed), timezone.utc)
pub_timegm = datetime.fromtimestamp(timegm(pubdate_parsed), timezone.utc)

print(f"pubdate_parsed: {pubdate_parsed}")
print(f"        mktime: {pub_mktime}")
print(f"        timegm: {pub_timegm}")

root@69655a7e9e27:/usr/src/app# TZ='EST' python try.py
tzname: ('UTC', 'UTC'), timezone: 0
pubdate_parsed: time.struct_time(tm_year=2022, tm_mon=8, tm_mday=4, tm_hour=5, tm_min=53, tm_sec=42, tm_wday=3, tm_yday=216, tm_isdst=0)
        mktime: 2022-08-04 05:53:42+00:00
        timegm: 2022-08-04 05:53:42+00:00
root@69655a7e9e27:/usr/src/app# TZ='EST' python try.py
tzname: ('EST', 'EST'), timezone: 18000
pubdate_parsed: time.struct_time(tm_year=2022, tm_mon=8, tm_mday=4, tm_hour=5, tm_min=53, tm_sec=42, tm_wday=3, tm_yday=216, tm_isdst=0)
        mktime: 2022-08-04 10:53:42+00:00
        timegm: 2022-08-04 05:53:42+00:00