johnhawkinson / recapupload

Upload documents to RECAP

Time zone skew leads to incorrect dates, sometimes #4

Open johnhawkinson opened 6 years ago

johnhawkinson commented 6 years ago

I assume @bdheath is running Big Cases in UTC? Because the most recent uploads to CL in the Waymo docket:

[Screenshot: 2018-01-29 at 23:11:55]

have tomorrow's date, but the RSS has the 29th when viewed in Pacific (or Eastern):

<item>
<title>3:17-cv-00939 Waymo LLC v. Uber Technologies, Inc. et al</title>
<link>https://ecf.cand.uscourts.gov/cgi-bin/DktRpt.pl?308136</link>
<description>[Transcript Order] (&#x3C;a href=&#x22;https://ecf.cand.uscourts.gov/doc1/035116343766?caseid=308136&#x26;de_seq_num=7416&#x22; &#x3E;2556&#x3C;/a&#x3E;)</description>
<guid isPermaLink="true">https://ecf.cand.uscourts.gov/cgi-bin/DktRpt.pl?308136&#x26;7416</guid>
<pubDate>Tue, 30 Jan 2018 02:28:00 GMT</pubDate>
</item>
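To make the skew concrete, here is a minimal sketch that parses the pubDate from the item above and prints the calendar date in UTC, Eastern, and Pacific. The fixed offsets (UTC-5 and UTC-8) are an assumption that only holds while DST is not in effect, as in January; the variable names are illustrative and not part of recapupload.

```python
from datetime import timedelta, timezone
from email.utils import parsedate_to_datetime

# pubDate copied from the RSS item above.
pub_date = "Tue, 30 Jan 2018 02:28:00 GMT"

# Assumption: fixed standard-time offsets, valid in January (no DST).
EST = timezone(timedelta(hours=-5), "EST")
PST = timezone(timedelta(hours=-8), "PST")

# RFC 822 style dates with a "GMT" zone parse to an aware UTC datetime.
utc_dt = parsedate_to_datetime(pub_date)

utc_date = utc_dt.date().isoformat()
eastern_date = utc_dt.astimezone(EST).date().isoformat()
pacific_date = utc_dt.astimezone(PST).date().isoformat()

print("UTC:    ", utc_date)      # 2018-01-30
print("Eastern:", eastern_date)  # 2018-01-29
print("Pacific:", pacific_date)  # 2018-01-29
```

Formatting the UTC value directly yields the 30th, which matches the date shown on the uploaded entries.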

I'm not sure what the right fix here is. I guess an easy one would be to assume times are in Pacific (or even Eastern), although that might cause problems for the people filing at 2am Eastern or filing near midnight in the Virgin Islands or Guam.

Another approach would be to have a lookup table of time zones for every court.
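A rough sketch of what such a lookup table might look like, keyed on the court ID taken from the RSS link's hostname (ecf.cand.uscourts.gov gives "cand"). The table contents, the court IDs other than cand, the Eastern default, and both function names are assumptions for illustration; a real table would need an entry for every court.

```python
from urllib.parse import urlparse

# Hypothetical, partial table: court ID -> IANA time zone name.
COURT_TIME_ZONES = {
    "cand": "America/Los_Angeles",  # N.D. Cal. (the Waymo docket)
    "nysd": "America/New_York",     # assumed ID for S.D.N.Y.
    "vid": "America/St_Thomas",     # assumed ID for the Virgin Islands
    "gud": "Pacific/Guam",          # assumed ID for Guam
}


def court_id_from_link(link):
    """Return the court ID from a link like https://ecf.cand.uscourts.gov/..."""
    host = urlparse(link).hostname or ""
    parts = host.split(".")
    # Expect exactly ecf.<court>.uscourts.gov
    if len(parts) == 4 and parts[0] == "ecf" and parts[2:] == ["uscourts", "gov"]:
        return parts[1]
    return None


def time_zone_for_link(link, default="America/New_York"):
    """Look up the court's zone name, falling back to an assumed default."""
    return COURT_TIME_ZONES.get(court_id_from_link(link), default)


print(time_zone_for_link("https://ecf.cand.uscourts.gov/cgi-bin/DktRpt.pl?308136"))
```

The returned name could then be passed to zoneinfo.ZoneInfo (Python 3.9+) or a similar tz library to do the actual conversion.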

Any other clever suggestions? @mlissner, have you had to deal with related problems in juriscraper?

johnhawkinson commented 6 years ago

> I assume @bdheath is running Big Cases in UTC?

Oops, or not. It turns out time.strftime() treats its argument as a local time tuple, while feedparser._parse_date() returns a UTC tuple, so the date comes out wrong regardless of the machine's local zone. So a fix would be something like:

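>>> import calendar, os, time
>>> import feedparser
>>> # f is the result of an earlier feedparser.parse() call on the court's RSS feed
>>> os.environ['TZ'] = 'US/Pacific'
>>> time.tzset()
>>> time.strftime('%m/%d/%Y %T %Z', time.localtime(calendar.timegm(feedparser._parse_date(f.entries[0].published))))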
'01/29/2018 20:19:10 PST'

It seems like a waste to use calendar.timegm(), but I don't think there's a real alternative (other than bypassing feedparser._parse_date(), which is surely a bad idea given the hokey time formats present in RSS feeds).
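For what it's worth, the same conversion can probably be done without setting TZ for the whole process, by going from the UTC tuple to an aware datetime. This is only a sketch: utc_struct_to_local is a made-up helper name, the struct_time below stands in for what feedparser would return for the Waymo pubDate above, and the fixed UTC-8 offset is an assumption that ignores DST.

```python
import calendar
import time
from datetime import datetime, timedelta, timezone


def utc_struct_to_local(utc_struct, tz):
    """Convert a UTC time tuple into an aware datetime in the zone tz."""
    epoch_seconds = calendar.timegm(utc_struct)  # reads the tuple as UTC
    return datetime.fromtimestamp(epoch_seconds, tz=tz)


# Stand-in for the parsed pubDate "Tue, 30 Jan 2018 02:28:00 GMT".
utc_struct = time.struct_time((2018, 1, 30, 2, 28, 0, 1, 30, 0))

# Assumption: fixed standard-time offset; a tz database zone would handle DST.
pst = timezone(timedelta(hours=-8), "PST")

local_dt = utc_struct_to_local(utc_struct, pst)
formatted = local_dt.strftime("%m/%d/%Y %H:%M:%S %Z")
print(formatted)  # 01/29/2018 18:28:00 PST
```

This still goes through calendar.timegm(), but it leaves os.environ and time.tzset() alone, which matters if other code in the same process depends on the local zone.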