mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

what timezone is `indexed_date` returned/interpreted as? #247

Closed rahulbot closed 9 months ago

rahulbot commented 9 months ago

Every story has an indexed_date added to it. This date-time is returned to users as a property of stories as they page through them. This is helpful for systems that are looking for new stories, because they can pass in a filter clause on the query that uses the indexed_date field to select only stories after a specified date (i.e. the latest indexed_date they fetched last time). However, to do this properly we need to know what timezone indexed_date is returned in, and what timezone a client should send it in so that the filter works properly.

(BTW: I hate timezones)

philbudne commented 9 months ago

UTC(*). indexed_date is taken from parsed_date, if present, or is the current UTC date/time: https://github.com/mediacloud/story-indexer/blob/main/indexer/workers/importer.py#L139

parsed_date is added here: https://github.com/mediacloud/story-indexer/blob/main/indexer/workers/parser.py#L135

BUT for stories loaded from WARC files written before parsed_date was added (i.e., everything reloaded earlier this month), parsed_date is taken from the timestamp on the story in the WARC file (which, to be a least common denominator, does not contain fractional seconds), in this code: https://github.com/mediacloud/story-indexer/blob/main/indexer/story_archive_writer.py#L367

So new parsed_dates will have microseconds (including those created from "historical" data, which will be the date the old article was parsed), but ones reloaded from WARCs (written before parsed_date was added) will not.
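To make the two cases above concrete, here is a minimal sketch, not the actual importer code. The helper name `choose_indexed_date` and the ISO 8601 string format are assumptions for illustration; the linked importer.py is the authority on what is really stored.

```python
import datetime as dt
from typing import Optional


def choose_indexed_date(parsed_date: Optional[str]) -> str:
    """Hypothetical helper: prefer parsed_date, else use the current UTC time.

    The result is a naive ISO 8601 string (no offset), to be read as UTC.
    """
    if parsed_date:
        return parsed_date
    now_utc = dt.datetime.now(dt.timezone.utc)
    # Drop tzinfo so the string carries no offset (assumed storage format).
    return now_utc.replace(tzinfo=None).isoformat()


# A freshly parsed story keeps microseconds ...
fresh = dt.datetime(2024, 3, 1, 12, 34, 56, 789012).isoformat()
# ... while a timestamp recovered from an older WARC has whole seconds only.
from_warc = dt.datetime(2024, 3, 1, 12, 34, 56).isoformat()

print(fresh)      # 2024-03-01T12:34:56.789012
print(from_warc)  # 2024-03-01T12:34:56
```

Anything comparing or sorting indexed_date values as strings should expect both shapes.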

(*) in accordance with my assertion that picking a "local" timezone in an application deployed or used in multiple timezones would be favoritism. Datetimes and timezones are a pain, and Python somehow manages to make it worse, as I wrote here recently: https://github.com/mediacloud/story-indexer/blob/main/indexer/workers/hist-queuer.py#L94

As in the above code, adding " +00:00" to indexed_date and feeding it to datetime.datetime.fromisoformat should give you an unambiguous (non-naive) datetime, from which a timestamp can be generated, if needed.
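A minimal sketch of that conversion, assuming indexed_date arrives as a naive ISO 8601 string. The function name `indexed_date_to_utc` and the sample value are made up for illustration. This sketch appends the offset without a leading space, a form that datetime.fromisoformat accepts since Python 3.7.

```python
import datetime as dt


def indexed_date_to_utc(indexed_date: str) -> dt.datetime:
    """Hypothetical helper: read a naive indexed_date string as UTC."""
    return dt.datetime.fromisoformat(indexed_date + "+00:00")


aware = indexed_date_to_utc("2024-03-01T12:34:56")

print(aware.isoformat())  # 2024-03-01T12:34:56+00:00
print(aware.utcoffset())  # 0:00:00
print(aware.timestamp())  # 1709296496.0
```

Because the result is timezone-aware, timestamp() does not depend on the local timezone of the machine running the code.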

philbudne commented 9 months ago

P.S. I have less knowledge of what happens inside ES (except that I seem to recall the default resolution is in milliseconds, so I wouldn't be surprised if you never see more than three decimal places on retrieval). All of the above is what we DELIVER to ES; I suppose we could still be buggered by ES interpreting what we pass to it as being in the local time zone (EEEEEEK!)
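If millisecond resolution does turn out to apply, a client can compare at that precision. This is a rough sketch using only the Python standard library; it makes no claim about how ES itself stores or parses dates, and `to_millis` is a made-up name.

```python
import datetime as dt


def to_millis(when: dt.datetime) -> str:
    """Hypothetical helper: ISO 8601 string truncated to milliseconds."""
    return when.isoformat(timespec="milliseconds")


stamp = dt.datetime(2024, 3, 1, 12, 34, 56, 789012, tzinfo=dt.timezone.utc)

print(to_millis(stamp))  # 2024-03-01T12:34:56.789+00:00
```

Keeping the explicit +00:00 offset in the string leaves no room for a local-timezone interpretation by whatever consumes it.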

rahulbot commented 9 months ago

Thanks. This (UTC) matches what I'm seeing in a trace test I was doing. That explanation helps a lot.