Closed rahulbot closed 9 months ago
UTC(*). indexed_date is taken from parsed_date, if present, or is the current UTC date/time: https://github.com/mediacloud/story-indexer/blob/main/indexer/workers/importer.py#L139
parsed_date is added here: https://github.com/mediacloud/story-indexer/blob/main/indexer/workers/parser.py#L135
BUT for stories loaded from WARC files written before parsed_date was added (i.e., everything reloaded earlier this month), parsed_date is taken from the timestamp on the story in the WARC file (which, to be the least common denominator, does not contain fractional seconds), in this code:
https://github.com/mediacloud/story-indexer/blob/main/indexer/story_archive_writer.py#L367
So new parsed_dates will have microseconds (including those created from "historical" data, which will be the date the old article was parsed), but ones reloaded from WARCs (written before parsed_date was added) will not.
(*) in accordance with my assertion that picking a "local" timezone in an application deployed or used in multiple timezones would be favoritism. Datetimes and timezones are a pain, and Python somehow manages to make it worse, as I wrote here recently: https://github.com/mediacloud/story-indexer/blob/main/indexer/workers/hist-queuer.py#L94
As in the above code, adding " +00:00" to indexed_date and feeding it to datetime.datetime.fromisoformat should give you an unambiguous (non-naive) datetime, from which a timestamp can be generated, if needed.
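For anyone consuming indexed_date on the client side, here is a minimal sketch of that conversion. It assumes indexed_date comes back as an ISO 8601 string with no offset (with or without fractional seconds); the helper name parse_indexed_date and the sample values are made up for illustration. The sketch appends the offset without a leading space, because as far as I know fromisoformat does not tolerate whitespace before the offset.

```python
from datetime import datetime, timezone


def parse_indexed_date(indexed_date: str) -> datetime:
    """Turn a naive-looking UTC string into a timezone-aware datetime.

    Assumption: indexed_date is ISO 8601 in UTC with no offset suffix,
    e.g. "2024-03-01T12:34:56.123456" or "2024-03-01T12:34:56".
    """
    return datetime.fromisoformat(indexed_date + "+00:00")


# A freshly parsed story (microseconds) and one reloaded from an old WARC
# (whole seconds only); both sample strings are hypothetical.
fresh = parse_indexed_date("2024-03-01T12:34:56.123456")
reloaded = parse_indexed_date("2024-03-01T12:34:56")

# Both are aware datetimes in UTC, so .timestamp() no longer depends on
# the local timezone of the machine running this code.
print(fresh.isoformat(), fresh.timestamp())
print(reloaded.isoformat(), reloaded.timestamp())
```

Because both forms parse to aware datetimes, they compare and sort correctly against each other even though only one carries microseconds.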
P.S. I have less knowledge of what happens inside ES (except that I seem to recall the default resolution is in milliseconds, so I wouldn't be surprised if you never see more than three decimal places on retrieval). All of the above is what we DELIVER to ES; I suppose we could still be buggered by ES interpreting what we pass to it as being in the local time zone (EEEEEEK!)
Thanks. This (UTC) matches what I'm seeing in a trace test I was doing. That explanation helps a lot.
Every story has an indexed_date added to it. This date-time is returned to users as a property of stories as they page through them. This is helpful for systems that are looking for new stories, because they can pass in a filter clause on the query that uses the indexed_date field to specify only stories after a given date (i.e. the latest one they fetched last time). However, to do this properly we need to know what timezone indexed_date is returned in, and in what timezone a client should send it in order to filter properly. (BTW: I hate timezones)
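Given the answer in the thread (indexed_date is UTC), the sending side could look something like the sketch below. It assumes the filter value should be a UTC ISO 8601 string without an offset, mirroring the assumed return format; the helper name to_utc_filter_value is made up here, and the exact filter clause syntax is left out because it is not specified in this issue.

```python
from datetime import datetime, timedelta, timezone


def to_utc_filter_value(dt: datetime) -> str:
    """Render an aware datetime as an offset-free UTC ISO 8601 string.

    Assumption: the query filter expects UTC with no offset suffix.
    """
    if dt.tzinfo is None or dt.utcoffset() is None:
        # A naive datetime could be in any timezone; do not guess.
        raise ValueError("refusing to guess the timezone of a naive datetime")
    return dt.astimezone(timezone.utc).replace(tzinfo=None).isoformat()


# A hypothetical client running at UTC-05:00 that last fetched at 07:34:56 local.
local = datetime(2024, 3, 1, 7, 34, 56, tzinfo=timezone(timedelta(hours=-5)))
print(to_utc_filter_value(local))  # 2024-03-01T12:34:56
```

Rejecting naive datetimes up front keeps a client from silently sending local time and missing (or re-fetching) a few hours of stories.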