Closed (jbothma closed 4 months ago)
@pudo is coverage.start the right key for listing when datasets were added? How would you feel about calling it coverage.created instead?
The current need is to express when the dataset was added to OpenSanctions, right? If we didn't already have coverage.end, we'd call it created or something?
For a snapshot dataset, coverage.start should be the same as coverage.end if we're adding it after the snapshot was taken, so that start <= end. But then it won't show up in our "recently added" list.
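To make the snapshot case concrete, here is a minimal sketch. It assumes a "recently added" list that filters on coverage.start; the Dataset and Coverage classes, the dataset names, and the dates are all hypothetical and are not the nomenklatura API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Coverage:
    start: date
    end: date


@dataclass
class Dataset:
    name: str
    coverage: Coverage


def recently_added(datasets: List[Dataset], since: date) -> List[str]:
    """Names of datasets whose coverage.start is on or after `since`.

    This treats coverage.start as the "added" date, which is exactly the
    assumption being questioned in this thread.
    """
    return [d.name for d in datasets if d.coverage.start >= since]


# Hypothetical snapshot taken in 2019 but only added to the collection in 2024:
# start == end == the snapshot date, so start says nothing about when it was added.
snapshot = Dataset("example_snapshot", Coverage(date(2019, 5, 1), date(2019, 5, 1)))
# Hypothetical continuously updated source that was added in March 2024.
live = Dataset("example_live", Coverage(date(2024, 3, 1), date(2024, 6, 1)))

print(recently_added([snapshot, live], since=date(2024, 1, 1)))  # ['example_live']
```

The snapshot is dropped from the list even though it was added recently, which is the argument for a separate "created" style key.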
coverage.start sounds like we're saying "the period of updates covered by this source is from start to end", but I'm not really sure what that means when we're always publishing its current state. Unless we mean we have point-in-time snapshots of this data over this period. Is that the intent?
Feel free to tell me to just call it coverage.start and not start something :)
I don't have very strong feelings about this. I put coverage.start into NK (nomenklatura) based on hearing that term used a bit in other metadata standards (e.g. Rufus calls it temporal start/end: https://specs.frictionlessdata.io/data-package/ ). DC and Google use complex ISO range formats that we don't support:
In any case, if we want to change it, let's make a PR here: https://github.com/opensanctions/nomenklatura/blob/main/nomenklatura/dataset/coverage.py#L22
We'd like to add a coverage.start value to the metadata of each crawler, which we could then use to list crawlers by date on our website and automatically show what's new and how frequently we're adding crawlers now that we've built up a team. This will probably require some digging in git. Using timestamps from the data could very well lead to the wrong conclusions.
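As a rough sketch of how such a listing could work: the code below sorts crawlers by coverage.start and counts additions per month. The metadata shape (a dict with a nested coverage.start ISO date string) and the crawler names are assumptions for illustration, not the actual website code.

```python
from collections import Counter
from datetime import date
from typing import Dict, List, Tuple

# Hypothetical crawler metadata; real metadata files will look different.
CRAWLERS = [
    {"name": "example_a", "coverage": {"start": "2023-11-02"}},
    {"name": "example_b", "coverage": {"start": "2024-01-15"}},
    {"name": "example_c", "coverage": {"start": "2024-01-28"}},
    {"name": "example_d", "coverage": {}},  # no start date recorded yet
]


def newest_first(crawlers: List[dict]) -> List[Tuple[date, str]]:
    """Return (start date, name) pairs, newest first, skipping undated crawlers."""
    dated = []
    for crawler in crawlers:
        start = crawler.get("coverage", {}).get("start")
        if start is None:
            continue
        dated.append((date.fromisoformat(start), crawler["name"]))
    return sorted(dated, reverse=True)


def added_per_month(crawlers: List[dict]) -> Dict[str, int]:
    """Count how many crawlers were added in each YYYY-MM month."""
    return dict(Counter(d.strftime("%Y-%m") for d, _ in newest_first(crawlers)))


for start, name in newest_first(CRAWLERS):
    print(start.isoformat(), name)
print(added_per_month(CRAWLERS))
```

Crawlers without a start date are skipped rather than guessed, in line with the concern about drawing wrong conclusions from unreliable timestamps.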
Note that we did a big refactor, moving crawlers into the datasets directory, around August 2023.
So this gives the wrong date
This gives the right date, as can be confirmed from the commit message.
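One possible way to script this kind of git digging is sketched below. It is an assumption, not necessarily the command used in this thread: it asks git for the commits in which a file was added while following renames, so that a move during the refactor is not mistaken for the creation date. The helper names and the example path are hypothetical.

```python
import subprocess
from datetime import date
from typing import List, Optional


def git_added_command(path: str) -> List[str]:
    """Build a `git log` call that follows renames and keeps only 'added' commits."""
    return [
        "git", "log",
        "--follow",          # keep tracking the file across renames and moves
        "--diff-filter=A",   # only commits in which the file was added
        "--format=%ad",      # print the author date of each matching commit
        "--date=short",      # as YYYY-MM-DD
        "--", path,
    ]


def parse_first_added(output: str) -> Optional[date]:
    """Return the oldest date found in the `git log` output, or None if empty."""
    dates = [
        date.fromisoformat(line.strip())
        for line in output.splitlines()
        if line.strip()
    ]
    return min(dates) if dates else None


def first_added(path: str, repo: str = ".") -> Optional[date]:
    """Run git in `repo` and return the date `path` was first added."""
    result = subprocess.run(
        git_added_command(path),
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return parse_first_added(result.stdout)


# Usage (hypothetical path, run inside a checkout of the repository):
# print(first_added("datasets/example/example.yml"))
```

Rename detection in git is heuristic, so a heavily rewritten file may still report the refactor date. Checking the commit message of the reported commit, as suggested above, remains a sensible sanity check.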