opensanctions / crawler-planning

Task tracking for the crawlers we're working on
https://github.com/orgs/opensanctions/projects/2

Start date for each crawler #76

Closed: jbothma closed this issue 4 months ago

jbothma commented 4 months ago

We'd like to add a coverage.start value to the metadata of each crawler. We could then use it to list crawlers by date on our website, automatically show what's new, and show how frequently we're adding crawlers now that we've built up a team.

This will probably require some digging in git. Using timestamps from the data could very well lead to the wrong conclusions.

Note that we did a big refactor, moving crawlers into the datasets directory, around August 2023.

So this gives the wrong date (the date of the move, not of the original commit):

$ git log --follow --reverse datasets/ru/rupep/ru_rupep.yml 
commit 8d0d64646bab08f71f029dacfe0f774435aa7bbc
Author: Friedrich Lindenberg <friedrich@pudo.org>
Date:   Fri Aug 11 07:56:38 2023 +0200

    move all the crawler metadata

This gives the right date:

$ git log --follow datasets/ru/rupep/ru_rupep.yml | tail -5
commit 2be191cd6f6e4182f2ae70f9e4b341b56dd5d879
Author: Friedrich Lindenberg <friedrich@pudo.org>
Date:   Fri Feb 25 13:34:05 2022 +0100

    Ingest rupep data

The commit message confirms that this is the original commit.
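The same lookup could be scripted for every crawler. Below is a minimal sketch, assuming it runs inside a checkout of the repository with git on the PATH. The helper names `parse_oldest_date` and `first_commit_date` are invented for illustration and are not part of any existing tooling. It asks git for Unix author timestamps (`%at`) so that parsing does not depend on locale or date formatting.

```python
import subprocess
from datetime import date, datetime, timezone
from typing import Optional


def parse_oldest_date(log_output: str) -> Optional[date]:
    """Return the UTC date of the last timestamp in `git log --format=%at` output.

    git log prints newest commits first, so the last line is the oldest commit.
    """
    stamps = [line.strip() for line in log_output.splitlines() if line.strip()]
    if not stamps:
        return None
    return datetime.fromtimestamp(int(stamps[-1]), tz=timezone.utc).date()


def first_commit_date(path: str, repo: str = ".") -> Optional[date]:
    """Date of the oldest commit touching `path`, following renames."""
    # Deliberately no --reverse: in the example above, combining it with
    # --follow returned the "move all the crawler metadata" commit instead.
    result = subprocess.run(
        ["git", "log", "--follow", "--format=%at", "--", path],
        cwd=repo,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_oldest_date(result.stdout)
```

Calling `first_commit_date("datasets/ru/rupep/ru_rupep.yml")` from the repository root would be the scripted equivalent of the `tail` approach. The result is worth spot-checking against commit messages, since rename detection in git is heuristic.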

jbothma commented 4 months ago

@pudo

Is coverage.start the right key for listing when datasets were added? How would you feel about calling it coverage.created instead?

The current need is to express when the dataset was added to opensanctions, right? If we didn't already have coverage.end, we'd call it created or something?

For a snapshot dataset that we add after the snapshot was taken, coverage.start would have to be the same as coverage.end so that start <= end. Then it won't show up in our "recently added" list.
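To make the concern concrete, here is a rough sketch of a "recently added" filter keyed on coverage.start. The dataset names, the dates, and the `recently_added` helper are all invented for illustration; the real site logic may look quite different.

```python
from datetime import date, timedelta

# Hypothetical metadata: names and dates are made up for illustration.
DATASETS = [
    {"name": "live_source", "coverage": {"start": date(2024, 3, 1)}},
    # A snapshot taken in 2019 that was only added to the repo in 2024.
    {
        "name": "old_snapshot",
        "coverage": {"start": date(2019, 6, 1), "end": date(2019, 6, 1)},
    },
]


def recently_added(datasets, today, days=90):
    """Return names of datasets whose coverage.start is within the last `days` days."""
    cutoff = today - timedelta(days=days)
    return [
        ds["name"]
        for ds in datasets
        if cutoff <= ds["coverage"]["start"] <= today
    ]
```

With `today=date(2024, 4, 1)`, this returns only `live_source`. The snapshot dataset is dropped even though it was just added, which is the case a separate "created" key would cover.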

coverage.start sounds like we're saying "the period of updates covered by this source is from start to end", but I'm not really sure what that means when we're always publishing its current state, unless we mean we have point-in-time snapshots of this data over this period. Is that the intent?

Feel free to tell me to just call it coverage.start and not start something :)

pudo commented 4 months ago

I don't have very strong emotions about this. I put coverage.start into NK (nomenklatura) based on hearing that term used a bit in other metadata standards (e.g. Rufus calls it temporal start/end: https://specs.frictionlessdata.io/data-package/ ). DC and Google use complex ISO range formats that we don't support.

In any case, if we want to change it, let's make a PR here: https://github.com/opensanctions/nomenklatura/blob/main/nomenklatura/dataset/coverage.py#L22

jbothma commented 4 months ago

Completed in https://github.com/opensanctions/opensanctions/pull/611