At time of scrape, store a copy of all extensions used

odscjames commented 5 years ago

Scenario:

We scrape some data that uses extensions
3 months pass .......
The extension changes, and it's an unversioned extension, so it just changes
3 months pass .......
We load the data again because we want to evaluate it, and check it

PROBLEM! The data now fails validation because we are checking old data against a new extension schema: a bunch of fields are missing, have the wrong type, etc. This is unfair! We really need to be checking the data against the extension as it was at the time of the original scrape!

SOLUTION: When we get data, also get copies of the extensions, schemas, codelists, etc., and save them alongside the files! When we recheck later, use these copies.
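
A minimal sketch of this idea in Python, not Kingfisher's actual implementation. The names `snapshot_extensions` and `load_snapshot`, the `manifest.json` layout and the injected `fetch` callable are all hypothetical; in practice `fetch` would be whatever HTTP client the scraper already uses.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Iterable


def snapshot_extensions(
    urls: Iterable[str],
    target_dir: Path,
    fetch: Callable[[str], bytes],
) -> Path:
    """Save one copy of each URL's content in target_dir and write a manifest.

    `fetch` is an assumption: any callable that returns the bytes at a URL.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    files = {}
    for url in sorted(set(urls)):
        content = fetch(url)
        digest = hashlib.sha256(content).hexdigest()
        # Files are named by content hash, so identical content is stored once.
        (target_dir / digest).write_bytes(content)
        files[url] = {"sha256": digest}
    manifest = {
        # Record when the copies were taken, i.e. the time of the scrape.
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "files": files,
    }
    manifest_path = target_dir / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest_path


def load_snapshot(target_dir: Path, url: str) -> bytes:
    """Return the stored copy of `url`, for use when rechecking old data."""
    manifest = json.loads((target_dir / "manifest.json").read_text())
    digest = manifest["files"][url]["sha256"]
    return (target_dir / digest).read_bytes()
```

A later recheck would then call `load_snapshot` for each extension file instead of requesting the URL again, so the check sees the extension as it was when the data was scraped.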

ARE EXTENSIONS VERSIONED OR NOT? Obviously, if the extension is versioned properly and we trust that versioning to be done well then this won't be a problem at all. So the question is - how many extensions are unversioned? How bad a problem is this?

Realised when reading https://github.com/open-contracting/lib-cove-ocds/issues/9#issuecomment-468706578

jpmckinney commented 5 years ago

Most extensions are unversioned, and that's not likely to change to the point that most are versioned, i.e. we will have to handle the case of unversioned extensions for the foreseeable future.

Thinking through different scenarios: If a publisher has data, then changes its extensions in a backwards-incompatible way, but doesn't update its old data, then that data should fail. It doesn't matter that, at one time, its data and schema matched such that it would pass. We require that any presently accessible data match any presently referenced schema.

So, in the above scenario, if the publisher did go back and change their old data to match the updated extension, we should re-download that old data before re-checking it.

If they didn't go back, and now their old data has errors according to the updated extension, then that is a true error and isn't unfair.

Publishers shouldn't be making backwards-incompatible changes to their extensions, and if they are, then they should at least version them (or publish them at different URLs).

robredpath commented 5 years ago

We're planning on keeping a record of pretty much everything we ever do on Kingfisher, right? So, we should still be able to say with confidence that a certain publisher's data passed validation on a certain date, even if we can't now reproduce that because the extensions that it uses have changed?

jpmckinney commented 5 years ago

Yes to the second question – I don't think we have a use case for re-checking year-old data against year-old extensions, but we do have a use case to say "publisher X passed validation at time Y" – though, regarding the first question, that doesn't necessarily need database support, as we'll have logged that fact in feedback reports, MEL measurements, etc.

odscjames commented 5 years ago

> Thinking through different scenarios:

There are 2 different scenarios here, though.

We get all data and check it, 6 months pass, we get the latest copy of all data again and check it again - that's fine. In this case the latest version of the data should match the current version of the extensions, and that will be what is checked.
We get all data and check it, 6 months pass, we want to know something about that data so we load the old local copy and check it again. This is where the problem arises.

Maybe the second scenario is very unlikely, but one of our analysts is doing just that right now (because the publisher has stopped publishing).

> We're planning on keeping a record of pretty much everything we ever do on Kingfisher, right?

Are we planning on doing that outside Kingfisher? At the moment if you delete a collection from Kingfisher you delete the check results too.

jpmckinney commented 5 years ago

For the second scenario, it makes sense to store the schemas, etc. somewhere. So, we might as well store them for all scenarios.

> Are we planning on doing that outside Kingfisher?

Leave it up to the user. Feedback reports and MEL reports will mention results; granular results can then be discarded. When Kingfisher is used for another purpose, I assume the relevant results will be captured at least as prose somewhere… If anyone uses Kingfisher and never reports any results anywhere else, then I assume that person won't be deleting their collections…

jpmckinney commented 5 years ago

These commits might be relevant in the old Kingfisher:

https://github.com/open-contracting/kingfisher/commit/59a131b5164b0cca663295bd2896636f98628145
https://github.com/open-contracting/kingfisher/commit/03f969b3aede3d2c8134626bfa21fe1f0621c623
https://github.com/open-contracting/kingfisher/commit/c20cadc55dcc6e0c1d1439ffb2f998191fc6d5c6

jpmckinney commented 4 years ago

With the next version of the Extension Registry Python Package, if Kingfisher Process downloads all unique extensions referenced by packages (e.g. after closing the collection), then ProfileBuilder can use those downloaded extensions to generate an ad-hoc 'profile'. That profile can be made available to other steps (e.g. the check step, if/when lib-cove-ocds allows passing in a schema), so that the extensions don't need to be retrieved at the time the check is performed.
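
As a rough illustration of the "all unique extensions referenced by packages" step only, here is a sketch in Python. It assumes each package is an already-parsed dict and relies on OCDS packages listing extension URLs in a top-level `extensions` array. The function name `unique_extensions` is hypothetical, and the sketch does not call Kingfisher Process or ProfileBuilder.

```python
from typing import Dict, Iterable, List


def unique_extensions(packages: Iterable[dict]) -> List[str]:
    """Return each extension URL declared by any package, once, in first-seen order."""
    seen: Dict[str, None] = {}
    for package in packages:
        # A package may omit `extensions` or set it to null; treat both as empty.
        for url in package.get("extensions") or []:
            seen.setdefault(url, None)
    return list(seen)
```

The resulting list is what would be downloaded once per collection and then handed to whatever builds the profile.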

jpmckinney commented 4 years ago

The Extension Registry Python Package can now generate an extended package schema (like CoVE does).

jpmckinney commented 2 years ago

@yolile Do you think this feature is needed? We can do it (e.g. update the collection when the first release is merged, and assume that all releases use the same extensions), but I'm not sure anyone has ever needed this.

yolile commented 2 years ago

> I'm not sure anyone has ever needed this.

I'm not sure either. The only case I can recall is a publisher using an old version of OCDS for PPPs, but in that case, the problem was they were using and declaring the old version of the extension instead of the most recent one.

In general, I think the higher-risk scenario is when a partner uses a community extension, and the extension owner updates the extension for their own purposes and doesn't communicate this to anyone. But this will be a problem for any data user or tool. So maybe it is more a problem with how extensions work than with Kingfisher Process. For us using Kingfisher Process, I guess we can always manually check whether a dataset is failing due to problems with an extension, check the changes to that extension in GitHub or similar, and decide whether the issue is the publisher's concern or not, and whether they maybe need to create their own extension now. So for me, it is better to actually raise the validation error rather than kind of hide it.

jpmckinney commented 2 years ago

Sounds good 👍

open-contracting / kingfisher-process

At time of scrape, store a copy of all extensions used #122