tdwg / bdq

Biodiversity Data Quality (BDQ) Interest Group
https://github.com/tdwg/bdq

TG2 - Note Test dependencies and/or workflows #186

Open Tasilee opened 4 years ago

Tasilee commented 4 years ago

Beyond the agreed VALIDATE - AMEND - RE-VALIDATE, we will have a subset of tests that are dependent on a prior sequence of other tests having been run. We have a dependency worksheet that I have now updated: https://docs.google.com/spreadsheets/d/1cWxf1vABLHuO9g1NhpjYCHU_bXWkiV1NCenpx59pkkI but it will need editing given changes to some tests, such as #147.

tucotuco commented 4 years ago

It's good to have the spreadsheet with all of the tests in one place to peruse, but I am concerned that the spreadsheet does not allow us to define the sequence - only the dependencies. In addition, dependency could be a little confusing, partly due to the question of which is dependent on which. Does it mean, I shouldn't be run until those in my dependency list are run? Or does it mean, those in my dependency list depend on me? I guess from the entries that it should be the former, but for the uninitiated, that might not be enough. In any case, the order of tests in a given sequence is what is really important, especially since it could be extremely useful to run some of them before and after a sequence of tests. The way it is now, it seems like there is some circularity in dependencies. The chain that includes considerations of coordinates and country codes is a good example, and a good one to make a model from.
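
To illustrate the reading "I shouldn't be run until those in my dependency list are run", here is a minimal Python sketch that turns such lists into one valid run order and flags circularity. The test labels (`TEST_A`, `TEST_X`, and so on) and the function `run_order` are hypothetical placeholders, not entries from the worksheet; the sketch relies only on the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter


def run_order(depends_on):
    """Return one valid run order.

    Each key is a test; its list holds the tests that must run before it
    (the "former" reading of dependency discussed above).
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical labels, for illustration only.
acyclic = {
    "TEST_C": ["TEST_A", "TEST_B"],
    "TEST_B": ["TEST_A"],
    "TEST_A": [],
}
order = run_order(acyclic)

# A circular pair: each test waits on the other, so no order exists.
circular = {"TEST_X": ["TEST_Y"], "TEST_Y": ["TEST_X"]}
try:
    run_order(circular)
    has_cycle = False
except CycleError:
    has_cycle = True
```

A dependency list only constrains the order; where several orders satisfy it, a sequence still has to be chosen explicitly, which is the gap noted above.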

Tasilee commented 4 years ago

Thanks @tucotuco. I just wanted the work that Arthur in particular put into that spreadsheet not to be lost in the path toward sequences. Given what we know about dependencies, it was always going to require something equivalent to a decision tree as @pzermoglio had done for date/time.

I agree that the spatial 'tests' (we need a better name BTW) would be a good place to start.

tucotuco commented 4 years ago

Is anyone in particular tasked with trying to make trees? Should we use Paula's model (where is it)? Should we distribute the task? Or better that someone does all the first drafts and then we all review?

Tasilee commented 4 years ago

This would be a nice AI application: Recursion and all that.

Regarding @pzermoglio's diagram, I sent the XML file that Paula made in draw.io, and I was able to drag and drop it back into draw.io without problems. I got that file from Paula's original email to us. I'm on the iPad at the moment, so I will send a link to the drawing when I get to a PC.

A decision tree would seem the easiest way to understand/communicate the sequences but another form may be more suited to a standard. Sharing load: Probably best as the overall task could be a lot for anyone. I’d prefer to focus on the test datasets in what time I have available, and as you all know, I am not one for detail :)

Tasilee commented 1 year ago

Can we agree that the tests are to be treated as independent, other than the overall process of VALIDATE-AMEND-VALIDATE? In other words, none of the tests are dependent on a workflow?

chicoreus commented 1 year ago

@Tasilee order does matter for a few Amendments.

Draft section 5.1.3:

5.1.3. Amendments where order is important (Normative)

When Amendments are executed in a workflow where downstream Amendments operate on data with the changes proposed by upstream Amendments applied, the following sequences SHOULD be followed. Similarly, when Amendments are executed in parallel, these sequences SHOULD be applied.

Amendments that propose a value for a primary term from secondary terms are given priority over those that back-fill secondary terms from a primary term. Accordingly, AMENDMENT_EVENT_FROM_EVENTDATE SHOULD be run after the following Amendments that propose changes to dwc:eventDate: AMENDMENT_EVENTDATE_FROM_VERBATIM, AMENDMENT_EVENTDATE_FROM_YEARMONTHDAY, AMENDMENT_EVENTDATE_FROM_YEARSTARTDAYOFYEARENDDAYOFYEAR, AMENDMENT_EVENTDATE_STANDARDIZED.

AMENDMENT_SCIENTIFICNAME_FROM_TAXONID SHOULD be run after this Amendment, which proposes changes to dwc:taxonID: AMENDMENT_TAXONID_FROM_TAXON.

Where multiple Amendments on secondary terms could propose conflicting changes to a primary term, the sequence of Amendments SHOULD be ordered.

The following Amendments SHOULD be composed to run in an ordered sequence: first, AMENDMENT_EVENTDATE_FROM_VERBATIM; second, AMENDMENT_EVENTDATE_FROM_YEARSTARTDAYOFYEARENDDAYOFYEAR; and finally, AMENDMENT_EVENTDATE_FROM_YEARMONTHDAY.
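
As a rough illustration of how a pipeline might check a planned run order against the draft text above, here is a minimal Python sketch. The Amendment labels are taken from the draft; the `MUST_PRECEDE` table and the `violations` function are hypothetical names for this sketch, not part of the standard or of any existing implementation.

```python
# (earlier, later) pairs read off draft section 5.1.3 above.
MUST_PRECEDE = [
    ("AMENDMENT_EVENTDATE_FROM_VERBATIM", "AMENDMENT_EVENT_FROM_EVENTDATE"),
    ("AMENDMENT_EVENTDATE_FROM_YEARMONTHDAY", "AMENDMENT_EVENT_FROM_EVENTDATE"),
    ("AMENDMENT_EVENTDATE_FROM_YEARSTARTDAYOFYEARENDDAYOFYEAR",
     "AMENDMENT_EVENT_FROM_EVENTDATE"),
    ("AMENDMENT_EVENTDATE_STANDARDIZED", "AMENDMENT_EVENT_FROM_EVENTDATE"),
    ("AMENDMENT_TAXONID_FROM_TAXON", "AMENDMENT_SCIENTIFICNAME_FROM_TAXONID"),
    ("AMENDMENT_EVENTDATE_FROM_VERBATIM",
     "AMENDMENT_EVENTDATE_FROM_YEARSTARTDAYOFYEARENDDAYOFYEAR"),
    ("AMENDMENT_EVENTDATE_FROM_YEARSTARTDAYOFYEARENDDAYOFYEAR",
     "AMENDMENT_EVENTDATE_FROM_YEARMONTHDAY"),
]


def violations(sequence):
    """Return the (earlier, later) pairs that a planned sequence breaks.

    Pairs where either Amendment is absent from the sequence are ignored,
    so a pipeline that runs only a subset of Amendments can still be checked.
    """
    position = {name: index for index, name in enumerate(sequence)}
    return [
        (earlier, later)
        for earlier, later in MUST_PRECEDE
        if earlier in position
        and later in position
        and position[earlier] > position[later]
    ]


good = [
    "AMENDMENT_EVENTDATE_FROM_VERBATIM",
    "AMENDMENT_EVENTDATE_FROM_YEARSTARTDAYOFYEARENDDAYOFYEAR",
    "AMENDMENT_EVENTDATE_FROM_YEARMONTHDAY",
    "AMENDMENT_EVENTDATE_STANDARDIZED",
    "AMENDMENT_EVENT_FROM_EVENTDATE",
]
bad = ["AMENDMENT_SCIENTIFICNAME_FROM_TAXONID", "AMENDMENT_TAXONID_FROM_TAXON"]

good_violations = violations(good)
bad_violations = violations(bad)
```

Because the draft uses SHOULD rather than MUST, a checker like this would more naturally report warnings than reject a workflow outright.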

Tasilee commented 1 year ago

Thanks @chicoreus - that makes sense.

Regarding a shared area where we can all see the standard doc, I haven't a clue on what could substitute for Google Docs. Any ideas?

chicoreus commented 1 year ago

The logical place for long-term maintenance would be a markdown document in the tdwg/bdq GitHub repository. Simultaneous collaborative editing wouldn't be possible, but edits in forks, with pull requests and merges of those serving as an editorial process, would be supported.

Tasilee commented 1 year ago

Using GitHub for the document would be good, but I'd suggest that option when we have a final draft. There will be a lot of changes over the next month or so.

ArthurChapman commented 1 year ago

I agree - GitHub is not an ideal platform to write a document. What is wrong with Google Docs? We all use it and it works well.

chicoreus commented 1 year ago

Some notes from TG2 call:

Validations are independent; among them, order does not matter.

Some amendments are interdependent, and the order in which they are run (or the ordering used for resolving conflicts) matters.

Most interdependencies are resolved by running in three phases. First, execute all validations (in parallel or in sequence). When these are complete, run all amendments (in parallel or in sequence, but respecting a few key interdependencies). When these are complete, run all validations again (in parallel or in sequence), or only those validations where amendments proposed changes, on the data with the proposed amendments applied. Comparing the pre-amendment and post-amendment validation results gives an assessment of the data quality (for core purposes) and of how much that quality would improve if all of the changes proposed by the amendments were accepted.
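
A minimal Python sketch of the validate, amend, re-validate comparison described in these notes, under simplifying assumptions: the record is a plain dict keyed by Darwin Core term names, results are reduced to the strings `COMPLIANT` and `NOT_COMPLIANT`, and the two test functions are toy stand-ins rather than the actual test specifications. Here downstream amendments see the proposals of upstream ones, which is only one of the options discussed.

```python
from datetime import date


def validation_eventdate_notempty(record):
    """Toy validation: is dwc:eventDate populated?"""
    return "COMPLIANT" if record.get("dwc:eventDate") else "NOT_COMPLIANT"


def amendment_eventdate_from_yearmonthday(record):
    """Toy amendment: propose dwc:eventDate from year, month and day."""
    if record.get("dwc:eventDate"):
        return {}
    try:
        year, month, day = (
            int(record[term]) for term in ("dwc:year", "dwc:month", "dwc:day")
        )
        return {"dwc:eventDate": date(year, month, day).isoformat()}
    except (KeyError, TypeError, ValueError):
        return {}  # nothing usable to propose from


def validate_amend_validate(record, validations, amendments):
    """Run validations, collect proposed changes, then re-validate."""
    pre = {v.__name__: v(record) for v in validations}
    proposed = {}
    for amend in amendments:  # list order is the run order
        proposed.update(amend({**record, **proposed}))
    amended = {**record, **proposed}  # the original record is left untouched
    post = {v.__name__: v(amended) for v in validations}
    return pre, proposed, post


record = {"dwc:eventDate": "", "dwc:year": "2020", "dwc:month": "6", "dwc:day": "17"}
pre, proposed, post = validate_amend_validate(
    record,
    [validation_eventdate_notempty],
    [amendment_eventdate_from_yearmonthday],
)
```

Comparing `pre` with `post` gives the before and after picture; `proposed` stays separate from the record, so accepting the changes remains a decision for the data consumer.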

chicoreus commented 1 year ago

Minimal safe workflows, when passing SingleRecords through a data processing pipeline:

(1) Run all validations, in sequence, or in parallel.

(2) Run all amendments, with a few having a required sequence (otherwise in sequence or in parallel), then when complete, run all validations (in sequence or in parallel), with proposed changes from amendments applied to the data.

(3) Run all validations, in sequence or in parallel; then, when these are completed, run all amendments, with a few having a required sequence (otherwise in sequence or in parallel); then, when complete, run all validations (in sequence or in parallel), with proposed changes from amendments applied to the data.

Workflows that combine multiple pipelines of amendments and validations need to be cognizant of information elements shared among tests and of the order dependence of a small set of annotations.

When operating on distinct values in a MultiRecord, attention needs to be paid to test interdependencies.