tdwg / bdq

Biodiversity Data Quality (BDQ) Interest Group
https://github.com/tdwg/bdq

TG2 - Time to write code for DQ tests #150

Open Tasilee opened 6 years ago

Tasilee commented 6 years ago

Matthew Collins (via email 27th August 2018)

Without meaning to imply that iDigBio won't be writing code to implement the tests, is there an opportunity here for us to write code with others and for others? Paul, it looks like David is the one committing code to the FP repos. Who else do you want to approach for figuring out how to modularize and implement this code? ALA developers, GBIF, Canadian Museum, ??

Tasilee commented 6 years ago

Most certainly, Matthew. @chicoreus has already written most of the code (Java), but we know that the ALA, GBIF and iDigBio are committed to implementing these tests. So I would hope that we could identify the key developers, in at least these agencies, who are committed to implementation and get them together (obviously with @chicoreus and other interested people) to pool knowledge and help one another develop a quality outcome. The evaluation will be via all implementations producing the same results from the yet-to-be-developed test dataset.

mjcollin commented 6 years ago

List so far, based on TDWG conversations: @tucotuco is interested in doing test data, Matt Blissett (GBIF) is interested but has a hardware refresh to do first, @nrejack (iDigBio) will help but also has a hardware refresh to do, and Simon Checksfield (CSIRO/ALA) said he'd identify someone specific to help via email. And @chicoreus of course.

I think this is a representative enough group to start with an email thread/video call when we all get back from NZ. Any other suggestions for interested people are welcome.

Tasilee commented 6 years ago

To the task. @peterdesmet's presentation yesterday suggested that Whip may give us a GENERIC code solution... that looks very interesting.

I particularly like the human AND machine readable nature!

Tasilee commented 6 years ago

Thanks @mjcollin. I'll fork out a new issue on the test data now (next). We have not progressed that as far as @chicoreus has with the code, but that is academic until we have finalised the tests, which I expect before the end of TDWG 2018.

cgendreau commented 6 years ago

That's great news. You can add me to the list.

ArthurChapman commented 6 years ago

John Waller from GBIF should also be involved

Tasilee commented 6 years ago

FULLY agree @ArthurChapman ...can anyone find him on GitHub and add him onto this issue? I am a GitHub idiot.

jhnwllr commented 6 years ago

@ArthurChapman John Waller is me.

Tasilee commented 6 years ago

:)

ArthurChapman commented 6 years ago

I have added John to our list of admin

ianengelbrecht commented 6 years ago

Thank you for the invitation @mjcollin. In terms of architecture, would it be worthwhile to have separate GitHub repositories or submodules for each language we implement in? It might make the process easier to manage.

chicoreus commented 6 years ago

@Tasilee, in Dunedin, @allankv and I sat down with @peterdesmet and took a look at whether Whip might be suitable for this. We concluded that Whip's tests are too fine-grained to match the tests in the TG2 test set, and that the framework (and the current test suite) is richer than Whip's expressivity allows.

Tasilee commented 6 years ago

Thanks @chicoreus. I got that impression myself in Dunedin. Maybe Whip will get there at some point not too far into the future. So, as I think we said elsewhere, your code seems like the place to start.

jhpoelen commented 6 years ago

hey y'all -

I've been using this tool called Elton that has a check feature for reviewing species interactions datasets. Right now the output format looks like (from travis build) :

namespace message
local found invalid (lat,lng) pair: (-105.1196,38.05489428)
local not setting collection date, because [1945-10.3] could not be read as date.
local not setting collection date, because [1994-6.6] could not be read as date.
local 39520 interaction(s)
local 0 error(s)
local 28 warning(s)

Obviously, I'd very much like to improve this with all sorts of things like error codes, validation levels etc. Since ALA, GBIF, and Elton (and GloBI) are written in Java, I'd very much like to re-use some simple Java validation test logging library. Is there anyone working on such a thing?

I'd imagine it would look something like:


ValidationContext ctx = new ValidationContext("https://example.org/dataset.zip", ...);

new ValidationReporter(ctx).report(Level.ERROR, ErrorCode.CRAPPY_DATE, "[mickey mouse] is not a valid date");

Curious to hear your response.
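A minimal sketch of how such a reporting library might be shaped, for illustration only. Every type here (`ValidationContext`, `ValidationReporter`, `Level`, `ErrorCode`, `Finding`) is hypothetical and does not come from Elton, ALA, GBIF or any existing library. The messages reuse lines from the Elton output above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: none of these types come from an existing library.
class ValidationSketch {

    enum Level { INFO, WARNING, ERROR }

    // Illustrative codes only; a shared library would need an agreed vocabulary.
    enum ErrorCode { INVALID_DATE, INVALID_COORDINATES }

    /** Identifies what is being validated, e.g. a dataset URL or a namespace. */
    static final class ValidationContext {
        final String source;

        ValidationContext(String source) {
            this.source = source;
        }
    }

    /** One reported finding: a level, a code and a human-readable message. */
    static final class Finding {
        final Level level;
        final ErrorCode code;
        final String message;

        Finding(Level level, ErrorCode code, String message) {
            this.level = level;
            this.code = code;
            this.message = message;
        }

        /** Tab-separated line, so the output stays machine readable. */
        String toLine(ValidationContext ctx) {
            return ctx.source + "\t" + level + "\t" + code + "\t" + message;
        }
    }

    /** Collects findings for one context and counts them by level. */
    static final class ValidationReporter {
        private final ValidationContext ctx;
        private final List<Finding> findings = new ArrayList<>();

        ValidationReporter(ValidationContext ctx) {
            this.ctx = ctx;
        }

        void report(Level level, ErrorCode code, String message) {
            Finding finding = new Finding(level, code, message);
            findings.add(finding);
            System.out.println(finding.toLine(ctx));
        }

        long count(Level level) {
            return findings.stream().filter(f -> f.level == level).count();
        }

        List<Finding> findings() {
            return Collections.unmodifiableList(findings);
        }
    }

    public static void main(String[] args) {
        ValidationContext ctx = new ValidationContext("https://example.org/dataset.zip");
        ValidationReporter reporter = new ValidationReporter(ctx);
        reporter.report(Level.WARNING, ErrorCode.INVALID_COORDINATES,
                "found invalid (lat,lng) pair: (-105.1196,38.05489428)");
        reporter.report(Level.WARNING, ErrorCode.INVALID_DATE,
                "[1945-10.3] could not be read as date");
        System.out.println(reporter.count(Level.ERROR) + " error(s)");
        System.out.println(reporter.count(Level.WARNING) + " warning(s)");
    }
}
```

Keeping the code and level as enums rather than free text is one way to let downstream tools filter or aggregate findings without parsing messages.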

chicoreus commented 1 year ago

FilteredPush event_date_qc library (Java), version 3.0.0 released, DOI 10.5281/zenodo.596795. It passes all validation cases for TIME tests in the validation data (TG2_test_validation_data.csv and TG2_test_validation_data_nonprintingchars.csv). It uses the Kurator ffdq-api library for the result objects.

ArthurChapman commented 3 months ago

@mjcollin This has progressed and most of the code has been written and tested against a Validation Test dataset. We are now writing up the documentation etc. prior to submitting to the Executive for a new data quality standard - bdq Core. @chicoreus - is there anything we need to liaise with iDigBio on?

Tasilee commented 3 months ago

Not that I am aware of.

chicoreus commented 3 months ago

Test implementations in Java are almost all passing: 100 test implementations pass all rows of validation data (1293 rows of validation data), there are 5 failure cases in 4 tests, and 20 cases are not tested because there is no test implementation (one of which is intended to be implemented). The implementations are in:

The test implementations, enabled by the fitness for use framework, define APIs where the Darwin Core terms that form the information elements for a test, in a single record, are presented to the test method, and the test returns a framework Response object consisting of at least Response.status, Response.result, and Response.comment.
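To make that shape concrete, here is a minimal sketch of a single-record test that takes one Darwin Core term value and returns a response with a status, a result and a comment. The `Response`, `Status` and `Result` types are simplified stand-ins written for this example, not the ffdq-api classes, and `validateDayInRange` applies a simplified rule for illustration rather than the specification of any test in the standard.

```java
// Simplified stand-ins for illustration only; these are not the ffdq-api classes.
class SingleRecordTestSketch {

    enum Status { RUN_HAS_RESULT, INTERNAL_PREREQUISITES_NOT_MET }

    enum Result { COMPLIANT, NOT_COMPLIANT }

    /** Minimal response: a status, a result (null if the test could not run), a comment. */
    static final class Response {
        final Status status;
        final Result result;
        final String comment;

        Response(Status status, Result result, String comment) {
            this.status = status;
            this.result = result;
            this.comment = comment;
        }

        @Override
        public String toString() {
            return status + " | " + result + " | " + comment;
        }
    }

    /**
     * Hypothetical validation of dwc:day from a single record.
     * Simplified rule for illustration: an empty input means the test cannot run;
     * otherwise the value must be an integer from 1 to 31.
     */
    static Response validateDayInRange(String day) {
        if (day == null || day.trim().isEmpty()) {
            return new Response(Status.INTERNAL_PREREQUISITES_NOT_MET, null, "dwc:day is empty");
        }
        try {
            int value = Integer.parseInt(day.trim());
            if (value >= 1 && value <= 31) {
                return new Response(Status.RUN_HAS_RESULT, Result.COMPLIANT,
                        "dwc:day is in the range 1 to 31");
            }
            return new Response(Status.RUN_HAS_RESULT, Result.NOT_COMPLIANT,
                    "dwc:day is outside the range 1 to 31");
        } catch (NumberFormatException e) {
            return new Response(Status.RUN_HAS_RESULT, Result.NOT_COMPLIANT,
                    "dwc:day is not an integer");
        }
    }

    public static void main(String[] args) {
        // The method sees only the term value, not the record or data set it came from.
        for (String day : new String[] {"15", "42", "", "first"}) {
            System.out.println("[" + day + "] -> " + validateDayInRange(day));
        }
    }
}
```

Because the method depends only on its arguments, it can be called from any execution framework without changes.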

The tests are independent of a test execution framework, and in such a framework, logging of the Response objects could easily be added. But, as the tests are, by design, agnostic to the framework and data flows they are implemented in (they are, for example, agnostic to whether they are being run row by row on data, or are being run on unique values and then being expanded to match rows in a data set), this logging would need to be added within test execution frameworks. The response objects do have the potential to be used for logging, and would facilitate @jhpoelen's suggestion of a consistent logging framework.
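As a rough illustration of the "run on unique values, then expand to rows" option, with logging added on the framework side, the sketch below runs a test once per distinct value and maps the outcomes back onto the rows. `UniqueValueRunnerSketch` and `runOnUniqueValues` are hypothetical names, the stand-in test returns a plain string rather than a framework Response object, and the sample values are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.logging.Logger;

// Hypothetical execution-framework helper; not part of any existing library.
class UniqueValueRunnerSketch {

    private static final Logger LOG = Logger.getLogger("dq-sketch");

    /**
     * Runs a single-value test once per distinct input, logs each outcome,
     * then expands the outcomes back to one entry per row, in row order.
     */
    static <R> List<R> runOnUniqueValues(List<String> rowValues, Function<String, R> test) {
        Map<String, R> byValue = new LinkedHashMap<>();
        for (String value : rowValues) {
            if (!byValue.containsKey(value)) {
                R response = test.apply(value);
                byValue.put(value, response);
                // Logging lives in the framework, not in the test itself.
                LOG.info("[" + value + "] -> " + response);
            }
        }
        List<R> perRow = new ArrayList<>(rowValues.size());
        for (String value : rowValues) {
            perRow.add(byValue.get(value));
        }
        return perRow;
    }

    public static void main(String[] args) {
        // Invented sample values with duplicates.
        List<String> eventDates = Arrays.asList("1945-10-03", "", "1945-10-03", "1994-06-06", "");
        // Stand-in test: reports only whether the value is empty.
        Function<String, String> notEmpty = v -> v.isEmpty() ? "NOT_COMPLIANT" : "COMPLIANT";
        List<String> results = runOnUniqueValues(eventDates, notEmpty);
        for (int i = 0; i < results.size(); i++) {
            System.out.println("row " + i + ": " + results.get(i));
        }
    }
}
```

The test function itself is unchanged whether it is called row by row or through this helper, which is the sense in which the tests stay agnostic to the data flow.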

jhpoelen commented 3 months ago

@chicoreus Neat to see how a Java implementation of the DQ Tests took shape over the years. Wow!

Also, kudos for making the Java library available in Maven Central . . . so much easier for reuse.

When would be a good time for me to integrate this into Elton, Nomer, Preston or other tools I contribute to?

chicoreus commented 3 months ago

On Tue, 27 Aug 2024 03:23:43 -0700 Jorrit Poelen wrote:

When would be a good time for me to integrate this into Elton, Nomer, Preston or other tools I contribute to?

Now is a good time to start with the event_date_qc library, though we've still got work to do on the implementation guide https://github.com/tdwg/bdq/blob/master/tg2/_review/docs/implementers/index.md (questions from the process may help drive that). That will give a sense of how the tests are intended to operate. We made changes to specifications for a number of other tests last week, and the metadata in the code hasn't quite caught up yet.

jhpoelen commented 3 months ago

Thanks for your prompt reply @chicoreus !

Any chance I can convince you to put the test code and DateUtil class (and dependencies) in a separate module? That way, I could reuse DateUtil independently of all the test suite stuff.

Please let me know if this is a silly question, am still catching up on this idea from 2018 ; )