Open MathewBiddle opened 2 years ago
That's an interesting prospect! Would that involve the development of a new validation package (to be used by the GHA), or reuse of services listed in https://ioos.github.io/bio_data_guide/tools.html#validators ?
Stace suggested looking at this https://rshiny.lifewatch.be/BioCheck/
I like all those resources, however they require a full Darwin Core Archive package (including eml.xml and meta.xml) or something loaded into an IPT. That seems burdensome to me when half of the issues are with the csv files (incorrect lat/lons, aphia IDs, duplicate obs/IDs). I'd like something that can look at the csv files and give a quick check for a data manager to address before getting to the metadata or IPT loading part.
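To make the idea concrete, here is a minimal sketch of the kind of CSV-only quick check described above, using only the Python standard library. The function names are hypothetical, and the AphiaID pattern assumes `scientificNameID` holds a WoRMS LSID of the form `urn:lsid:marinespecies.org:taxname:<number>`; adjust if your files store AphiaIDs differently.

```python
import csv
import io
import re

# Assumed format for a WoRMS AphiaID LSID stored in scientificNameID.
APHIA_LSID = re.compile(r"^urn:lsid:marinespecies\.org:taxname:\d+$")


def quick_check(rows):
    """Return a list of (row_number, message) problems for occurrence rows."""
    problems = []
    seen_ids = set()
    for number, row in enumerate(rows, start=2):  # row 1 is the header
        occ_id = (row.get("occurrenceID") or "").strip()
        if not occ_id:
            problems.append((number, "missing occurrenceID"))
        elif occ_id in seen_ids:
            problems.append((number, f"duplicate occurrenceID {occ_id}"))
        seen_ids.add(occ_id)

        # Coordinates must parse as numbers and fall inside valid bounds.
        for column, limit in (("decimalLatitude", 90.0), ("decimalLongitude", 180.0)):
            raw = (row.get(column) or "").strip()
            try:
                value = float(raw)
            except ValueError:
                problems.append((number, f"{column} is not a number: {raw!r}"))
                continue
            if not -limit <= value <= limit:
                problems.append((number, f"{column} out of range: {value}"))

        # Only the format is checked here; nothing is looked up in WoRMS.
        name_id = (row.get("scientificNameID") or "").strip()
        if name_id and not APHIA_LSID.match(name_id):
            problems.append((number, f"scientificNameID is not an AphiaID LSID: {name_id}"))
    return problems


def quick_check_csv(text):
    """Run quick_check on CSV text that starts with a header row."""
    return quick_check(csv.DictReader(io.StringIO(text)))
```

Calling `quick_check_csv(open("occurrence.csv", encoding="utf-8").read())` would give a data manager a short list of row numbers and messages to fix before any metadata or IPT work starts.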
I wouldn't want to duplicate effort, so if there is an API we can use from any of those resources, that would be a great first step.
I think any work to encourage data managers is great but I do wonder if GitHub actions is the right tool for this job.
I have some code that would help here:
I think it is worth doing a pro/con on using gh actions + PR vs a hosted server + html form submission.
The GH actions approach is free but has more technical limitations. But most importantly: I think that data managers are going to be more comfortable filling out a web form vs submitting a PR.
I still have no idea how to submit a PR 🙈 Realize I'm not exactly the target audience for the Darwin Core quick check but just confirming what Tylar says that it might be a bridge too far for most data managers.
I agree that a validator that doesn't depend on PRs is best. Like the IOOS compliance checker, it would be even better if it were both a package that can be installed and run locally and something that can be deployed on the web and accepts file uploads or a URL pointing to the files.
@MathewBiddle, I also agree that those existing validators impose an additional barrier in requiring an IPT submission package. For those that don't actually require the package (zip file?) to have been previously submitted, could a package be faked by creating dummy eml.xml and meta.xml files on the fly? It'd be lovely, for example, if https://rshiny.lifewatch.be/BioCheck/ could accept such a package; the user could then just ignore errors related to the metadata files.
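As a rough illustration of faking a package on the fly, the sketch below zips a single occurrence CSV together with a generated meta.xml and a placeholder eml.xml. It is untested against BioCheck or any other validator, so whether they would accept it is an open question. The function names are hypothetical, the EML stub is deliberately not schema-valid, and every column is assumed to be a term in the `dwc` namespace, which is a simplification.

```python
import csv
import io
import zipfile
import xml.etree.ElementTree as ET

DWC_TERMS = "http://rs.tdwg.org/dwc/terms/"

# Placeholder only: real EML needs creator, contact, and other required elements.
PLACEHOLDER_EML = (
    '<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" '
    'packageId="placeholder" system="placeholder">'
    "<dataset><title>Placeholder metadata for validation only</title></dataset>"
    "</eml:eml>"
)


def build_meta_xml(header, id_column="occurrenceID"):
    """Build a Darwin Core Archive descriptor for one occurrence core file."""
    archive = ET.Element(
        "archive", {"xmlns": "http://rs.tdwg.org/dwc/text/", "metadata": "eml.xml"}
    )
    core = ET.SubElement(archive, "core", {
        "encoding": "UTF-8",
        "fieldsTerminatedBy": ",",
        "linesTerminatedBy": "\\n",
        "fieldsEnclosedBy": '"',
        "ignoreHeaderLines": "1",
        "rowType": DWC_TERMS + "Occurrence",
    })
    files = ET.SubElement(core, "files")
    ET.SubElement(files, "location").text = "occurrence.csv"
    ET.SubElement(core, "id", {"index": str(header.index(id_column))})
    for index, name in enumerate(header):
        # Simplification: every column is mapped into the dwc namespace.
        ET.SubElement(core, "field", {"index": str(index), "term": DWC_TERMS + name})
    return ET.tostring(archive, encoding="unicode")


def build_dummy_archive(csv_text):
    """Return the bytes of a zip holding the CSV plus generated metadata files."""
    header = next(csv.reader(io.StringIO(csv_text)))
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("occurrence.csv", csv_text)
        archive.writestr("meta.xml", build_meta_xml(header))
        archive.writestr("eml.xml", PLACEHOLDER_EML)
    return buffer.getvalue()
```

The returned bytes could be written to a `.zip` file and uploaded by hand to see which validators tolerate the placeholder metadata.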
A GHA initiated by a PR could be helpful if you think a semi-formal route of aligned data reviews via the GH repo adds value to everyone involved.
From @sbeaulieu: see https://github.com/EMODnet/EMODnetBiocheck for the under-the-hood code behind the LifeWatch tool.
My proposal: we set up a github repo for this that works with mybinder.org. A user would use this by:
xref: https://cioos-siooc.github.io/pyobistools/index.html
might be able to put that checker in a GH Action which runs on csv files found in data/processed/
(see example GH Action running python script).
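For reference, a script that an Action could call might look roughly like the sketch below. It scans a directory (defaulting to `data/processed`) for CSV files, prints problems using the GitHub Actions `::error` workflow command so they show up as annotations, and returns a non-zero status when anything fails. `check_file` is a hypothetical placeholder; a real setup would swap in the pyobistools or obistools checks.

```python
import csv
import sys
from pathlib import Path


def check_file(path):
    """Placeholder check: the file needs a header and at least one data row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if not header:
            return ["file is empty"]
        if next(reader, None) is None:
            return ["file has a header but no data rows"]
    return []


def run(root="data/processed"):
    """Check every CSV directly under root; return 1 if any check failed, else 0."""
    failed = False
    for path in sorted(Path(root).glob("*.csv")):
        for message in check_file(path):
            # GitHub Actions renders lines in this format as error annotations.
            print(f"::error file={path.as_posix()}::{message}")
            failed = True
    return 1 if failed else 0
```

A workflow step would then end the script with `sys.exit(run())`, so that a failing check marks the PR run as failed.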
Could also use the https://github.com/iobis/obistools R package as well...
OGSL is using the pyobistools functions to do a similar thing internally for their group. I really think it's going to be possible once we get a shiny version of it up on PyPI.
I have done some heretofore unreported work on this too. Below is an unordered summary of resources that might be helpful for this effort:
Would folks find it useful if we set up a GitHub Action that does a preliminary review of Darwin Core files? For example, a data manager sends in a PR with an event, occurrence, and/or emof file in a directory called `data/processed`. Then, the Action would pull in those files and do some initial checks (headers are valid, gives some summary statistics, other basic checks...) and do something with the results (save a summary file, throw an error...). Just trying to make the IPT managers' lives a little easier, as some of the issues I've seen could have been resolved earlier in the process with a simple checker.
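As a sketch of what "headers are valid" plus "some summary statistics" could mean in practice, the snippet below flags column names outside a known term list and reports a few basic counts. The term list is a small illustrative subset of Darwin Core, and the set of required columns is an assumption for this example, not an official rule; a real checker would load the full term list.

```python
import csv
import io
from collections import Counter

# Illustrative subset of Darwin Core term names, not the full standard.
KNOWN_TERMS = {
    "eventID", "eventDate", "occurrenceID", "basisOfRecord", "occurrenceStatus",
    "scientificName", "scientificNameID", "decimalLatitude", "decimalLongitude",
}
# Columns this sketch treats as required in an occurrence file (an assumption).
REQUIRED_TERMS = {"occurrenceID", "basisOfRecord", "scientificName", "eventDate"}


def summarize(text):
    """Return header problems and simple per-column statistics for CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    rows = list(reader)
    return {
        "unknown_headers": [name for name in header if name not in KNOWN_TERMS],
        "missing_required": sorted(REQUIRED_TERMS - set(header)),
        "row_count": len(rows),
        # Number of blank cells per column (columns with none are omitted).
        "empty_cells": dict(Counter(
            name for row in rows for name in header
            if not (row.get(name) or "").strip()
        )),
        "distinct_values": {name: len({row.get(name) for row in rows}) for name in header},
    }
```

The resulting dictionary could be written out as the summary file, or used to decide whether the Action should fail.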