ioos / bio_data_guide

Standardizing Marine Biological Data Working Group - An open community to facilitate the mobilization of biological data to OBIS.
https://ioos.github.io/bio_data_guide/
MIT License

GHA for Darwin Core quick check? #85

Open · MathewBiddle opened this issue 2 years ago

MathewBiddle commented 2 years ago

Would folks find it useful if we set up a GitHub Action that does a preliminary review of Darwin Core files? For example, a data manager sends in a PR with an event, occurrence, and/or emof file in a directory called `data/processed`. The Action would then pull in those files, run some initial checks (headers are valid, summary statistics, other basic checks...), and do something with the results (save a summary file, throw an error...).

Just trying to make the IPT managers' lives a little easier, as some of the issues I've seen could have been resolved earlier in the process with a simple checker.
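To make the idea concrete, a check like that could start very small. The sketch below is not an existing tool: the term list is a tiny illustrative subset of Darwin Core, and `data/processed` is just the directory convention proposed above.

```python
import csv
from pathlib import Path

# Small, illustrative subset of Darwin Core terms; a real check would load
# the full term list from TDWG (https://dwc.tdwg.org/terms/).
KNOWN_DWC_TERMS = {
    "eventID", "eventDate", "decimalLatitude", "decimalLongitude",
    "occurrenceID", "occurrenceStatus", "scientificName", "scientificNameID",
    "basisOfRecord", "measurementType", "measurementValue", "measurementUnit",
}

def check_headers(csv_path):
    """Return the column names that are not recognized Darwin Core terms."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f))
    return [col for col in header if col not in KNOWN_DWC_TERMS]

def check_directory(data_dir="data/processed"):
    """Check every csv in the PR's data directory; return {filename: bad columns}."""
    problems = {}
    for path in Path(data_dir).glob("*.csv"):
        unknown = check_headers(path)
        if unknown:
            problems[path.name] = unknown
    return problems
```

The Action could then write `problems` out as a summary comment on the PR, or fail the check if the dict is non-empty.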

emiliom commented 2 years ago

That's an interesting prospect! Would that involve the development of a new validation package (to be used by the GHA), or reuse of services listed in https://ioos.github.io/bio_data_guide/tools.html#validators ?

albenson-usgs commented 2 years ago

Stace suggested looking at this https://rshiny.lifewatch.be/BioCheck/

MathewBiddle commented 2 years ago

I like all those resources; however, they require a full Darwin Core Archive package (including eml.xml and meta.xml) or something loaded into an IPT. That seems burdensome to me when half of the issues are with the csv files themselves (incorrect lat/lons, AphiaIDs, duplicate observations/IDs). I'd like something that can look at the csv files and give a quick check for a data manager to address before getting to the metadata or IPT-loading part.

I wouldn't want to duplicate effort, so if there is an API we can use from any of those resources, that would be a great first step.
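For illustration, the kind of csv-level quick check described here (coordinate ranges, duplicate identifiers) might look like the sketch below; this is not any existing package, and the AphiaID check is omitted because it would need a lookup against the WoRMS web service.

```python
import csv

def quick_check(csv_path):
    """Flag out-of-range coordinates and duplicate occurrenceIDs in a DwC csv.

    Returns a list of human-readable problem strings (empty list = clean).
    """
    problems = []
    seen_ids = set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        # start=2: row 1 is the header, so the first data row is line 2
        for row_num, row in enumerate(csv.DictReader(f), start=2):
            lat = row.get("decimalLatitude")
            lon = row.get("decimalLongitude")
            try:
                if lat is not None and not -90 <= float(lat) <= 90:
                    problems.append(f"row {row_num}: decimalLatitude {lat} out of range")
                if lon is not None and not -180 <= float(lon) <= 180:
                    problems.append(f"row {row_num}: decimalLongitude {lon} out of range")
            except ValueError:
                problems.append(f"row {row_num}: non-numeric coordinate")
            occ_id = row.get("occurrenceID")
            if occ_id:
                if occ_id in seen_ids:
                    problems.append(f"row {row_num}: duplicate occurrenceID {occ_id}")
                seen_ids.add(occ_id)
    return problems
```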

7yl4r commented 2 years ago

I think any work to encourage data managers is great, but I do wonder if GitHub Actions is the right tool for this job.

I have some code that would help here:

  1. a data validation server for MBON data marinebon/mbon_data_uploader
  2. various jupyter notebooks to do basic checks on csv & DwC files

I think it is worth doing a pro/con comparison of GH Actions + PRs vs. a hosted server + HTML form submission.

The GH Actions approach is free but has more technical limitations. Most importantly, I think data managers are going to be more comfortable filling out a web form than submitting a PR.

albenson-usgs commented 2 years ago

I still have no idea how to submit a PR 🙈 I realize I'm not exactly the target audience for the Darwin Core quick check, but I'm just confirming what Tylar says: it might be a bridge too far for most data managers.

emiliom commented 2 years ago

I agree that a validator that doesn't depend on PRs is best. Even better if, like the IOOS Compliance Checker, it's both a package that can be installed and run locally and a service deployed on the web that accepts file uploads or a URL pointing at the data.

@MathewBiddle, I also agree that those existing validators impose an additional barrier in requiring an IPT submission package. For those that don't actually require the package (zip file?) to have been previously submitted, could a package be faked by creating dummy eml.xml and meta.xml files on the fly? It'd be lovely, for example, if https://rshiny.lifewatch.be/BioCheck/ could accept such a package; the user could then just ignore errors related to the metadata files.
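If faking the package turns out to be viable, it needs little more than Python's zipfile module. The eml.xml/meta.xml stubs below are placeholders purely to show the shape of the trick; a given validator may well require more EML fields than this.

```python
import zipfile

# Placeholder metadata; real validators may require additional EML fields.
DUMMY_EML = """<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" packageId="dummy" system="dummy">
  <dataset><title>Dummy dataset for validation only</title></dataset>
</eml:eml>
"""

DUMMY_META = """<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\\n"
        ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.csv</location></files>
    <id index="0"/>
  </core>
</archive>
"""

def fake_dwca(occurrence_csv, out_zip="dummy_dwca.zip"):
    """Wrap a bare occurrence csv in a minimal Darwin Core Archive zip."""
    with zipfile.ZipFile(out_zip, "w") as z:
        z.write(occurrence_csv, arcname="occurrence.csv")
        z.writestr("eml.xml", DUMMY_EML)
        z.writestr("meta.xml", DUMMY_META)
    return out_zip
```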

A GHA initiated by a PR could be helpful if you think a semi-formal route of aligned data reviews via the GH repo adds value to everyone involved.

MathewBiddle commented 2 years ago

From @sbeaulieu: see https://github.com/EMODnet/EMODnetBiocheck for the under-the-hood code behind the LifeWatch tool.

7yl4r commented 2 years ago

My proposal: we set up a GitHub repo for this that works with mybinder.org. A user would use it by:

  1. host your .csv (or DwC archive or whatever) and get the URL
  2. pass the data URL as a URL parameter into the mybinder.org link
  3. the notebook will autorun & display the report for the data at the URL passed in
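For step 2, building the launch link is just string assembly. In this sketch the repo name, notebook path, and the `data_url` query parameter are all hypothetical; the notebook itself would still need logic to read the parameter and fetch the file.

```python
from urllib.parse import quote

def binder_link(owner, repo, ref, data_url, notebook="check_dwc.ipynb"):
    """Build a mybinder.org launch URL that carries the data URL along.

    The repo/notebook names and the data_url parameter are hypothetical
    placeholders for whatever the checker repo actually uses.
    """
    base = f"https://mybinder.org/v2/gh/{owner}/{repo}/{ref}"
    urlpath = f"notebooks/{notebook}?data_url={data_url}"
    # Percent-encode the whole urlpath so the nested query string survives.
    return f"{base}?urlpath={quote(urlpath, safe='')}"
```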

MathewBiddle commented 2 years ago

xref: https://cioos-siooc.github.io/pyobistools/index.html

We might be able to put that checker in a GH Action which runs on csv files found in data/processed/ (see example GH Action running python script).

We could use the https://github.com/iobis/obistools R package as well...
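Wiring either library into an Action would then only need a thin runner whose exit code fails the workflow. In the sketch below, `check_file` is a stand-in (it only looks for an occurrenceID column); in practice it would call into pyobistools or obistools, whose APIs aren't shown here.

```python
from pathlib import Path

def check_file(path):
    """Stand-in for a real checker (e.g. pyobistools); returns error strings."""
    header = path.read_text(encoding="utf-8").splitlines()[0]
    if "occurrenceID" not in header.split(","):
        return [f"{path.name}: missing occurrenceID column"]
    return []

def run_checks(data_dir="data/processed"):
    """Check every csv in data_dir; return an exit code for the CI step."""
    errors = []
    for path in sorted(Path(data_dir).glob("*.csv")):
        errors.extend(check_file(path))
    for err in errors:
        print(err)
    return 1 if errors else 0  # a nonzero exit code fails the GH Action step
```

A workflow step would run this script and call `sys.exit(run_checks())`, so any flagged file fails the PR check.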

jdpye commented 1 year ago

OGSL is using the pyobistools functions to do a similar thing internally for their group. I really think it's going to be possible once we get a shiny version of it up on PyPI.

7yl4r commented 1 year ago

I have done some heretofore unreported work on this too. Below is an unordered summary of resources that might be helpful for this effort:

  1. my experiments with gh-action validation of DwC .csv files using frictionless data
  2. OGSL's internal pyobistools validation efforts
  3. pyobistools & iobis/obistools : sister libraries for doing OBIS QC
  4. EMODnet/EMODnetBiocheck check DwC.zip files
  5. list of tools on bio_data_guide
  6. Pieter's internal check scripts: obis-qc
  7. This publication outlines all the checks that EurOBIS does.