pangaea-data-publisher / fuji

FAIRsFAIR Research Data Object Assessment Service
MIT License

[Feature]: Assess single files of zipped data sets in reusability score #508

Closed jonaslenz closed 2 months ago

jonaslenz commented 3 months ago

Description

I was using F-UJI to assess the FAIRness of two datasets, identified by DOI:

  1. 10.1594/PANGAEA.736780 and
  2. 10.1594/PANGAEA.885492

While 1. seems to me like a well-prepared, reusable example, for which PANGAEA can access single data values, example 2. comes as a zip archive containing several CSV files and one Excel file. In 2., the table Locations.csv contains at least one typo that hinders machine readability (a minus sign in a numeric column is replaced by the character "ÿ").

Both datasets score quite comparably in F-UJI regarding reusability, so I was wondering whether it would be possible to include a reusability assessment of single files within a zip archive.

Possible Implementation

If the data is machine readable (reusable), statistical metrics could be used to test reusability. E.g. if the metadata indicated that the Latitude values of the dataset lie between x and y, this could be evaluated by the testing machine.
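To make the idea concrete, here is a minimal sketch of such a range check, written with the Python standard library only. It is not part of F-UJI; the function name `check_column_range`, the UTF-8 encoding, and the sample archive built in memory are assumptions for illustration. The bounds would have to come from the dataset's metadata.

```python
import csv
import io
import zipfile


def check_column_range(archive, member, column, lower, upper):
    """Return (row_number, raw_value) pairs that are non-numeric or out of range.

    Hypothetical helper: `archive` is a path or file-like object of a zip file,
    `member` the CSV inside it, `lower`/`upper` the bounds taken from metadata.
    """
    problems = []
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as raw:
            # Assumes UTF-8; undecodable bytes are replaced, so they still fail float().
            text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace", newline="")
            # Row 1 is the header, so data rows start at 2.
            for row_number, row in enumerate(csv.DictReader(text), start=2):
                value = row.get(column, "")
                try:
                    number = float(value)
                except (TypeError, ValueError):
                    problems.append((row_number, value))
                    continue
                if not lower <= number <= upper:
                    problems.append((row_number, value))
    return problems


# Small in-memory archive standing in for a downloaded dataset (made-up values).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Locations.csv", "Site,Latitude\nA,52.5\nB,ÿ12.3\nC,95.0\n")

print(check_column_range(buffer, "Locations.csv", "Latitude", -90.0, 90.0))
# → [(3, 'ÿ12.3'), (4, '95.0')]
```

A check like this would flag both the "ÿ" typo and any value outside the declared range, but it depends on the metadata actually stating the expected bounds.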

huberrob commented 2 months ago

F-UJI already checks zipped files for the R (reusability) assessments, but it does not look at the details of every single file. It takes a sample of each file type and checks, for example, the MIME type for R1.3-02. The reason is that some datasets contain hundreds or thousands of individual files, which would make deep content tests very expensive in terms of memory consumption and download volume. F-UJI investigates the content only superficially: for example, it tries to verify whether measurement variables expressed in the metadata are listed in the data content, e.g. as column headers. The tests you propose (lat/lon check) are very discipline-specific and beyond the scope of FAIR testing. You could, however, use pangaeapy for PANGAEA datasets and implement such specific tests with that library.
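For readers who want a feel for this kind of shallow check, the following is a rough sketch using only the Python standard library. It is not F-UJI's actual code: the function names `sample_members_by_type` and `missing_variables`, the choice of "first file per extension" as the sampling rule, and the in-memory example archive are all assumptions made for illustration.

```python
import csv
import io
import mimetypes
import os
import zipfile


def sample_members_by_type(archive):
    """Pick the first member of each file extension found in the archive."""
    samples = {}
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            extension = os.path.splitext(info.filename)[1].lower()
            # Keep only the first file seen per extension; the rest are skipped.
            samples.setdefault(extension, info.filename)
    return samples


def missing_variables(archive, member, variables):
    """Return the variable names that do not appear in the CSV header row."""
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace", newline="")
            header = next(csv.reader(text), [])
    present = {name.strip().lower() for name in header}
    return [name for name in variables if name.lower() not in present]


# Made-up archive standing in for a zipped dataset.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Locations.csv", "Site,Latitude,Longitude\nA,52.5,13.4\n")
    zf.writestr("Samples.csv", "Site,Depth\nA,1.0\n")
    zf.writestr("notes.xlsx", b"placeholder bytes, not a real workbook")

for extension, member in sample_members_by_type(buffer).items():
    # MIME type is guessed from the file name only; the result is platform dependent.
    print(extension, member, mimetypes.guess_type(member)[0])

print(missing_variables(buffer, "Locations.csv", ["Latitude", "Elevation"]))
```

Only one file per extension is opened here, which keeps the cost bounded even for archives with thousands of members. The price is that a defect confined to a single file, such as the "ÿ" typo in Locations.csv, can go unnoticed.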