pangaea-data-publisher / fuji

FAIRsFAIR Research Data Object Assessment Service
MIT License

[Feature]: Improve the reading of EML metadata #543

Open kmexter opened 1 week ago

kmexter commented 1 week ago

Detailed Description

Following on from https://github.com/pangaea-data-publisher/fuji/issues/542 (the same meeting with @huberrob and colleagues), I also tested a few metadata records that use the EML schema (XML format: https://eml.ecoinformatics.org/eml-schema#). I am not sure that FUJI is reading these entirely correctly, and this is my analysis.

https://www.eurobis.org/imis?module=dataset&dasid=848&show=eml, which is in EML 2.1.1, returns initial for F, A, I, R. Comments on the failed fields:

https://marineinfo.org/id/dataset/8357-eml-2.2.0.xml, which is in EML 2.2.0, also returns initial for F, I, R and moderate for A. My additional comments (i.e. not repeating those above):

Both records fare badly on the check for semantic resources. I am not sure if this is because it cannot find them or it does not recognise those vocabularies. So FYI these are vocabs we use often in biodiversity:
MarineRegions - https://www.marineregions.org/about.php
MarineSpecies (aka aphia) - https://www.marinespecies.org/
NCBI taxonomy - https://www.ncbi.nlm.nih.gov/taxonomy
BODC vocabularies via NNV - http://vocab.nerc.ac.uk/
Environmental Ontology - see https://www.ebi.ac.uk/ols4/
ASFA - see https://aims.fao.org/network-fisheries-ontologies

FYI EML version 2.1.1 is used by the biodiversity databases OBIS and GBIF (and their respective regional nodes), and GBIF has recently updated to EML 2.2.0. This new version has some extra features for "tagging" resources with semantics. If you don't do so already, you may want to look at this.
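To make the "tagging" point concrete, here is a minimal sketch of pulling semantic annotations out of an EML 2.2.0 record with the Python standard library. It assumes the annotations are expressed as <annotation> elements holding a <propertyURI> and a <valueURI>. The sample record, the example.org URIs and the function name extract_annotations are made up for illustration, and this is not how F-UJI itself reads EML.

```python
import xml.etree.ElementTree as ET

# Illustrative EML 2.2.0 fragment. The URIs are placeholders (assumptions),
# not taken from the records discussed in this issue.
SAMPLE_EML = """<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
         packageId="example.1.1" system="example">
  <dataset id="ds1">
    <title>Example dataset</title>
    <annotation>
      <propertyURI label="has spatial coverage">https://example.org/prop/hasSpatialCoverage</propertyURI>
      <valueURI label="North Sea">https://example.org/vocab/north-sea</valueURI>
    </annotation>
  </dataset>
</eml:eml>"""


def extract_annotations(eml_text):
    """Return (propertyURI, valueURI) pairs for every complete <annotation>."""
    root = ET.fromstring(eml_text)
    pairs = []
    # Children of the namespaced root are unqualified in this sample,
    # so plain tag names match here.
    for annotation in root.iter("annotation"):
        prop = annotation.find("propertyURI")
        value = annotation.find("valueURI")
        if prop is None or value is None or not prop.text or not value.text:
            continue  # skip incomplete annotations
        pairs.append((prop.text.strip(), value.text.strip()))
    return pairs


annotations = extract_annotations(SAMPLE_EML)
for prop_uri, value_uri in annotations:
    print(prop_uri, "->", value_uri)
```

A checker could then compare each extracted value URI against the namespaces of the vocabularies listed above, which would help tell "not found" apart from "not recognised".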

Context

Biodiversity databases use EML, so any improvements would be very useful for checking those metadata. We would be interested in any feedback - for example, if EML needs more standardisation so that its fields are better "tagged" as being of a particular type, we can pass that on to EML via its GH.

Possible Implementation

huberrob commented 1 week ago

Dear Katrina,

I think the main problem is that the 848 XML is an EML variant from GBIF and includes some metadata within GBIF-specific fields. This part starts with the XML element <additionalMetadata><metadata>, which is defined as xs:any, meaning that at this point you can add GBIF-specific XML (https://eml.ecoinformatics.org/schema/).
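As a rough illustration of why such fields are hard for a generic reader, the sketch below lists whatever sits under <additionalMetadata><metadata> in a record. The sample XML, including the gbif wrapper and its child names, is a made-up stand-in for the real record, and list_extension_fields is a hypothetical helper, not part of F-UJI.

```python
import xml.etree.ElementTree as ET

# Made-up EML 2.1.1-style record with a provider-specific extension block.
SAMPLE_EML = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
         packageId="example/v1" system="example">
  <dataset>
    <title>Example dataset</title>
  </dataset>
  <additionalMetadata>
    <metadata>
      <gbif>
        <citation>Example citation</citation>
        <hierarchyLevel>dataset</hierarchyLevel>
      </gbif>
    </metadata>
  </additionalMetadata>
</eml:eml>"""


def list_extension_fields(eml_text):
    """List 'wrapper/field' names found under additionalMetadata/metadata."""
    root = ET.fromstring(eml_text)
    found = []
    for metadata in root.findall("additionalMetadata/metadata"):
        # Anything may appear here (xs:any), so the names are discovered
        # at run time rather than looked up in the EML schema.
        for wrapper in metadata:
            for field in wrapper:
                found.append(f"{wrapper.tag}/{field.tag}")
    return found


print(list_extension_fields(SAMPLE_EML))
```

Because the content is unconstrained, a parser needs an explicit mapping for each provider's extension before it can treat, say, a citation found there as citation metadata.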

Further, I think there is an identity problem with this dataset which seems rather to be a catalog entry linking to many representations and several different identifiers at EMODNET, EUROBIS, OBIS etc. There is:

F-UJI was designed to assess datasets not catalog entries so it may not be the right tool for this record. Anyway, to answer your questions:

kmexter commented 1 week ago

Further, I think there is an identity problem with this dataset which seems rather to be a catalog entry linking to many representations and several different identifiers at EMODNET, EUROBIS, OBIS etc.

Now that is interesting. I always assumed FUJI analysed metadata descriptions of datasets, which to me is what a catalogue entry is. Is there a specification that you use to define "dataset" vs "catalogue entry" -> useful for us to know if so.

For the rest - yes, GBIF annoyingly added their own things to EML, but to be fair they needed to do that at the time EML first came out. I will look at your comments and will make recommendations to EML and GBIF and EurOBIS as to improvements via their respective issues (tho I will not hold my breath in getting any rapid action therefrom) - I have been meaning to do that for a while but keep putting it off.... For VLIZ, we have our own EML 2.2 profile that we put together and I will have a look at your comments to see if we can make some improvements there, and I can pass one super-complete catalogue record on to you when we have done that, if that is useful at all? I mean, if EML is not really a good match to FUJI, is it worth it for you?

For licence: so you look for a formal licence AND access conditions separately? Fair enough - it is up to the data provider to decide if they want to provide both or not.

huberrob commented 1 week ago

Yes, difficult to explain: a data catalog lists a catalog entry, which refers to a dataset. In your example, a data catalog entry refers to another data catalog entry, etc.

F-UJI doesn't know which 'type' of entity it is testing; it will test whatever you give it. It is up to the user to decide what is useful or not.

For example this page https://datasetsearch.research.google.com/search?docid=L2cvMTFsdjRoa3o1NQ%3D%3D looks as if it is the same as https://obis.org/dataset/066f002f-58d5-4687-bdb8-b39cdaef0c2b but is it?
And on https://obis.org/dataset/066f002f-58d5-4687-bdb8-b39cdaef0c2b the user finds a link to https://ipt.vliz.be/upload/resource?r=arms_coi_2018-20 having the same title as the other two entities. So which one is the main entity, the 'dataset'?

kmexter commented 1 week ago

hmm right. In our catalogue entries we list all the places you can find the dataset, and yes, they will probably not all be exactly the same - for example, eurobis, obis, and gbif all have the same data (as they harvest from e/o) but they do not have exactly 100% the same content in their data (the same data format but slight difference in organisation therein) and so all those data links will be in the metadata record. Useful to know - since it helps analyse the results!

My question about EML above - are you interested in an updated example or is this not in scope?

huberrob commented 1 week ago

Yes please ;) I am very interested in good EML examples

mpo-vliz commented 1 week ago

@huberrob this starts to sound like we could be fixing our EML first? (rather than this already being a clear feature / improvement for FUJI)

I mean, we kind of "know" it's about a dataset there, so how should we best make that clear to FUJI?

Also (have not checked in detail, but) the four URLs you mentioned there look like they very well could be referring to the same thing (that very dataset) -- again something we could be making clear to FUJI (some set of same-as, about, subjectOf relations to be provided?)

-- to be honest, somewhat streamlining this bulk of historic URLs that have been doubling as "(false?) identifiers" for our datasets has become a concern we would like to tackle in a nice way - so any advice, opinion, or suggestion from others is highly welcome (but, granted, not your problem)

Anyway, point is: we control what goes into those EML records to a large extent, so we can make it work and build some practical test cases for you ;-) (me trying to make this a win-win :wink: )

huberrob commented 1 week ago

Oh, you already helped to improve F-UJI ;) As far as I understood Katrina, each of these entities may be slightly different and somehow originate from a 'master' dataset. So ideally you could indicate the provenance of these datasets as you proposed, but instead of sameAs I would propose to use isBasedOn or something like this? I am not sure if this can be done in EML, though I assume you have something; but if each entity already has something like Dublin Core or schema.org, this would be a good place.
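For the schema.org route, a minimal sketch of what such a provenance statement could look like as JSON-LD, generated here with Python. The identifiers are example.org placeholders and derived_dataset_jsonld is a hypothetical helper. Whether isBasedOn is the best-fitting property for a given portal is still open, as noted above.

```python
import json


def derived_dataset_jsonld(identifier, name, source_identifier):
    """Describe a dataset copy and point to the dataset it derives from."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "@id": identifier,
        "name": name,
        # isBasedOn claims derivation, which is weaker than sameAs:
        # the copy may differ from its source.
        "isBasedOn": {"@type": "Dataset", "@id": source_identifier},
    }


# Placeholder identifiers (assumptions), not real dataset URLs.
record = derived_dataset_jsonld(
    "https://portal.example.org/dataset/123",
    "Example dataset (harvested copy)",
    "https://source.example.org/dataset/abc",
)
print(json.dumps(record, indent=2))
```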

But F-UJI would not follow these links to determine an overall FAIRness of 'that very dataset', because it is impossible to verify whether e.g. claimed sameAs links really are about the 'same' dataset.

Regarding these historic URLs, I would recommend HTTP-redirecting them to the 'master dataset'?
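A minimal sketch of such a redirect, using only the Python standard library. The paths and target URLs in LEGACY_REDIRECTS are invented placeholders, and in practice this would more likely be a rule in the web server or resolver configuration than a standalone service.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping from historic paths to the 'master dataset' URL.
LEGACY_REDIRECTS = {
    "/old/dataset/123": "https://example.org/dataset/123",
    "/imis/dataset/123": "https://example.org/dataset/123",
}


def resolve_legacy_path(path):
    """Return the redirect target for a historic path, or None if unknown."""
    return LEGACY_REDIRECTS.get(path)


class LegacyRedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # self.path includes any query string; this sketch matches it verbatim.
        target = resolve_legacy_path(self.path)
        if target is None:
            self.send_error(404, "Unknown legacy URL")
            return
        self.send_response(301)  # permanent redirect
        self.send_header("Location", target)
        self.end_headers()

    do_HEAD = do_GET  # HEAD requests get the same redirect headers


def serve(port=8000):
    """Run the redirect service locally until interrupted."""
    HTTPServer(("127.0.0.1", port), LegacyRedirectHandler).serve_forever()
```

Calling serve() starts the service on localhost. A permanent (301) status tells clients that the target is the address to keep using.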

kmexter commented 1 week ago

Hmm, it can be hard to identify the master dataset (a posteriori) because everyone harvests from each other in all directions (plus, it takes a lot of human resources to track that down), but to do that a priori is more feasible. I have my doubts that EML can handle any of this in its schema profile (Laurian, my colleague, and I already exhausted many of its possibilities) -> one would be going the GBIF route and adding one's own fields. But Laurian and I will have a look at what more one can do within EML, before end of year at least.

I also know that many of the data portals that use EML also export those metadata records (so again, not dataset-metadata, but metadata records about a single dataset) in JSON-LD.