[Feature]: Improve the reading of EML metadata

kmexter commented 1 week ago

Detailed Description

Following on from https://github.com/pangaea-data-publisher/fuji/issues/542 (the same meeting with @huberrob and colleagues), I also tested a few metadata records that use the EML schema (XML format: https://eml.ecoinformatics.org/eml-schema#). I am not sure that FUJI is reading these 100% correctly, and this is my analysis:

https://www.eurobis.org/imis?module=dataset&dasid=848&show=eml, which is in EML 2.1.1, returns initial for F, A, I, R. Comments on the failed fields:

F3-01M says it cannot find downloadable content but it is in there in ,
F4-01M it does not understand that EML is actually metadata that can be retrieved programmatically - it is xml following the eml schema,
I1-01M though I am not sure if EML, strictly speaking, is metadata represented using a formal knowledge representation language, so maybe that is correct,
I3-01M related resources are mentioned in but maybe it is looking for other types, so look to see if this fails for record 8357 (the next one) also since that does have a related paper and related datasets,
R1-01MD and R1.3-02D are right as we do not provide data-file info,
A1-01M I really am not sure what it is looking for here with access conditions that is different to the licence, but it is true we just have licence,
A1-03D hmm, maybe because it does not understand that is where to look for the data?

https://marineinfo.org/id/dataset/8357-eml-2.2.0.xml, which is in EML 2.2.0, also returns initial for F, I, R and moderate for A. My additional comments (i.e. not repeating those above):

I3-01M it did not find the related publication, which we added via (note there is a class also)
F1-02D it did not find the DOI however I think that is a failing in EML - the URL or DOI to the resource itself is included in but there is no way to identify this as being "the URL of the metadata record itself". So I think EML should improve here rather than FUJI

Both records fare badly on the check for semantic resources. I am not sure if this is because it cannot find them or because it does not recognise those vocabularies. So FYI these are vocabs we use often in biodiversity:

MarineRegions - https://www.marineregions.org/about.php
MarineSpecies (aka aphia) - https://www.marinespecies.org/
NCBI taxonomy - https://www.ncbi.nlm.nih.gov/taxonomy
BODC vocabularies via NNV - http://vocab.nerc.ac.uk/
Environmental Ontology - see https://www.ebi.ac.uk/ols4/
ASFA - see https://aims.fao.org/network-fisheries-ontologies

FYI EML version 2.1.1 is used by the biodiversity databases OBIS and GBIF (and their respective regional nodes), and GBIF has recently updated to EML 2.2.0. This new version has some extra features for "tagging" resources with semantics. If you don't do so already, you may want to look at this.
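To illustrate the kind of semantic "tagging" meant here, the following is a minimal sketch that lists the `annotation` elements (each pairing a `propertyURI` with a `valueURI`) in an EML 2.2.0 record. The sample record and its `example.org` URIs are hypothetical placeholders, not taken from record 8357, and the function name `list_annotations` is invented for illustration. This is not how F-UJI itself parses EML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EML 2.2.0 record (not schema-validated).
# The example.org URIs are placeholders for real vocabulary terms.
SAMPLE_EML = """<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
    packageId="example" system="example">
  <dataset id="ds1">
    <title>Hypothetical dataset</title>
    <annotation>
      <propertyURI label="is about">https://example.org/property/isAbout</propertyURI>
      <valueURI label="marine biome">https://example.org/vocab/marine-biome</valueURI>
    </annotation>
  </dataset>
</eml:eml>"""


def list_annotations(eml_text):
    """Return (propertyURI, valueURI) pairs for every annotation in the record."""
    root = ET.fromstring(eml_text)
    pairs = []
    # Only the EML root element is namespace-qualified; children are not,
    # so plain tag names can be used here.
    for annotation in root.iter("annotation"):
        prop = annotation.find("propertyURI")
        value = annotation.find("valueURI")
        if prop is not None and value is not None:
            pairs.append(((prop.text or "").strip(), (value.text or "").strip()))
    return pairs


if __name__ == "__main__":
    for prop, value in list_annotations(SAMPLE_EML):
        print(prop, "->", value)
```

A check like this only shows which URIs are present; whether an assessment tool recognises the vocabularies behind them is a separate question.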

Context

Biodiversity databases use EML, so any improvements would be very useful for checking those metadata. We would be interested in any feedback - for example, if EML needs more standardisation so that its fields are better "tagged" as being of a particular type, we can pass that on to EML via its GH.

Possible Implementation

huberrob commented 1 week ago

Dear Katrina,

I think the main problem is that the 848 XML is an EML variant from GBIF and includes some metadata within GBIF-specific fields. This part starts with the XML element <additionalMetadata><metadata>, which is defined as xs:any, meaning that at this point you can add GBIF-specific XML (https://eml.ecoinformatics.org/schema/)

Further, I think there is an identity problem with this dataset which seems rather to be a catalog entry linking to many representations and several different identifiers at EMODNET, EUROBIS, OBIS etc. There is:

F-UJI was designed to assess datasets, not catalog entries, so it may not be the right tool for this record. Anyway, to answer your questions:

F3-01M : In EML, dataset-specific metadata should be provided in the <dataset><distribution> element, which is not there. Instead, the record uses <additionalMetadata><metadata><gbif><physical><distribution>, which is GBIF-specific. Further, all the links lead to catalogs or data repositories, which then actually contain the data.
F1-02D The DOI should be listed in alternateIdentifier
F4-01M This test is checking if major search engines are supported so the dataset is searchable by them. The problem here is that the EML XML does not contain a link back to the dataset (catalog entry) webpage (https://www.eurobis.org/imis?module=dataset&dasid=848) which provides this information.
I1-01M F-UJI is expecting a formal knowledge representation language => RDF
I3-01M Related resources are not well covered by F-UJI's mapping, and EML is not very well suited to including related resources. F-UJI's TODO list should include a mapping of literatureCited, referencePublication, usageCitation and probably otherEntity, but I did not have good examples of how this is done in real life, such as 8357, which uses usageCitation. So thanks for this!
A1-01M is looking for information on whether the dataset is accessible or restricted somehow. But I am not sure if this is possible with EML, maybe using intellectualRights?
A1-03D data links are not found
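As a rough way to check a record against the points above, here is a minimal sketch that reports where an EML file keeps its download links and identifiers. It looks for `alternateIdentifier` and `distribution` under `dataset`, for the GBIF-specific distribution path named above, and counts `literatureCited`, `referencePublication` and `usageCitation` elements. The sample record is hypothetical, the `online/url` structure under the GBIF path is an assumption, and `summarise` is an invented helper. It does not reproduce F-UJI's own mapping.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed record mimicking a GBIF-flavoured EML 2.1.1 file
# (not schema-validated; DOI and URLs are placeholders).
SAMPLE = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
    packageId="example" system="example">
  <dataset>
    <alternateIdentifier>https://doi.org/10.0000/example</alternateIdentifier>
    <title>Hypothetical dataset</title>
    <usageCitation><title>Hypothetical paper</title></usageCitation>
  </dataset>
  <additionalMetadata>
    <metadata>
      <gbif>
        <physical>
          <distribution>
            <online><url function="download">https://example.org/data.zip</url></online>
          </distribution>
        </physical>
      </gbif>
    </metadata>
  </additionalMetadata>
</eml:eml>"""

# Assumed location of download links inside the GBIF extension.
GBIF_PATH = "additionalMetadata/metadata/gbif/physical/distribution/online/url"
RELATED_TAGS = ("literatureCited", "referencePublication", "usageCitation")


def texts(root, path):
    """Collect the non-empty text of every element matching path."""
    values = [(element.text or "").strip() for element in root.findall(path)]
    return [value for value in values if value]


def summarise(eml_text):
    """Report identifiers, download links and related-resource elements."""
    root = ET.fromstring(eml_text)
    return {
        "alternate_identifiers": texts(root, "dataset/alternateIdentifier"),
        "standard_urls": texts(root, "dataset/distribution/online/url"),
        "gbif_urls": texts(root, GBIF_PATH),
        "related": {tag: len(root.findall(f"dataset/{tag}")) for tag in RELATED_TAGS},
    }


if __name__ == "__main__":
    for key, value in summarise(SAMPLE).items():
        print(key, value)
```

For a record shaped like the sample, `standard_urls` comes back empty while `gbif_urls` does not, which is the situation described for F3-01M.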

kmexter commented 1 week ago

Further, I think there is an identity problem with this dataset which seems rather to be a catalog entry linking to many representations and several different identifiers at EMODNET, EUROBIS, OBIS etc.

Now that is interesting. I always assumed FUJI analysed metadata descriptions of datasets, which to me is what a catalogue entry is. Is there a specification that you use to define "dataset" vs "catalogue entry" -> useful for us to know if so.

For the rest - yes, GBIF annoyingly added their own things to EML, but to be fair they needed to do that at the time EML first came out. I will look at your comments and will make recommendations to EML, GBIF and EurOBIS as to improvements via their respective issues (though I will not hold my breath for any rapid action therefrom) - I have been meaning to do that for a while but keep putting it off.... For VLIZ, we have our own EML 2.2 profile that we put together, and I will have a look at your comments to see if we can make some improvements there. I can pass one super-complete catalogue record on to you when we have done that, if that is useful at all? I mean, if EML is not really a good match to FUJI, is it worth it for you?

For licence: so you look for a formal licence AND access conditions separately? Fair enough - it is up to the data provider to decide if they want to provide both or not.

huberrob commented 1 week ago

Yes, difficult to explain: a data catalog lists a catalog entry, which refers to a dataset. In your example, a data catalog entry refers to another data catalog entry, etc.

F-UJI doesn't know which 'type' of entity it is testing; it will test whatever you give it. It is up to the user to decide what is useful or not.

For example this page https://datasetsearch.research.google.com/search?docid=L2cvMTFsdjRoa3o1NQ%3D%3D looks as if it is the same as https://obis.org/dataset/066f002f-58d5-4687-bdb8-b39cdaef0c2b but is it?
And on https://obis.org/dataset/066f002f-58d5-4687-bdb8-b39cdaef0c2b the user finds a link to https://ipt.vliz.be/upload/resource?r=arms_coi_2018-20, which has the same title as the other two entities. So which one is the main entity, the 'dataset'?

kmexter commented 1 week ago

hmm right. In our catalogue entries we list all the places you can find the dataset, and yes, they will probably not all be exactly the same - for example, eurobis, obis, and gbif all have the same data (as they harvest from each other) but they do not have exactly 100% the same content in their data (the same data format but slight differences in organisation therein), and so all those data links will be in the metadata record. Useful to know - since it helps analyse the results!

My question about EML above - are you interested in an updated example or is this not in scope?

huberrob commented 1 week ago

Yes please ;) I am very interested in good EML examples

mpo-vliz commented 1 week ago

@huberrob this starts to sound like we could be fixing our eml first? (rather than this already being a clear feature / improvement for FUJI)

I mean, we kind of "know" it's about a dataset there, so how should we best make that clear to FUJI?

Also (have not checked in detail, but) the four URLs you mentioned there look like they very well could be referring to the same thing (that very dataset) -- again something we could be making clear to FUJI (some set of same-as, about, subjectOf relations to be provided?)

-- to be honest, somewhat streamlining this bulk of historic URLs that have been doubling as "(false?) identifiers" for our datasets has become a concern we would like to tackle in a nice way - so any advice, opinion, suggestion from others is highly welcome (but, granted, not your problem)

Anyway, point is: we control what goes into those eml to a large extent, so we can make it work and build some practical testcases for you ;-) (me trying to make this a win-win :wink: )

huberrob commented 1 week ago

Oh, you already helped to improve F-UJI ;) As far as I understood Katrina, each of these entities may be slightly different and somehow originates from a 'master' dataset. So ideally you could indicate the provenance of these datasets as you proposed, but instead of sameAs I would propose to use isBasedOn or something like this? I am not sure if this can be done in EML, but I assume you have something; and if each entity already has something like Dublin Core or schema.org metadata, that would be a good place.

But F-UJI would not follow these links to determine an overall FAIRness of 'that very dataset', because it is impossible to verify whether e.g. claimed sameAs links really are about the 'same' dataset.

Regarding these historic URLs, I would recommend to HTTP redirect them to the 'master dataset'?
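As one possible reading of the isBasedOn suggestion, here is a minimal sketch that builds a schema.org `Dataset` description in JSON-LD for a derived copy and points it at its 'master' dataset. All URLs and names are hypothetical placeholders, and `dataset_jsonld` is an invented helper; whether F-UJI evaluates `isBasedOn` as a related resource is not confirmed in this thread.

```python
import json


def dataset_jsonld(url, name, based_on=None):
    """Build a schema.org Dataset description, optionally naming its source."""
    doc = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "@id": url,
        "url": url,
        "name": name,
    }
    if based_on:
        # isBasedOn states provenance ("derived from") rather than identity,
        # which is the distinction drawn above against sameAs.
        doc["isBasedOn"] = {"@type": "Dataset", "@id": based_on}
    return doc


if __name__ == "__main__":
    copy = dataset_jsonld(
        "https://example.org/dataset/harvested-copy",  # placeholder URL
        "Hypothetical harvested copy",
        based_on="https://example.org/dataset/master",  # placeholder URL
    )
    print(json.dumps(copy, indent=2))
```

The printed JSON could be embedded in a landing page inside a `<script type="application/ld+json">` element, which is the usual way schema.org metadata is exposed to harvesters.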

kmexter commented 1 week ago

Hmm, it can be hard to identify the master dataset (a posteriori) because everyone harvests from each other in all directions (plus, it takes a lot of human resources to track that down), but to do that a priori is more feasible. I have my doubts that EML can handle any of this in its schema profile (Laurian, my colleague, and I already exhausted many of its possibilities) -> one would be going the GBIF route and adding one's own fields. But Laurian and I will have a look at what more one can do within EML, before end of year at least.

I also know that many of the data portals that use EML also export those metadata records (so again, not dataset-metadata, but metadata records about a single dataset) in JSON-LD.
