pangaea-data-publisher / fuji

FAIRsFAIR Research Data Object Assessment Service
MIT License

Comparison of Assessment Tools #69

Closed kitchenprinzessin3880 closed 3 years ago

kitchenprinzessin3880 commented 4 years ago

@ajaunsen to provide the list of identifiers with large differences in terms of scores generated by the tool.

ajaunsen commented 4 years ago

For one, I have observed that none of the three figshare-based repositories scores well in F-UJI. In fact, they do not score on any metric (apart from metric 1, FsF-F1-01D, obviously) when evaluated using the URL-based reference to the dataset (i.e. avoiding DataCite-provided metadata).

This is in sharp contrast to the FDS (M. Wilkinson) tool, in which these datasets scored 16 out of 22 indicators in July. However, I have checked their scoring again and find that they now score 8 out of 22. Something seems to have changed in figshare's service, and the difference is not as dramatic as it first seemed when comparing with the July evaluations.

The following datasets remain with large discrepancies:

This scores 13 out of 22 indicators in the FDS tool, while only metric 1 (FsF-F1-01D) passes in F-UJI. The evaluation is visible here: https://fair-latest.etais.ee/evaluation/6732

Similarly, 4 more that score 10 out of 22 indicators in the FDS tool, while passing only metric 1 (FsF-F1-01D) in F-UJI:

kitchenprinzessin3880 commented 4 years ago

The following datasets remain with large discrepancies;

(Dataset 39) https://www.proteinatlas.org/ENSG00000110651-CD81/cell

@ajaunsen This was discussed before in https://github.com/pangaea-data-publisher/fuji/issues/62. F-UJI accepts the resource types recommended by Google Dataset Search (e.g., Dataset, Collection). DataRecord was introduced by Bioschemas and is still in draft. Testing the same identifier with https://search.google.com/structured-data/testing-tool produced an exception, see below:

image

kitchenprinzessin3880 commented 4 years ago

(Dataset 114) https://metabolicatlas.org/explore/gem-browser/human1/reaction/GLYLYSCYSt Schema.org @type = Website, therefore F-UJI ignores it.

(Dataset 55) https://www.smhi.se/data/utforskaren-oppna-data/vattendrag-svar2012-datamangd Schema.org @type = BreadcrumbList, therefore F-UJI ignores it.

huberrob commented 4 years ago

On the other hand, we do not treat identification of 'Resource Type' in DataCite metadata or Dublin Core in the same (strict) way. I think we should harmonize this and define a list of 'scientific resource types' and mappings for DC, schema.org and DataCite, which we then use to raise a warning (1st step) after metadata collection.

kitchenprinzessin3880 commented 4 years ago

@huberrob - the issue is not only about the resource types supported but also their associated metadata properties. DataCite and DC do not have different sets of metadata elements for different resource types (there is only one schema for all types), whereas in schema.org the properties may vary according to the sub-types of CreativeWork. A mapping between these schemes has been done by an RDA group; I will find the link.

huberrob commented 4 years ago

Hi @kitchenprinzessin3880 ,

Yes, that's true. On the other hand, our mapping (query) would only find the metadata properties that are contained in the mapping. If they are in the right position and have the right name, we should take them, but then of course we should raise a warning. We should do the same for DC, DCAT, DataCite etc. metadata records, which also can (and should) contain a resource type indication. If these records do not indicate a resource type, or indicate the wrong one, we should also raise a warning. I can create a ticket on this if you agree.

kitchenprinzessin3880 commented 4 years ago

ACTION 1 - In my view, the best way forward is to validate the types using the schema.org context (https://schema.org/docs/jsonldcontext.json) and output a warning in the debug message if the type specified is not Dataset/Collection or does not exist in standard schema.org. I raised the need for metadata validation some time ago in https://github.com/pangaea-data-publisher/fuji/issues/45.
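
The check proposed in ACTION 1 could look roughly like the sketch below. This is not F-UJI's implementation: the function name `check_schema_org_type`, the `ACCEPTED_TYPES` set and the `known_types` parameter are hypothetical, and `known_types` stands for a set of type names taken from the schema.org vocabulary by whatever means.

```python
from typing import Iterable, List

# Assumption: the types treated as data objects, following the comment above.
ACCEPTED_TYPES = {"Dataset", "Collection"}


def _local_name(type_value: str) -> str:
    """Strip a schema.org namespace or prefix, e.g. 'https://schema.org/Dataset' -> 'Dataset'."""
    for prefix in ("https://schema.org/", "http://schema.org/", "schema:"):
        if type_value.startswith(prefix):
            return type_value[len(prefix):]
    return type_value


def check_schema_org_type(record: dict, known_types: Iterable[str]) -> List[str]:
    """Return debug warnings for the @type of a JSON-LD record (empty list = no warning)."""
    raw = record.get("@type")
    if raw is None:
        return ["no @type given"]
    # @type may be a single string or a list of strings
    types = [raw] if isinstance(raw, str) else list(raw)
    names = [_local_name(t) for t in types]
    known = set(known_types)
    warnings = []
    for name in names:
        if name not in known:
            warnings.append(f"type '{name}' does not exist in schema.org")
    if not any(name in ACCEPTED_TYPES for name in names):
        warnings.append(f"type {names} is not one of {sorted(ACCEPTED_TYPES)}")
    return warnings
```

Returning warnings instead of raising keeps the metadata usable while still flagging the type in the debug output, which is the behaviour discussed above.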

kitchenprinzessin3880 commented 4 years ago

(Dataset 136-1) https://data.dtu.dk/articles/Data_for_the_paper_A_dual_reporter_system_for_investigating_and_optimizing_translation_and_folding_in_E_coli_/10265420

The Content-Type of the response (data page) is NULL. According to https://tools.ietf.org/html/rfc7231#section-3.1.1.5:
A sender that generates a message containing a payload body SHOULD generate a Content-Type header field in that message unless the intended media type of the enclosed representation is unknown to the sender. If a Content-Type header field is not present, the recipient MAY either assume a media type of "application/octet-stream" ([RFC2046], Section 4.5.1) or examine the data to determine its type.

ACTION 2 - I suggest redesigning the request_helper.py class to use Apache Tika to infer the content type; see the example snippet below:

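from tika import parser

# from_file accepts a local path or a URL
parsed_file = parser.from_file(self.request_url)
status = parsed_file.get("status")
# "metadata" may be missing or None, so fall back to an empty dict
tika_content_types = (parsed_file.get("metadata") or {}).get("Content-Type")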
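
For comparison, the fallback that the RFC passage permits can be sketched without Tika, using only the standard library. This is a simplified stand-in, not the Tika-based approach suggested above: the function name `resolve_content_type` and the small signature table are assumptions made for illustration.

```python
from typing import Optional

# Illustrative, deliberately tiny signature table (an assumption of this sketch);
# a real detector such as Apache Tika knows far more formats.
MAGIC_BYTES = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
]


def resolve_content_type(header_value: Optional[str], body: bytes) -> str:
    """Pick a media type for a response, following RFC 7231 section 3.1.1.5."""
    if header_value and header_value.strip():
        # Drop parameters such as '; charset=utf-8' and normalise case
        return header_value.split(";", 1)[0].strip().lower()
    # No usable header: examine the payload
    for signature, media_type in MAGIC_BYTES:
        if body.startswith(signature):
            return media_type
    head = body[:512].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "text/html"
    # Otherwise assume the default the RFC permits
    return "application/octet-stream"
```

With a missing header, `resolve_content_type(None, body)` sniffs the first bytes and only then falls back to application/octet-stream.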

kitchenprinzessin3880 commented 4 years ago

(Dataset 85) http://tun.fi/JX.1099681

F-UJI can identify RDFa included on the landing page, but the information included is not related to the metadata of the dataset, so the metric fails; see below:

'rdfa': [{'@id': '_:N2653f5de968b4c9da2cc3be78b685afc', 'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#alert'}]}, {'@id': '_:N37d7ac4315524da18ece684377ec22db', 'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#dialog'}]}]}

It seems that Mark's tool assumes this test is a 'pass' as long as RDFa is included in the data page.

ACTION 3 - Elaborate the exception message when processing rdfa metadata, e.g., 'rdfa found but not related to metadata of a data object'
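
One way to separate "RDFa is present" from "RDFa describes a data object" is sketched below. It takes nodes in the expanded form shown in the dump above. The function names and the choice of namespaces to ignore are assumptions of this sketch, not F-UJI's actual logic.

```python
from typing import Dict, List

# Assumption: predicates in this namespace describe page markup (role attributes),
# not the data object, so they are ignored when looking for metadata.
NON_METADATA_NAMESPACES = ("http://www.w3.org/1999/xhtml/vocab#",)


def metadata_predicates(nodes: List[Dict]) -> List[str]:
    """Collect predicates from expanded JSON-LD nodes that could carry metadata."""
    found = []
    for node in nodes:
        for key in node:
            if key.startswith("@"):
                continue  # JSON-LD keywords such as @id and @type
            if key.startswith(NON_METADATA_NAMESPACES):
                continue
            found.append(key)
    return found


def rdfa_message(nodes: List[Dict]) -> str:
    """Return a debug message describing what the RDFa parse produced."""
    if not nodes:
        return "no rdfa found"
    if not metadata_predicates(nodes):
        return "rdfa found but not related to metadata of a data object"
    return "rdfa metadata found"
```

The filter works on namespaces only, so page fragments annotated with general-purpose vocabularies would still count as metadata and need a stricter check.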

kitchenprinzessin3880 commented 4 years ago

(Dataset 84) https://ortus.rtu.lv/science/en/datamodule/243

Metadata document is included using link relation type: <link rel="alternate" type="application/ld+json" href="http://scidata.vitk.lv/dataset/243.jsonld"/>

ACTION 4 - Extend the class 'MetaDataCollectorRdf' to handle the JSON-LD serialization of RDF-based metadata, using the rdflib-jsonld package (pip install rdflib-jsonld).
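
The first half of this action, finding the linked JSON-LD document on the landing page, can be sketched with the standard library alone. The class and function names are hypothetical, and parsing the linked document into an RDF graph is left out.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class JsonLdLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate" type="application/ld+json"> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        media_type = attr.get("type", "").lower()
        if "alternate" in rels and media_type == "application/ld+json" and attr.get("href"):
            # Resolve relative hrefs against the landing page URL
            self.links.append(urljoin(self.base_url, attr["href"]))


def find_jsonld_links(html: str, base_url: str) -> List[str]:
    """Return absolute URLs of JSON-LD documents advertised by the page."""
    finder = JsonLdLinkFinder(base_url)
    finder.feed(html)
    return finder.links
```

Each returned URL could then be fetched and handed to an RDF parser that understands JSON-LD.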

kitchenprinzessin3880 commented 4 years ago

(Dataset 92) http://fel.hi.is/ISKOS1983

RDFa is detected by F-UJI, but the information included does not seem to be about the data (see figure and code snippet below).

ACTION 5 - Check the RDFa parsing in fair_check.py.

Capture

[{'@id': '/fotur_2', '@type': ['http://rdfs.org/sioc/ns#Item', 'http://xmlns.com/foaf/0.1/Document'], 'content:encoded': [{'@value': 'Hafðu samband\n\n\t\tSími: 525 4545\n\n\t\tNetfang: felagsvisindastofnun@hi.is\n'}], 'http://purl.org/dc/terms/title': [{'@value': 'Fótur 2'}]}, {'@id': '/fotur_4', '@type': ['http://rdfs.org/sioc/ns#Item', 'http://xmlns.com/foaf/0.1/Document'], 'content:encoded': [{'@value': 'Samfélagsmiðlar\nFélagsvísindastofnun á Facebook\nHáskóli Íslands á Twitter\nRSS veita HÍ\n'}], 'http://purl.org/dc/terms/title': [{'@value': 'Fótur 4'}]}, {'@id': '/fotur_3', '@type': ['http://rdfs.org/sioc/ns#Item', 'http://xmlns.com/foaf/0.1/Document'], 'content:encoded': [{'@value': 'Opnunartímar bygginga\nAðalbygging 07:30-17:00\nHáskólatorg 07:30-22:00\nAllir opnunartímar\n'}], 'http://purl.org/dc/terms/title': [{'@value': 'Fótur 3'}]}, {'@id': '/fotur_1', '@type': ['http://xmlns.com/foaf/0.1/Document', 'http://rdfs.org/sioc/ns#Item'], 'content:encoded': [{'@value': 'Félagsvísindastofnun \xa0\n\n\t\tGimli - Sæmundargötu 10\xa0\n\n\t\t102 Reykjavík\n\n\t\tKt. 600169-2039\n\n\t\tHér erum við\n\n\t\tOpið: 9:00-16:00\n'}], 'http://purl.org/dc/terms/title': [{'@value': 'Fótur 1'}]}, {'@id': '/ISKOS1983', '@type': ['http://xmlns.com/foaf/0.1/Document'], 'content:encoded': [{'@value': 'Gagnaskrá og tilheyrandi skjöl eru að finna hér að neðan. 
Að auki er hlekkur á gagnvirka greiningu á netinu, í NESSTAR WebView, þar sem auðvelt er að skoða lýsandi tölfræði o.fl.\n\n\t\t\t\t\xa0\n\n\t\t\t\t\xa0\n\n\t\t\t\tDOI númer\n\n\t\t\t\t10.34881/1.00001\n\n\t\t\t\tÚtgáfa gagnaskrár\n\n\t\t\t\t3.0.0\n\n\t\t\t\tHöfundur/höfundar\n\n\n\t\t\t\t\t\tHarðarson, Ólafur Þórður (Stjórnmálafræðideild, Háskóli Íslands)\n\n\t\t\t\t\t\tFélagsvísindastofnun, Háskóli Íslands\n\n\n\t\t\t\tÚtgáfudagur\n\n\t\t\t\t2013-09-10\n\n\t\t\t\tUmsjón gagnasöfnunar\n\n\t\t\t\tFélagsvísindastofnun, Háskóli Íslands\xa0\n\n\t\t\t\tFjármögnun\n\n\n\t\t\t\t\t\tRannsóknasjóður Íslands (Icelandic Research Fund; RANNÍS)\n\n\t\t\t\t\t\tRannsóknasjóður Háskóla Íslands (University of Iceland Research Fund)\n\n\t\t\t\t\t\tÖryggismálanefnd (Icelandic Commission on Security and International Affairs)\n\n\t\t\t\t\t\tNordic cooperation committee for research on international relations (NORDSAM)\n\n\n\t\t\t\tLýsing\n\n\t\t\t\tÍslenska kosningarannsóknin er viðamikil rannsókn þar sem lagðar eru fyrir spurningar um kosninga- og stjórnmálahegðun íslenskra kjósenda. Meðal rannsóknarefna eru til dæmis kosningahegðun, afstaða kjósenda til stjórnmálaflokka, afstaða til lýðræðis, hvað kjósendur telja vera mikilvægustu verkefnin á vettvangi stjórnmála, þátttöku þeirra í prófkjörum og margvísleg önnur málefni á vettvangi stjórnmála. Íslenska kosningarannsóknin er hluti af Nordic Electoral Democracy (NED) sem er norrænt samstarf um lýðræði og kosningar; Comparative Studies of Electoral Systems (CSES) og True European Voter (TEV) sem eru hvoru tveggja alþjóðalegt samstarf um kosningarannsóknir.\n\n\t\t\t\tTímabil gagnasöfnunar\n\n\t\t\t\t1983-05-13 / 1983-09-21\n\n\t\t\t\tLandssvæði\n\n\t\t\t\tIceland (IS)\n\n\t\t\t\tAðferð við úrtaksgerð\n\n\t\t\t\tÚr þjóðskrá var dregið einfalt líkindaúrtak einstaklinga á aldrinum 18 til 80 ára. Stærð úrtaks var 1.400 manns. 
Brúttósvarhlutfall var 71,6% og nettósvarhlutfall var 79,1%.\n\n\t\t\t\tRannsóknarsnið\n\n\t\t\t\tLongitudinal: Trend/Repeated cross-section\n\n\t\t\t\tForm gagnaöflunar\n\n\n\t\t\t\t\t\tSímakönnun (í kjölfar kosninga).\n\n\t\t\t\t\t\tViðtalskönnun (í kjölfar kosninga).\n\n\t\t\t\t\t\tPóstkönnun (í kjölfar kosninga).\n\n\n\t\t\t\tUpplýsingar um gagnaskrá\n\n\n\t\t\t\t\t\tUnit Type: Individual\n\n\t\t\t\t\t\tNumber of Units: 1003\n\n\t\t\t\t\t\tNumber of Variables: 98\n\n\t\t\t\t\t\tType of Data: Survey data\n\n\t\t\t\t\t\tFile Name: icenes_1983_opin_adgangur_islenska_3utg.sav\n\n\t\t\t\t\t\tFile Format: SPSS (Icelandic)\n\n\t\t\t\t\t\tFile Size: 183 KB\n\n\n\t\t\t\tAthugasemdir\n\n\t\t\t\tGagnaskrá er til á íslensku og ensku.\n\n\t\t\t\tAðgangur\n\n\t\t\t\tOpinn aðgangur\n\n\t\t\t\tAfnotaleyfi\n\n\t\t\t\tCC BY-NC 4.0\n\n\t\n\tGagnaskrá og skjöl:\nGagnaskrá (SPSS, íslenska)\nUpplýsingar um úrtak og framkvæmd\nSpurningalisti\nKóðunarbók\n Athugið að styðjast við kóðunarbókina þegar unnið er með gagnaskrána.\n\n\tGagnvirk greining á netinu:\nÍslenska kosningarannsóknin 1983 í NESSTAR WebView\n\n<!--/--><![CDATA[/ ><!--/\n\n.table-stribed td{\n border-bottom: 1px solid #AAAAAA !important;\n}\ntable.table.table-striped tr td {\n vertical-align: top;\n}\ntable.table.table-striped {\n border: none;\n}\n.table-striped tr {\n border-bottom: 1px solid #dddddd;\n}\ntable.table.table-striped tr:nth-child(even) {\n background-color: #f9f9f9;\n}\t\n/--><!]]>/\n\n<![CDATA[/ ><!--/\n\n.table-stribed td{\n border-bottom: 1px solid #AAAAAA !important;\n}\ntable.table.table-striped tr td {\n vertical-align: top;\n}\ntable.table.table-striped {\n border: none;\n}\n.table-striped tr {\n border-bottom: 1px solid #dddddd;\n}\ntable.table.table-striped tr:nth-child(even) {\n background-color: #f9f9f9;\n}\ntable.table.table-striped {\n font-size: 12px;\n}\t\n/--><!]]>/\n\n'}]}]

Google test (rich results test)

image

ajaunsen commented 3 years ago

It is not clear (to me) how I can trace the ACTION items suggested in this ticket in response to the reported discrepancies.

kitchenprinzessin3880 commented 3 years ago

@ajaunsen I created a list of actions for @huberrob before leaving the project. @huberrob, can you please create an issue for each of the action items above?

huberrob commented 3 years ago

OK, I have created some issues; see the list.

huberrob commented 3 years ago

ACTION 1: https://github.com/pangaea-data-publisher/fuji/issues/84
ACTION 2: https://github.com/pangaea-data-publisher/fuji/issues/83
ACTION 3: https://github.com/pangaea-data-publisher/fuji/issues/82
ACTION 4: https://github.com/pangaea-data-publisher/fuji/issues/81
ACTION 5: https://github.com/pangaea-data-publisher/fuji/issues/80

ajaunsen commented 3 years ago

Yes, I have seen them, thx. Hopefully they will be addressed in the near future :)

ajaunsen commented 3 years ago

Regarding this issue, I agree there are several examples of unrelated metadata passing as structured metadata. But how does F-UJI determine this, i.e. the relevance of the metadata to the dataset, in such cases?

(Dataset 85) http://tun.fi/JX.1099681

F-UJI can identify RDFa included on the landing page, but the information included is not related to the metadata of the dataset, so the metric fails; see below:

'rdfa': [{'@id': '_:N2653f5de968b4c9da2cc3be78b685afc', 'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#alert'}]}, {'@id': '_:N37d7ac4315524da18ece684377ec22db', 'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#dialog'}]}]}

It seems that Mark's tool assumes this test is a 'pass' as long as RDFa is included in the data page.

ACTION 3 - Elaborate the exception message when processing rdfa metadata, e.g., 'rdfa found but not related to metadata of a data object'

huberrob commented 3 years ago

http://tun.fi/JX.1099681 is interesting and nicely shows some pitfalls for shallow testing:

Probably Mark's tool (or whatever library it uses in the background) identifies the document as XHTML, since it makes use of the 'role' attribute in connection with valid XHTML vocabulary terms (here: alert).

  1. https://www.w3.org/TR/2010/NOTE-xhtml-role-20101216/

Therefore, identifying RDFa in this example is not completely incorrect, since the RDFa validator https://www.w3.org/2012/pyRdfa/ also validates it and gives the following output:

@prefix ns1: <http://www.w3.org/1999/xhtml/vocab#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[] ns1:role ns1:alert .

But of course there is no parsable content in terms of metadata for FAIR testing; it just contains this role="alert" in one of the HTML elements. It is in effect an empty RDFa ;)

huberrob commented 3 years ago

I forgot to paste the relevant section in the HTML source of the landing page:

<noscript>
 <div class="alert alert-warning" role="alert" style="margin-bottom: 0">
<p>Suomen Lajitietokeskuksen nettisivu tarvitsee javascriptin toimiakseen. Tarkista selaimesi asetukset.</p>
<p>To use Finnish Biodiversity Information Facilities website, please enable JavaScript.</p>
</div>
 </noscript>

huberrob commented 3 years ago

Since all actions have been resolved or moved to other issues, I am closing this issue now.