Closed kitchenprinzessin3880 closed 3 years ago
For one, I have observed that all three figshare-based repositories do not score well in F-UJI. In fact, they do not score on any metric (apart from metric 1, FsF-F1-01D) when evaluated using the URL-based reference to the dataset (i.e., avoiding DataCite-provided metadata).
[ ] (Dataset 136-1) https://data.dtu.dk/articles/Data_for_the_paper_A_dual_reporter_system_for_investigating_and_optimizing_translation_and_folding_in_E_coli_/10265420 --> ACTION 2
[ ] (Dataset 68) https://usn.figshare.com/articles/Data_Journal_EvolutionaryApplications_RData/7770785 --> ACTION 2
[ ] (Dataset-26) https://su.figshare.com/articles/Data_for_Does_historical_land_use_affect_the_regional_distribution_of_fleshy-fruited_woody_plants_Arnell_et_al_2019_/10318046 --> ACTION 2
This is in sharp contrast to the FDS (M. Wilkinson) tool, in which these datasets scored 16 out of 22 indicators in July. However, I have checked their scoring again and they now score 8 out of 22. Something seems to have changed in figshare's service, and the difference is not as dramatic as it first seemed compared to the July evaluations.
The following datasets remain with large discrepancies:
This scores 13 out of 22 indicators in the FDS tool, while only metric 1 (FsF-F1-01D) passes in F-UJI. The evaluation is visible here: https://fair-latest.etais.ee/evaluation/6732
Similarly, four more score 10 out of 22 indicators in the FDS tool, while passing only metric 1 (FsF-F1-01D) in F-UJI:
[ ] (Dataset 85) http://tun.fi/JX.1099681 --> ACTION 3
[ ] (Dataset 84) https://ortus.rtu.lv/science/en/datamodule/243 --> ACTION 4
[ ] (Dataset 55) https://www.smhi.se/data/utforskaren-oppna-data/vattendrag-svar2012-datamangd --> ACTION 1
[ ] (Dataset 114) https://metabolicatlas.org/explore/gem-browser/human1/reaction/GLYLYSCYSt --> ACTION 1
[ ] (Dataset 136-2) https://data.dtu.dk/articles/E_coli_chemical_tolerance_raw_resequencing_reads/10289417 --> ACTION 2
[ ] (Dataset 92) http://fel.hi.is/ISKOS1983 --> ACTION 5
The following datasets remain with large discrepancies: (Dataset 39) https://www.proteinatlas.org/ENSG00000110651-CD81/cell

@ajaunsen This was discussed before in https://github.com/pangaea-data-publisher/fuji/issues/62. F-UJI accepts resource types recommended by Google Dataset Search (e.g., Dataset, Collection). DataRecord was introduced by Bioschemas and is still a draft. Testing the same identifier using https://search.google.com/structured-data/testing-tool produced an exception, see below:
(Dataset 114) https://metabolicatlas.org/explore/gem-browser/human1/reaction/GLYLYSCYSt Schema.org @type = Website, therefore F-UJI ignores it.
(Dataset 55) https://www.smhi.se/data/utforskaren-oppna-data/vattendrag-svar2012-datamangd Schema.org @type = BreadcrumbList, therefore F-UJI ignores it.
On the other hand, we do not treat identification of 'Resource Type' in DataCite metadata or Dublin Core in the same (strict) way. I think we should harmonize this and define a list of 'scientific resource types' and mappings for DC, schema.org and DataCite, which we then use to raise a warning (first step) after metadata collection.
@huberrob - the issue is not only about the resource types supported, but also about their associated metadata properties. DataCite and DC do not have different sets of metadata elements for different resource types (there is only one schema for all types), whereas in schema.org the properties may vary according to the sub-types of CreativeWork. Mapping between these schemas has been done by an RDA group; I will find out the link.
Hi @kitchenprinzessin3880,
Yes, that's true. On the other hand, our mapping (query) would only find the correct metadata properties that are contained in the mapping. If they are in the right position and have the right name, we should take them, but of course we should then raise a warning. We should do the same for DC, DCAT, DataCite, etc. metadata records, which also can (and should) contain a resource type indication. If these records do not indicate a resource type, or indicate a wrong one, we should also raise a warning. I can create a ticket on this if you agree.
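As a sketch of this harmonization, a warning could be raised right after metadata collection when a record lacks a resource type or carries an unexpected one. The accepted-type list and property names below are assumptions for illustration, not F-UJI's actual configuration:

```python
# Hypothetical sketch of a harmonized resource-type check across metadata
# schemas. Accepted types and property names are assumptions, not F-UJI's.
ACCEPTED_RESOURCE_TYPES = {"dataset", "collection"}

# Where each schema is expected to indicate the resource type
RESOURCE_TYPE_PROPERTY = {
    "schemaorg": "@type",
    "dublin_core": "dc:type",
    "datacite": "resourceTypeGeneral",
}

def check_resource_type(schema, record):
    """Return a list of warning strings for a collected metadata record."""
    warnings = []
    prop = RESOURCE_TYPE_PROPERTY.get(schema)
    value = record.get(prop) if prop else None
    if value is None:
        warnings.append("%s: no resource type indicated (%s)" % (schema, prop))
    elif str(value).lower() not in ACCEPTED_RESOURCE_TYPES:
        warnings.append("%s: unexpected resource type '%s'" % (schema, value))
    return warnings
```
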
ACTION 1 - From my view, the best way forward is to validate the types using the schema.org context (https://schema.org/docs/jsonldcontext.json) and output a warning in the debug message if the specified type is not Dataset/Collection or does not exist in standard schema.org. In https://github.com/pangaea-data-publisher/fuji/issues/45 I addressed the need for metadata validation some time ago.
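A hedged sketch of this validation, assuming the context file is fetched once and queried for term names (both helper names are hypothetical):

```python
import json
from urllib.request import urlopen

SCHEMA_ORG_CONTEXT = "https://schema.org/docs/jsonldcontext.json"

def load_schemaorg_terms(url=SCHEMA_ORG_CONTEXT):
    """Fetch the schema.org JSON-LD context and collect its term names."""
    with urlopen(url) as response:
        context = json.load(response)["@context"]
    return {term for term in context if not term.startswith("@")}

def validate_type(type_value, known_terms):
    """Return a debug warning string, or None if @type is acceptable."""
    if type_value not in known_terms:
        return "@type '%s' does not exist in standard schema.org" % type_value
    if type_value not in ("Dataset", "Collection"):
        return "@type '%s' is not Dataset/Collection" % type_value
    return None
```

This would flag both the Bioschemas DataRecord case (unknown term) and the WebSite/BreadcrumbList cases (known term, but not a dataset-like type).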
The content-type of the response (data page) is null. According to https://tools.ietf.org/html/rfc7231#section-3.1.1.5:
A sender that generates a message containing a payload body SHOULD generate a Content-Type header field in that message unless the intended media type of the enclosed representation is unknown to the sender. If a Content-Type header field is not present, the recipient MAY either assume a media type of "application/octet-stream" ([RFC2046], Section 4.5.1) or examine the data to determine its type.
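A minimal sketch of that recipient fallback (the function name is illustrative; a real implementation should also use a case-insensitive header mapping):

```python
OCTET_STREAM = "application/octet-stream"

def effective_content_type(headers):
    """Apply the RFC 7231 Section 3.1.1.5 fallback to a response header map.

    If the sender omitted Content-Type, assume application/octet-stream;
    alternatively, the recipient may examine the payload to determine its type.
    """
    content_type = headers.get("Content-Type")
    return content_type if content_type else OCTET_STREAM
```
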
ACTION 2 - Suggest redesigning the request_helper.py class to use Apache Tika to infer the content type; see the example snippet below:
from tika import parser

# Parse the requested URL with Apache Tika and read the inferred content type
parsed_file = parser.from_file(self.request_url)
status = parsed_file.get("status")
tika_content_type = parsed_file.get("metadata", {}).get("Content-Type")
(Dataset 85) http://tun.fi/JX.1099681
F-UJI can identify RDFa included on the landing page, but the information included is not related to the metadata of the dataset; therefore the metric fails, see below:
'rdfa': [{'@id': '_:N2653f5de968b4c9da2cc3be78b685afc', 'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#alert'}]}, {'@id': '_:N37d7ac4315524da18ece684377ec22db', 'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#dialog'}]}]}
It seems like Mark's tool assumes this test passes as long as RDFa is included in the data page.
ACTION 3 - Elaborate the exception message when processing RDFa metadata, e.g., 'RDFa found, but not related to metadata of a data object'.
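A hedged sketch of such a check: inspect the predicates of the parsed RDFa entries and emit the explicit warning when none of them belongs to a metadata vocabulary. The vocabulary list and function names are assumptions, not F-UJI's implementation:

```python
import logging

logger = logging.getLogger("fuji")

# Vocabularies taken to indicate dataset-related metadata (an assumption;
# the real list would come from F-UJI's metadata mappings)
METADATA_VOCABS = (
    "http://purl.org/dc/terms/",
    "http://schema.org/",
    "https://schema.org/",
)

def rdfa_describes_data_object(rdfa_entries):
    """True if any parsed RDFa predicate points at a metadata vocabulary."""
    for entry in rdfa_entries:
        for predicate in entry:
            if predicate.startswith(METADATA_VOCABS):
                return True
    return False

def check_rdfa(rdfa_entries):
    """Fail with an explicit reason instead of a bare exception."""
    if not rdfa_describes_data_object(rdfa_entries):
        logger.warning("RDFa found, but not related to metadata of a data object")
        return False
    return True
```

Applied to the tun.fi output above (only xhtml/vocab#role predicates), this would log the elaborated message and fail the metric.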
(Dataset 84) https://ortus.rtu.lv/science/en/datamodule/243
A metadata document is included via a link relation:
<link rel="alternate" type="application/ld+json" href="http://scidata.vitk.lv/dataset/243.jsonld"/>
ACTION 4 - Extend the class 'MetaDataCollectorRdf' to handle JSON-LD reserialization of RDF-based metadata, using the package installed via:
pip install rdflib-jsonld
(Dataset 92) http://fel.hi.is/ISKOS1983
RDFa is detected by F-UJI, but it seems the information included is not about the data (see figure and code snippet below). ACTION 5 - check the RDFa parsing in fair_check.py.
[{'@id': '/fotur_2', '@type': ['http://rdfs.org/sioc/ns#Item', 'http://xmlns.com/foaf/0.1/Document'], 'content:encoded': [{'@value': 'Hafðu samband\n\n\t\tSími: 525 4545\n\n\t\tNetfang: felagsvisindastofnun@hi.is\n'}], 'http://purl.org/dc/terms/title': [{'@value': 'Fótur 2'}]}, ..., {'@id': '/ISKOS1983', '@type': ['http://xmlns.com/foaf/0.1/Document'], 'content:encoded': [{'@value': 'Gagnaskrá og tilheyrandi skjöl eru að finna hér að neðan. ...'}]}]
(Output truncated for readability. The /fotur_* subjects are page-footer blocks such as contact details, opening hours and social media links, and /ISKOS1983 carries the full Icelandic dataset description plus embedded page CSS as a single content:encoded string. All subjects are typed foaf:Document or sioc:Item; none is typed as a dataset.)
[Figure: Google Rich Results Test output]
It is not clear (to me) how I can trace the ACTION items suggested in this ticket in response to the reported discrepancies.
@ajaunsen I created a list of actions for @huberrob before leaving the project. @huberrob, can you please create an issue for each of the action items above?
OK, I have created some issues; see the list:
ACTION 1: https://github.com/pangaea-data-publisher/fuji/issues/84
ACTION 2: https://github.com/pangaea-data-publisher/fuji/issues/83
ACTION 3: https://github.com/pangaea-data-publisher/fuji/issues/82
ACTION 4: https://github.com/pangaea-data-publisher/fuji/issues/81
ACTION 5: https://github.com/pangaea-data-publisher/fuji/issues/80
Yes, I have seen them, thx. Hopefully they will be addressed in the near future :)
Regarding this issue, I agree there are several examples of non-related metadata passing as structured metadata. But how does F-UJI determine this, i.e., the relevance of the metadata to the dataset in such cases?
(Dataset 85) http://tun.fi/JX.1099681 F-UJI can identify RDFa included on the landing page, but the information included is not related to the metadata of the dataset; therefore the metric fails, see below:
'rdfa': [{'@id': '_:N2653f5de968b4c9da2cc3be78b685afc', 'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#alert'}]}, {'@id': '_:N37d7ac4315524da18ece684377ec22db', 'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#dialog'}]}]}
It seems like Mark's tool assumes this test passes as long as RDFa is included in the data page.
ACTION 3 - Elaborate the exception message when processing RDFa metadata, e.g., 'RDFa found, but not related to metadata of a data object'.
http://tun.fi/JX.1099681 is interesting and nicely shows some pitfalls for shallow testing:
Probably Mark's tool (or whatever library he is using in the background) is identifying the document as XHTML, since it makes use of the 'role' attribute in connection with valid XHTML vocabulary terms (here: alert).
Therefore, identifying RDFa in this example is not completely incorrect, since the RDFa validator https://www.w3.org/2012/pyRdfa/ also validates it and gives the following output:
@prefix ns1: <http://www.w3.org/1999/xhtml/vocab#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
[] ns1:role ns1:alert .
But of course there is no parsable content in terms of metadata for FAIR testing; it just contains this role="alert" in one of the HTML elements. It's somehow an empty RDFa ;)
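Such "empty" RDFa could be detected by checking whether any parsed predicate falls outside the XHTML vocabulary. A sketch, using the dictionary structure of the parser output shown earlier:

```python
XHTML_VOCAB = "http://www.w3.org/1999/xhtml/vocab#"

def is_empty_rdfa(entries):
    """True when every parsed RDFa statement uses only the XHTML vocabulary,
    i.e. there is no metadata content to assess (as with role="alert")."""
    for entry in entries:
        for predicate in entry:
            if predicate in ("@id", "@type"):
                continue  # JSON-LD keywords, not real predicates
            if not predicate.startswith(XHTML_VOCAB):
                return False
    return True
```
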
I forgot to paste the relevant section in the HTML source of the landing page:
<noscript>
<div class="alert alert-warning" role="alert" style="margin-bottom: 0">
<p>Suomen Lajitietokeskuksen nettisivu tarvitsee javascriptin toimiakseen. Tarkista selaimesi asetukset.</p>
<p>To use Finnish Biodiversity Information Facilities website, please enable JavaScript.</p>
</div>
</noscript>
Since all actions have been resolved or moved to other issues, I am closing this issue now.
@ajaunsen to provide the list of identifiers with large differences in terms of scores generated by the tool.