italia / ckanext-dcatapit

CKAN extension for the Italian Open Data Portals (DCAT_AP-IT)
GNU Affero General Public License v3.0

RDF Serialization - issue with the dct:format property value #3

Open giorgialodi opened 4 years ago

giorgialodi commented 4 years ago

In the RDF serialization of a dataset, the dct:format property may take the value OP_DATPRO even if the source catalogue correctly indicates the format using the EU controlled vocabulary, as required by the DCAT-AP_IT specs. This does not happen if the format of the distribution is, for instance, CSV. It seems to happen during the harvesting phase and for specific formats (e.g., all those related to RDF serializations such as RDF_XML, RDF_TURTLE, RDF_N_TRIPLES, etc.).

Example: Source Catalogue: Linked Data Platform with metadata compliant with DCAT-AP_IT

<http://dati.beniculturali.it/resource/Distribuzione/complessoArchivistico-GGASI-nt>
    a dcatapit:Distribution, dcat:Distribution ;
    dct:description "Distribuzione in formato N triples del dataset complessoArchivistico-GGASI " ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/OP_DATPRO> ;
    dct:license <https://w3id.org/italia/controlled-vocabulary/licences/C1_Unknown>, "https://creativecommons.org/licenses/by-nc/2.5/it/legalcode/" ;
    dct:title "Distribuzione in formato N triples del dataset complessoArchivistico-GGASI" ;
    dcat:downloadURL <http://dati.san.beniculturali.it/dataset/nt/complessoArchivistico-GGASI.nt>

In this case the serialized format is OP_DATPRO, whereas in the source catalogue it is correctly materialized with the following URI: http://publications.europa.eu/resource/authority/file-type/RDF_N_TRIPLES

Could this be caused by the limited set of format_mapping values at https://github.com/geosolutions-it/ckanext-dcatapit/blob/master/ckanext/dcatapit/dcat/profiles.py#L76 ?

In any case, the expected behaviour is that if the source correctly includes the format using the requested controlled vocabulary, no format mapping should be applied. We should simply use what is included in the source catalogue.

giorgialodi commented 4 years ago

The issue seems more complex than expected. It involves both the format field (CKAN metadata) and the distribution_format field (which we introduced for the DCAT-AP_IT profile).

We have three possible cases:

  1. metadata is entered through the web form --> no issue, since both format and distribution_format are materialized.

  2. harvesting of a source which is not compliant with the DCAT-AP_IT profile --> there is no distribution_format, only the format. If distribution_format is not available, the serialization phase calls the format_mapping object, which covers very few formats. Hence the mapping works for the most common formats, while "OP_DATPRO" is used for all the others. In the visualization phase an empty field is shown, since visualization uses distribution_format.

  3. harvesting of a source that is compliant with the profile --> it seems, from the code, that distribution_format is never materialized even if the data source uses the correct format from the EU controlled vocabulary. Only the format field is materialized. The result is the same as in case 2; that is, the format_mapping object is called once again. For common formats (e.g., CSV, JSON) everything is fine, while "OP_DATPRO" is used for all the others. This explains what I reported above, including why the CSV format is not affected.
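The fallback described in cases 2 and 3 can be sketched roughly as follows. This only illustrates the behaviour as I understand it and is not the actual code in profiles.py: the contents of FORMAT_MAPPING and the function name serialized_format are assumptions made for the example.

```python
# Hypothetical, deliberately small mapping standing in for the real format_mapping.
FORMAT_MAPPING = {"CSV": "CSV", "JSON": "JSON", "XML": "XML"}
DEFAULT_FORMAT = "OP_DATPRO"
FILE_TYPE_BASE = "http://publications.europa.eu/resource/authority/file-type/"


def serialized_format(resource_dict):
    """Return the dct:format URI that would be emitted for a resource."""
    code = resource_dict.get("distribution_format")
    if not code:
        # distribution_format is missing: fall back to the limited mapping,
        # and to the default code when the format is not in the mapping.
        code = FORMAT_MAPPING.get(resource_dict.get("format"), DEFAULT_FORMAT)
    return FILE_TYPE_BASE + code


# A harvested resource with only the format field set:
csv_uri = serialized_format({"format": "CSV"})
nt_uri = serialized_format({"format": "RDF_N_TRIPLES"})
```

Under these assumptions, csv_uri ends in CSV while nt_uri ends in OP_DATPRO, which matches the difference observed between CSV and the RDF serializations.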

The code involved should be the following:

In ckanext-dcatapit/ckanext/dcatapit/dcat/profiles.py, line 326:

resource_dict[key] = value

distribution_format is never materialized. In general, we have many OP_DATPRO values in the central registry because we harvest from RDF serializations of PAs (public administrations) that introduce this error.

Possible solution to fix the issue

We need to materialize distribution_format. In case 3, we need to take the last part of the URI from the EU controlled vocabulary and set distribution_format to it. In the serialization phase we will then use it to create the right node of the graph, and in visualization we will display it.

Case 2 is more complex. If the data source is not compliant, we have no reference to the EU controlled vocabulary; we only have CKAN's format (including some very strange ones: I saw "geo json" and even names of PDF files used as formats). In this case, format_mapping should be applied. However, since it is limited, we need to extend it to cover all the remaining formats. For the strange values included by PAs we will use OP_DATPRO. We may create a mapping file starting from these CKAN formats: https://github.com/ckan/ckan/blob/master/ckan/config/resource_formats.json. We will apply this mapping once, when we get the data from the harvesting, so that distribution_format is always materialized during serialization and visualization. We should verify the feasibility of this solution.
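A minimal sketch of this proposal, assuming the harvested value is available as a plain string. The function name materialize_distribution_format and the entries of CKAN_TO_EU are hypothetical; a real mapping would have to be built from resource_formats.json and checked against the EU vocabulary.

```python
FILE_TYPE_BASE = "http://publications.europa.eu/resource/authority/file-type/"
DEFAULT_FORMAT = "OP_DATPRO"

# Hypothetical extended mapping from free-text CKAN formats to EU file-type codes.
CKAN_TO_EU = {
    "csv": "CSV",
    "json": "JSON",
    "rdf": "RDF_XML",
    "ttl": "RDF_TURTLE",
    "nt": "RDF_N_TRIPLES",
}


def materialize_distribution_format(raw_format):
    """Derive distribution_format from the harvested format value."""
    if not raw_format:
        return DEFAULT_FORMAT
    value = raw_format.strip()
    if value.startswith(FILE_TYPE_BASE):
        # Case 3: compliant source, keep the last part of the URI untouched.
        return value[len(FILE_TYPE_BASE):] or DEFAULT_FORMAT
    # Case 2: non-compliant source, normalise and map; unknown values get the default.
    return CKAN_TO_EU.get(value.lower(), DEFAULT_FORMAT)


compliant = materialize_distribution_format(FILE_TYPE_BASE + "RDF_N_TRIPLES")
odd = materialize_distribution_format("report.pdf")
```

Here compliant is "RDF_N_TRIPLES" because no mapping is applied to a value that already uses the controlled vocabulary, while odd falls back to "OP_DATPRO". Running this once at harvest time would store the result in distribution_format for both serialization and visualization.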

Alternatively, we could dynamically derive distribution_format from CKAN's format every time serialization or visualization is executed. BTW: in CKAN's filters it would be better to show the DCAT-AP_IT formats rather than the current mess, where CKAN lets anyone enter a free-text format if it is not included in the JSON I pointed out above.