This repository aims to collect and share an updated list of Italian public open data sources as complete as possible in both human- and machine-readable formats.
It is also the official repository of harvesting sources for CKAN-IT when configured as open data harvester, ie. within the Piattaforma Digitale Nazionale Dati (PDND) - previously DAF. CKAN-IT provides everything you need to run CKAN plus a set of extensions for supporting Italian open data in a set of Docker images. If you are interested in an open data catalogue up and running in minutes, see italia/ckan-it.
There are two entities: an organization and a harvesting source. Both are described by a json file compliant with the schemas provided.
In orgs/
folder there are all organizations. In sources/
folder, all harvesting sources.
Before importing or just after exporting you should check compliance with schemas in schemas/
folder using two provided scripts.
You must have Python3 and the jsonschema module installed in your system.
bash validate_orgs.sh
bash validate_sources.sh
If you want to run them in a virtual environment, a Pipfile is provided to use with pipenv.
If you have pipenv, just run pipenv shell
and then pipenv install
before launch the validation scripts.
You can combine all entities from orgs/
and sources/
folders in single json files using export_all.py
script that creates two files in dist/
folder:
index.json
with a list of organizations and sources per organization;nested/index.json
with a deeper nested object (see #6 for details).To validate them you can use the provided validate_all.sh
script.
Organizations:
Harvesting sources:
#
for the url fieldThe following different types of harvesters are currently supported.
CKAN (ckan
): this type of harvesting enables the metadata exchange between two instances of standard CKAN with no specific extensions enabled, it uses the APIs version 3 offered by CKAN and it can be used when both instances do not comply with any specific national standards.
CSW server (multilang) (multilang
): this type of harvesting imports geospatial metadata made available through the CSW protocol, the multilang extension manages multilingual fields for datasets, groups, tags and organizations.
Generic DCAT RDF Harvester (dcat_rdf
): this type of harvestings allows you to collect metadata that are described using the RDF (Resource Description Framework) Web standard, following the DCAT specifications. It is used in order to harvest data sources that are compliant with DCAT-AP_IT, an extension we built upon the CKAN's DCAT RDF harvester.
DCAT JSON Harvester (dcat_json
): this type of harvesting is used when data.json
metadata files are provided by data sources. The data.json is an API endpoint for describing open data catalogs by leveraging the U.S. Project Open Data metadata schema standards and usually data management systems like Socrata and DKAN offer such an endpoint. Therefore, this type of harvest can be used in all those cases in which Socrata or DKAN platforms are used at data sources. This type of harvesting comes with the earlier mentioned CKAN's DCAT RDF harvester.
CKAN Harvester for DCATAPIT (CKAN-DCATAPIT
): this type of harvesting enables the metadata exchange between two instances of standard CKAN, dealing with some requirements of the DCAT-AP_IT specifications (e.g., mapping on the 13 data themes and on licences controlled vocabulary). This harvesting should be used when a DCAT-AP_IT compliant catalog wants to harvest a CKAN catalog of a data source that is not compliant with DCAT-AP_IT.
DCAT-AP_IT CSW Harvester (DCAT_AP-IT CSW Harvester
): this type of harvesting is used to collect metadata made available through the OGC CSW protocol. It is built upon the CKAN's spatial extension, and inherits all of its functionalities. This harvester allows you to harvest DCAT-AP_IT dataset fields from the ISO metadata. It is useful wben dealing with harvesting of open geospatial data.
Install, setup and run CKAN from the official repository.
If you are ok with the official docker images provided, simply run them setting the environment variable CKAN_HARVEST="true"
.
Read more here for details.
Otherwise you can manually use the import_all.sh
script on a running instance of CKAN-IT.
bash import_all.sh APIKEY HOST
where APIKEY is the API key of your admin user (read more here for details) and HOST is the CKAN host (ie. localhost:5000
).Now follow these steps to run CKAN-IT harvesting process.
If you are running a CKAN-IT instance with many harvesting sources defined (ie. using the web interface), you can export them all using export_orgs.py
and export_sources.py
scripts. You must have Python3 installed or use pipenv with provided Pipfile.
Contributions are welcome. Feel free to open issues and submit a pull request at any time!