
🦄 OGD Auto AI Analyzer

Almost automatically analyze the quality of a DCAT metadata catalog. With a little help from ✨ AI...


Contents

- [Usage](#usage)
- [What does the code do?](#what-does-the-code-do)
- [What exactly do we check?](#what-exactly-do-we-check)
- [Why check metadata?](#background-why-check-metadata)
- [How to fix this?](#how-to-fix-this)
- [Project team](#project-team)
- [Feedback and contributing](#feedback-and-contributing)

Usage

[!Note] The notebook is set up as a Quarto file. You don't need to use Quarto. You can simply run the notebook as is and look at the results. However, we encourage you to try it out with Quarto. The results will be much more shareable, e.g. with a non-technical audience that doesn't want or need to see code. Simply install Quarto, add an extension to your IDE, and convert the notebook to an HTML or PDF file.

What does the code do?

We carry out a thorough metadata analysis and quality check using our own OGD metadata catalog of the Canton of Zurich as an example.

This project is based on two simple ideas:

- We set up the code to perform most of the checks automatically.
- It should be easy to adapt this notebook to other data catalogues that conform to the DCAT-AP CH standard.
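
To give an idea of where such an adaptation could start, here is a minimal, hedged sketch (not part of the notebooks) that loads a DCAT catalog with rdflib and lists each dataset's title and description. The catalog URL is a placeholder; point it at your own DCAT-AP CH export.

```python
# Minimal sketch (not part of the notebooks): load a DCAT catalog with rdflib
# and list each dataset's title and description.
from rdflib import Graph
from rdflib.namespace import DCAT, DCTERMS, RDF

CATALOG_URL = "https://example.org/ogd/catalog.rdf"  # placeholder URL

g = Graph()
g.parse(CATALOG_URL, format="xml")  # DCAT catalogs are often published as RDF/XML

for dataset in g.subjects(RDF.type, DCAT.Dataset):
    title = g.value(dataset, DCTERMS.title)
    description = g.value(dataset, DCTERMS.description)
    print(f"{title} - {str(description)[:80] if description else 'MISSING DESCRIPTION'}")
```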

The two notebooks produce the following outputs:

[!Important] At the risk of stating the obvious: by using the code parts for the LLM-based analysis, you send data to a third-party provider, namely OpenAI. Therefore, only use non-sensitive data. Again, stating the obvious: LLMs make errors. They regularly hallucinate, make things up, and get things wrong. They often do so in subtle, non-obvious ways that may be hard to detect. This app is meant to be used as an assistive system that makes suggestions. It only yields a draft of an analysis that you should always double-check.

What exactly do we check?

We focus on the following points:

These checks encompass the metadata at both the dataset and distribution levels.
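
As an illustration of what such checks can look like (the field names and rules below are simplified examples, not the project's full check list), a completeness check might flag missing mandatory properties per record:

```python
# Illustrative sketch of rule-based completeness checks. The required-field
# lists follow DCAT-AP CH terminology but are examples, not the full rule set.
REQUIRED_DATASET_FIELDS = ["title", "description", "publisher", "contact_point", "keywords"]
REQUIRED_DISTRIBUTION_FIELDS = ["access_url", "format", "license", "rights"]

def check_completeness(record: dict, required: list[str]) -> list[str]:
    """Return the required fields that are missing or empty in a record."""
    return [field for field in required if not record.get(field)]

dataset = {"title": "Luftqualität 2023", "description": "", "publisher": "Statistisches Amt"}
issues = check_completeness(dataset, REQUIRED_DATASET_FIELDS)
print(issues)  # ['description', 'contact_point', 'keywords']
```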

With the second notebook you get an in-depth analysis of each dataset's title and description. We prompt an ✨ LLM to assess whether the title and description clearly and in detail explain:

You also get a score for each dataset from 1 (least informative) to 5 (most informative). The scoring is as follows:
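
For orientation, here is a hedged sketch of what such an LLM call can look like with the OpenAI Python SDK. The model name and prompt wording are placeholders, not the exact prompt used in the notebook.

```python
# Hedged sketch: ask a model to rate how informative a dataset's title and
# description are on a 1-5 scale. Model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def rate_metadata(title: str, description: str) -> str:
    prompt = (
        "You are reviewing open government data metadata.\n"
        f"Title: {title}\nDescription: {description}\n"
        "Assess whether title and description clearly explain what the data "
        "contains, and return a score from 1 (least informative) to 5 "
        "(most informative) with a one-sentence justification."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```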

Background: Why check metadata?

Metadata is essential for data users. Only with an understanding of context, methodology, content, and quality can they fully utilize the data. Creating good metadata requires time and effort. Unfortunately, not all metadata meets sufficient quality standards. We observe issues both in our catalog and others, such as opendata.swiss.

Swiss OGD offerings follow the DCAT-AP CH standard, the «Swiss Application Profile for Data Portals and Catalogues». While DCAT is beneficial and widely adopted, it can be easily «hacked».

These are real issues. If you look at OGD catalogues, you'll easily find many examples like these, as well as quite a few datasets that perfectly adhere to the standard but are completely broken.

[!Note] These problems are not the «fault» of DCAT. The standard is a sincere recommendation, but it cannot ensure that every entry is meaningful. This responsibility lies with us as data stewards and publishers.

How to fix this?

At the time of writing, our own OGD catalog lists ~800 datasets and opendata.swiss ~12,000. Manually checking each dataset for metadata quality issues is unrealistic. One way to improve this is to develop automatic procedures that programmatically check and highlight metadata issues. This project offers a template and, hopefully, some fresh ideas to achieve this.
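
One possible building block (purely illustrative, using a hypothetical record structure rather than the notebooks' actual output) is to aggregate the flagged issues into a simple report so the most common problems can be fixed first:

```python
# Purely illustrative: summarize flagged issues across the catalog with pandas.
import pandas as pd

# `records` stands in for the output of the checks above; the structure is hypothetical.
records = [
    {"title": "Dataset A", "issues": ["description", "keywords"]},
    {"title": "Dataset B", "issues": []},
    {"title": "Dataset C", "issues": ["contact_point"]},
]

report = pd.DataFrame(records)
report["n_issues"] = report["issues"].apply(len)

# Datasets with the most issues first, so fixes can be prioritized.
print(report.sort_values("n_issues", ascending=False))
```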

Project team

This is a project of Team Data of the Statistical Office of the Canton of Zurich. Responsible: Laure Stadler and Patrick Arnecke. Many thanks go to Corinna Grobe and our former colleague Adrian Rupp. Merci! ❤️

Feedback and contributing

We would love to hear from you. Please share your feedback and let us know how you use the code. You can write an email or share your ideas by opening an issue or a pull request.

Please note that we use Ruff for linting and code formatting with default settings.