routetopa / tet

TET - Transparency-Enhancing Toolset
http://routetopa.eu
GNU Affero General Public License v3.0

Dataset anomaly detection #51

Closed arekstasiewicz closed 7 years ago

arekstasiewicz commented 7 years ago

User Interface + API for dataset anomaly detection

serahkiburu commented 7 years ago

Greetings @arekstasiewicz, have you heard about Good Tables before? https://github.com/frictionlessdata/goodtables-py

arekstasiewicz commented 7 years ago

Hello @callmealien, thanks for the link. In our case, anomaly detection concerns the values themselves, not just validation.

The overall 'dashboard' will look like this:

[image: anomaly detection dashboard]

pwalsh commented 7 years ago

Hi @arekstasiewicz, it is quite trivial to write a custom processor for goodtables to do "anomaly detection" on values (deviation from the mean, etc.).

However, it is hard to get anomaly detection right in a way that is generally applicable to open data (domain knowledge is required to understand what an anomaly could be for a particular dataset or theme).

goodtables has been designed to address problems that are persistent in published open data and chronically impair its reuse, yet are hard for publishers to detect with current publication flows to open data portals. I really recommend you take a deeper look ;)
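
For illustration, the "deviation from mean" idea mentioned above can be sketched in plain Python. This is only a minimal sketch and does not use the goodtables API: the function name `flag_deviations`, the `threshold` parameter, and the sample column are assumptions made up for this example.

```python
from statistics import mean, stdev


def flag_deviations(values, threshold=3.0):
    """Return (index, value, z_score) for values far from the column mean.

    Hypothetical helper: a value is flagged when it lies more than
    `threshold` sample standard deviations away from the mean.
    """
    if len(values) < 2:
        return []  # a standard deviation needs at least two values
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    flagged = []
    for index, value in enumerate(values):
        z_score = (value - mu) / sigma
        if abs(z_score) > threshold:
            flagged.append((index, value, z_score))
    return flagged


# Made-up numeric column with one suspicious value at index 6.
column = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 55.0, 10.1]
flagged = flag_deviations(column, threshold=2.0)
print(flagged)
```

Note that the extreme value inflates the mean and standard deviation it is measured against, so on a column this short the default threshold of 3.0 flags nothing. Choosing the threshold is exactly the kind of decision that depends on the dataset.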

mohadelrezk commented 7 years ago

Hi @pwalsh, we are not saying that goodtables is trivial or that it can't do the needed task. We are using the LOF (Local Outlier Factor) algorithm for anomaly detection, with a couple of data validation and transformation stages to enable quantitative analysis over qualitative data. We will consider your goodtables project for validation, although we already have a simpler data validation service that fulfills our needs. I also agree with you regarding the domain knowledge needed for anomaly decisions, but in our case a false alarm (a normal value flagged as an anomaly) is better than a missed anomaly (an anomaly reported as normal), and the final decision will be taken by data owners or data users, who are in most cases domain experts. Thanks for your valuable input :)
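
As a rough illustration of how LOF scores points, here is a minimal pure-Python sketch. It is not the TET implementation: the function name `lof_scores`, the choice of `k`, and the sample points are assumptions for this example, and the neighbourhood is simplified to exactly `k` nearest neighbours (ties at the k-th distance are not expanded).

```python
import math


def lof_scores(points, k=2):
    """Return one Local Outlier Factor score per point.

    Scores near 1 mean the point is about as dense as its neighbours;
    scores well above 1 mean it sits in a much sparser region.
    """
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(points)")

    # Pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbours and the distance to the k-th one.
    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist[i][j])
        knn = order[:k]
        neighbors.append(knn)
        k_distance.append(dist[i][knn[-1]])

    def local_reachability_density(i):
        # reach-dist(i, j) = max(k-distance(j), d(i, j))
        total = sum(max(k_distance[j], dist[i][j]) for j in neighbors[i])
        return math.inf if total == 0 else len(neighbors[i]) / total

    lrd = [local_reachability_density(i) for i in range(n)]

    scores = []
    for i in range(n):
        if math.isinf(lrd[i]):
            # Point coincides with its neighbours (duplicates): treat as inlier.
            scores.append(1.0)
            continue
        ratios = [lrd[j] / lrd[i] for j in neighbors[i]]
        scores.append(sum(ratios) / len(ratios))
    return scores


# Made-up data: five clustered points and one far-away point (index 5).
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2),
          (1.2, 1.1), (1.0, 0.8), (8.0, 8.0)]
scores = lof_scores(points, k=2)
for point, score in zip(points, scores):
    print(point, round(score, 2))
```

Where to cut off "anomalous" scores is left open here on purpose. In line with the preference above for false alarms over missed anomalies, a low cutoff would flag more points and leave the final call to the data owners.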

pwalsh commented 7 years ago

Hi @mohadelrezk, OK, great! Is the simpler data validation service you refer to available as open source? I'd love to take a look.

arekstasiewicz commented 7 years ago

Work continues in #89.