openml / openml-data

For tracking issues related to OpenML datasets

Import the outlier detection benchmark results from http://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/ #17

Open kno10 opened 7 years ago

kno10 commented 7 years ago

http://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/

This is a repository of outlier detection benchmark data and results.

Every data set comes with a downloadable "raw algorithm results" package containing the results of a few hundred (algorithm, parameter) combinations on that data set, plus a separate file with the generated evaluation results. Alternatively, you could import only the best-of results.

As mentioned in https://github.com/openml/java/issues/6 it would also be nice to have a "submit to OpenML" function in ELKI; on the other hand, OpenML could use ELKI for evaluating outlier and clustering results. ELKI has 19 supervised evaluation measures for clustering and 9 internal evaluation measures, with 3 different strategies for handling noise objects; for outlier evaluation, it has 4 measures plus adjustment for chance for them, which yields 7 interesting measures. Except for the internal cluster evaluation measures (which may need O(n^2) memory and pairwise distances), they are all very fast to compute.
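As a hedged illustration of "adjustment for chance" (the generic rescaling scheme only, not necessarily the exact formulas ELKI implements), a measure is shifted so that 0 means "no better than a random ranking" and 1 means perfect:

```python
def adjusted_for_chance(value, expected, maximum=1.0):
    # Rescale so that the expectation under a random ranking maps to 0
    # and the best possible value maps to 1.
    return (value - expected) / (maximum - expected)

# For ROC AUC the expected value of a random ranking is 0.5; for average
# precision it is (approximately) the fraction of outliers in the data set.
adjusted_auc = adjusted_for_chance(0.83, expected=0.5)    # -> 0.66
adjusted_avep = adjusted_for_chance(0.40, expected=0.05)  # assumes 5% outliers
```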

I don't have the capacity right now to do the integration myself, but I can assist, e.g. with adapting the scripts used to generate the above results. Or we could simply transfer the data as ASCII for submission? From the API documentation, I do not understand how to format result data for submission. Are arbitrary file types allowed, or only ARFF? How are evaluation results uploaded?

joaquinvanschoren commented 7 years ago

Hi Erich,

Thanks, looks super-interesting. A lot of them seem to originate from UCI. Are there some for which the original data is not on OpenML already? Am I correct that the data is the same as the UCI data but with an additional feature indicating the outliers? How do you establish the ground truth about which points are outliers?

For the integration, we'll need the following:

It seems you are focussing on the unsupervised setting. Should we consider the supervised setting as well (classification and regression)?

We're doing a hackathon in Munich on Feb 27 - Mar 3. Would you be able to help us during that week?

Cheers, Joaquin

kno10 commented 7 years ago

Most data sets are derived from UCI data sets. In order to obtain "outlier" data sets, authors often downsample all but the largest class, but few publish their exact resulting data set. For reproducibility, we made static samples and uploaded them. Also, most methods won't accept non-numeric attributes, so there are variants with different preprocessing, e.g. one-hot encoding of categorical attributes. All the data and experiments in that repository are for unsupervised outlier detection.
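For illustration, a minimal sketch of such a derivation (file name, column names and the sampling rate are made up; the repository publishes fixed, static samples precisely so that this step does not have to be repeated):

```python
import pandas as pd

# Made-up file and column names, for illustration only.
df = pd.read_csv("some_uci_dataset.csv")

largest = df["class"].value_counts().idxmax()
inliers = df[df["class"] == largest]
# Downsample everything that is not the largest class to a small fraction,
# so that those points become rare "outliers".
outliers = df[df["class"] != largest].sample(frac=0.02, random_state=0)

data = pd.concat([inliers, outliers])
data["outlier"] = (data["class"] != largest).astype(int)
data = data.drop(columns=["class"])

# Most unsupervised methods need purely numeric input, hence the variants
# with one-hot encoded categorical attributes.
data = pd.get_dummies(data)
```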

We don't have an ARFF writer yet. But if I have some spare time, I can try to first add ARFF output (which is likely of wide interest) and then provide you with a command-line call for evaluation. There is also the problem that some outlier methods assign small values to outliers, while most assign large values, so some metadata is necessary to interpret the results. The result files you have on the web page are easy: one row per run, one column per object, with the outlier score only (this doesn't include the above metadata; you have to know that e.g. FastABOD uses small scores for outliers; our evaluation tool has a regexp to recognize known methods based on the row label).
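A hedged sketch of how such a result file could be consumed (the whitespace-separated layout, the label field, and the regexp of small-score methods are assumptions for illustration, not the exact conventions of the ELKI evaluation tool):

```python
import re
import numpy as np
from sklearn.metrics import roc_auc_score

# Regexp of methods whose small scores indicate outliers (illustrative only).
SMALL_IS_OUTLIER = re.compile(r"ABOD", re.IGNORECASE)

def evaluate_runs(path, labels):
    """labels: 1 for outliers, 0 for inliers, in the same object order."""
    results = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            name, scores = fields[0], np.array(fields[1:], dtype=float)
            if SMALL_IS_OUTLIER.search(name):
                scores = -scores  # flip so that larger means more outlying
            results[name] = roc_auc_score(labels, scores)
    return results
```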

Supervised setting: ELKI only has very basic supervised methods such as kNN classification, largely because Weka etc. are already rather good here, and it makes more sense to implement methods that aren't already available somewhere else.

I don't know about end of February yet.

joaquinvanschoren commented 7 years ago

Hi Erich,

Would it be interesting to import these as well? https://arxiv.org/abs/1503.01158

Cheers, Joaquin


kno10 commented 7 years ago

The data sets are rather similar; I'm not sure adding another 100+ variants of the same UCI "mother set" (as it is called in their article) adds much, in particular as the UCI data sets are not very well suited for anomaly detection and need such derivations in the first place. Their work also follows a much more complex derivation procedure (with a KLR to estimate the difficulty of an anomaly, and very tight control of the exact rate of anomalies), while our repository follows the procedure found in various published literature, random downsampling of the minority class; i.e. we tried to reproduce the data sets that were used in earlier work. What I like most about their work is adding confidence intervals to random rankings of ROC AUC / AveP, to be able to check whether a result is significantly better than a random result. This is an interesting reference value to judge how good the results really are.
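As a rough illustration of that reference value (the paper may derive its intervals analytically; this is only a Monte Carlo sketch of the idea):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Empirical (1 - alpha) interval of ROC AUC for completely random rankings on
# a data set with n objects of which n_out are outliers.
def random_ranking_interval(n, n_out, trials=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    labels = np.zeros(n, dtype=int)
    labels[:n_out] = 1
    aucs = [roc_auc_score(labels, rng.random(n)) for _ in range(trials)]
    return np.quantile(aucs, [alpha / 2, 1 - alpha / 2])

low, high = random_ranking_interval(n=1000, n_out=50)
# A measured ROC AUC above `high` is unlikely to be explained by chance alone.
```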