openml / benchmark-suites


Artificial vs Simulated datasets #28

Closed mfeurer closed 6 years ago

mfeurer commented 6 years ago

Currently, we have some simulated datasets in our list of datasets, but we also removed several simulated datasets as being "artificial". However, it is very unclear where to draw the line and based on what criteria we would include a dataset as simulated or exclude it as artificial.

Examples of simulated datasets in our list:

  • MagicTelescope
  • higgs

Examples of artificial datasets in our list:

  • waveform-5000

joaquinvanschoren commented 6 years ago

Magic and Higgs are data from complex (and realistic) physics simulations, right? So in my view that's different from generating data through simple functions. We could label them 'simulation data' or something like that?


mfeurer commented 6 years ago

I totally agree. We should keep simulated data and only remove artificial data. The only questions are where to draw the line and whether everyone agrees on this.

janvanrijn commented 6 years ago

TODO 1: remove waveform-5000
TODO 2: go over the list of currently 'artificial' datasets once more and check them

mfeurer commented 6 years ago

Here's the promised 'classification' of artificial datasets:

janvanrijn commented 6 years ago

cool, thanks for taking care of this.

So if I understand correctly, the only datasets up for discussion are the 'gametes' ones, which could (due to the simulation criterion) be added. All the others are clearly artificial and should not be included.

mfeurer commented 6 years ago

I think so, too.

janvanrijn commented 6 years ago

I quickly skimmed through the paper, and although my knowledge of biological processes is very limited, the GAMETES datasets do not seem to simulate any process occurring in nature, but rather an abstract mathematical model. For this reason I would classify them as artificial rather than simulated.

Anyone who disagrees? In particular, what do @berndbischl @frank-hutter @joaquinvanschoren @mfeurer think?

berndbischl commented 6 years ago

i know you don't want to hear this, but our rule for exclusion seems a bit problematic, as we now have to argue whether certain simulators simulate realistic things or not.

i looked a bit at the paper. i cannot read every line now due to very severe time constraints, but the authors certainly seem to claim that they simulate something useful and realistic.

i don't know whether we can / should spend hours now discussing how correct that claim is? i certainly cannot do that quickly, and i will lack the time (and expertise!) in the future, in general...

ideas?

mfeurer commented 6 years ago

I think @berndbischl raises an important issue. While we have automated parts of the benchmark generation (or are working on this), such criteria cannot be automated (yet). Unfortunately, I do not know a solution to this particular problem, especially as the authors state (about the proposed method):

While the probability is low that these types of ‘extreme’ epistatic interactions occur in biology by chance alone, we instead focus on the fact that they ‘can’ occur. With that in mind, our focus on pure strict epistasis is intended to promote the development of strategies that can accommodate even the most challenging relationships. In doing so, we make minimal assumptions about the true nature of biological interaction.

This is an open point and we might want to discuss this in the paper or the longer paper?
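To make this division concrete, here is a minimal sketch (not the project's actual selection pipeline) of how the machine-checkable criteria could be separated from the manual artificial/simulated judgement. It assumes a recent openml-python; the instance threshold and the flagged-name set are purely illustrative.

```python
# Minimal sketch, not the suite's real selection code: criteria that can be
# checked automatically from OpenML metadata are separated from the
# artificial-vs-simulated judgement, which has to come from a human-made list.
# MIN_INSTANCES and MANUALLY_FLAGGED_ARTIFICIAL are illustrative assumptions.
import openml

MIN_INSTANCES = 500
MANUALLY_FLAGGED_ARTIFICIAL = {"waveform-5000"}  # extended by hand after reviewing each paper

# List all datasets together with their stored qualities.
datasets = openml.datasets.list_datasets(output_format="dataframe")

# Automatable part: filter on qualities OpenML already computes.
automatic = datasets[
    (datasets["NumberOfInstances"] >= MIN_INSTANCES)
    & (datasets["NumberOfClasses"] >= 2)
]

# Non-automatable part: exclude datasets a human has judged to be artificial.
candidates = automatic[~automatic["name"].isin(MANUALLY_FLAGGED_ARTIFICIAL)]
print(f"{len(candidates)} candidate datasets remain after both filters")
```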

janvanrijn commented 6 years ago

Thanks both for your replies; you raise an important issue. It is indeed a problem that this cannot be automated, and as far as I am concerned there is no real way to automate it. Even the criterion 'real-world dataset' vs. 'artificial/simulated dataset' is something that cannot be automated.

There are several solutions that we can discuss:

1) We decide to remove both the artificial and the simulated datasets. Although formally this seems easy, I think a lot of the datasets are still open for debate in this regard (e.g., the datasets based on games, higgs, and probably more).

2) As the creators of the benchmark suite, we have a responsibility to make sure that the inclusion criteria are met in a reasonable way, but also some leeway in the decision process. As the GAMETES datasets are the only ones under consideration because of this rule, we could email the authors and ask them how they view this.

3) We leave this criterion up to the authors of the publication: if the authors claim the dataset is a simulation, we consider it a simulation; if they claim it is artificial, we consider it artificial. This seems like a reasonably low-effort solution (we have skimmed the associated paper for every dataset anyway) and it is also highly objective (at least from our side).

4) Drop the requirement entirely (no artificial / simulated exclusion). This would make our lives a lot easier and would make the benchmark suite quite a bit bigger. The Bayesian-network-generated datasets would still be left out (as they are derived from another OpenML dataset), as would concepts that are too simple according to the decision-tree rule / single-feature rule. We would be left with high-quality artificial datasets, as they all have in common that a set of authors took the effort to defend in a paper why each is a good dataset.

My personal preference goes to option 3 or 4. Even though option 4 was more or less an afterthought that developed while writing it down, the more I think about it the better it seems. I could investigate how many artificial datasets would end up in the suite if we were to drop this requirement. Any thoughts?
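A rough sketch of how that count could be obtained with openml-python. The tag names below are hypothetical (this thread does not define how the candidate list or the artificial datasets are tagged); in practice the hand-curated lists from above would be used instead.

```python
# Hypothetical sketch of the proposed investigation (option 4): count how many
# of the candidate datasets would only enter the suite if the artificial/
# simulated requirement were dropped. Both tag names are assumptions.
import openml

candidates = openml.datasets.list_datasets(tag="benchmark-suite-candidates",
                                            output_format="dataframe")
artificial = openml.datasets.list_datasets(tag="artificial",
                                            output_format="dataframe")

overlap = set(candidates["did"]) & set(artificial["did"])
print(f"{len(overlap)} of {len(candidates)} candidate datasets are tagged "
      "artificial and would (re-)enter the suite if the requirement were dropped")
```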

mfeurer commented 6 years ago

We (@berndbischl @janvanrijn @mfeurer) decided to be conservative and not use anything artificial. Therefore, we'll kick out MagicTelescope (1120).