amueller commented 6 years ago

I just realized that of the 20000 datasets we advertise, only about 2500 are available, the rest are in preparation. It's unclear what that means, given that they are mostly months or years old uploads.

If the semantics of "available" is "human (@joaquinvanschoren) verified" we should call it verified and also list unverified datasets by default.

This also seems to suggest that most people that uploaded something to openml never saw their dataset show up on the site, which is not great.

So we should do at least one of two things: a) activate or decline most datasets b) rephrase / rename what "active" means.

Right now I feel like saying that openml hosts 20000 datasets looks like it's stretching the truth (even though all the functionality might be available for "in preparation" datasets - not sure).

cc @berndbischl @janvanrijn @joaquinvanschoren @mfeurer

joaquinvanschoren commented 6 years ago

Indeed 'active' means verified in the sense that it was checked for a few things, e.g. whether it could be parsed. This is meant to be automated, but it's still manual right now.

Last week Jan implemented an API call that allows users to activate their own datasets. We're could put this online now. The question is whether we want to implement some simple tests in the backend before the dataset is actually activated. @janvanrijn: how do you see this?

amueller commented 6 years ago

Checking whether it could be parsed can easily be automated. So if the status of the 17k datasets is "we haven't tried parsing them" then we really should not advertise having 20k datasets and I feel it's questionable to write this in a grant or paper.

So we should settle on the semantics and what the default behavior of the API and website are. Just listing all datasets by default would make a very big difference in user experience, both for the uploader and the user. But that clearly doesn't make sense if we're not even sure we can parse the data.

We could even make all the currently active datasets "verified" and then do some more rudimentary testing (like parsing) to make all the rest active.

I don't think this is a question of implementing API calls. It's a question of what do we want the semantics to be and what do we want the user experience to be and then clearly communicate it. The implementing is the easy part.

joaquinvanschoren commented 6 years ago

Question: does the openml fetcher only fetch active datasets? If it fetches any dataset given the ID, then users can really access all of them, we just chose not list only active datasets in the frontend by default (I believe @amueller asked for this :)). You can switch off that filter to see all datasets in the frontend as well.

Also about 15,000 of those are drug discovery datasets. IIRC, Matthias asked to not make them active yet because some of them have classes with zero instances, which was an issue at the time. Not sure if this is still the case? If this is no longer a problem, I could activate all of them. Otherwise I first have to upload new versions for a subset of them, which may take me a while.

amueller commented 6 years ago

That doesn't clarify what "active" means, though. And yes, I'm all for not listing datasets that we can not parse.

joaquinvanschoren commented 6 years ago

@amueller:

the API call is a necessary step to automate the checking. The REST API itself cannot do complicated (slow) things like parsing datasets, so this needs to be an external process.
we did actually try to parse them all. Lars created a CI job (in Jenkins) that tried to download and parse all datasets, IIRC using the R API. What we didn't yet do is then actually activate the datasets because there was no API call for that at the time. It also ran into memory issues when trying to load very large datasets. https://github.com/openml/OpenML/issues/331
we can easily list all datasets by default. It's just that we earlier agreed to only show active ones by default and allow users to drop that filter: https://github.com/openml/OpenML/issues/367
semantically, active really means verified. The thing is that we don't have difference in semantics between manually and automatically verfied, in part because we always meant to NOT do this manually, always automatically

To move things ahead, and automate activation (which would make me really happy :)), how about the following strategy:

We write a script that does a list of checks. It downloads all non-active datasets, tries to parse them (and maybe does some other checks), and then activates the datasets. We could do this in Python, as the liac-arff parser seems to be the most strict one. It should be feasible to add an R check as well
If the dataset passes the tests, it is automatically activated. If not, a useful error message is stored in the database and shown to the user.

joaquinvanschoren commented 6 years ago

Current work in writing this script is here: https://github.com/openml/OpenML/issues/476

It already has a simple script that lists datasets and tries to parse them using the Python API. If the python API is extended to also support the /data/status API, we can easily use it to activate all other good datasets.

The semantics here are defined as 'Can I load it into the Python API'? Ideally, we make this more general and also do a check with the R API, but one thing at a time.

Typical error found this way (that would block activation): General:

Dataset name has disallowed characters (should be URI safe)
Cannot download (typically this is a server issue)
Sparse ARFF while not indicated as such ARFF-specific:
End-of-line comments behind the attributes in the header
AttributeError: there can be no spaces before @Attribute
BadAttributeType: confusing list of values, e.g. {-1,1} instead of {-1, 1}
datetime attributes (not sure what is the issue, it seems to bother the liac_arff parser?)

amueller commented 6 years ago

Sounds great. Though @janvanrijn just said most of the not active datasets contain invalid arff?

amueller commented 6 years ago

476 is not a PR, it was a list of issues in active datasets. I'm not sure which script you're referring to.

amueller commented 6 years ago

(sorry misclick)

joaquinvanschoren commented 6 years ago

I have to check, but indeed those QSAR datasets have some issues, see: https://github.com/openml/openml-python/issues/310

I'd like to check how many datasets (in general) are valid and not yet active.

476 has a little script in the comments that you could modify, e.g.:

from openml import datasets
import sys

offset = 0
size = 10000
dids = []
while size == 10000:
    res = datasets.list_datasets(offset=offset, size=size, status='in_preparation')
    dids.extend([did for did, dict in res.items()])
    size = len(res)
    offset += len(res)
openml_datasets = []
error_ids = []
for dataset_id in dids:
    try:
        openml_datasets.append(datasets.get_dataset(dataset_id))
        print(dataset_id, end=' ')
    except:
        print("Unexpected error on dataset ",dataset_id,": " , sys.exc_info()[0])
        error_ids.append(dataset_id)

This script seems to be happy with many datasets that are currently inactive. However, this is quite spammy, will try to come up with something better.

mfeurer commented 6 years ago

We made a recent change in the arff parser which allows parsing such files. Not sure if that's good, though...

You can also check out the scripts in https://github.com/openml/openml-serverdata-quality-bot which also try to download all datasets in order to check them.

amueller commented 6 years ago

@mfeurer are we doing anything with that bot / results yet? This seems great!

mfeurer commented 6 years ago

Nope, they only provide output that has to be manually interpreted.

amueller commented 6 years ago

These datasets are not parsable in python: [1438, 4800, 41190]

There are all parsable in python:

[473, 489, 1024, 1047, 1057, 1095, 1456, 4133, 4675, 4709, 40600, 40630, 40631, 40641, 40642, 40643, 40644, 40733, 40735, 40736, 40738, 40739, 40740, 40741, 40742, 40743, 40744, 40749, 40754, 40755, 40756, 40759, 40760, 40761, 40762, 40763, 40764, 40765, 40766, 40767, 40768, 40769, 40770, 40771, 40772, 40773, 40774, 40775, 40776, 40777, 40778, 40779, 40780, 40781, 40782, 40783, 40784, 40785, 40786, 40787, 40788, 40789, 40790, 40791, 40792, 40793, 40794, 40795, 40796, 40797, 40798, 40799, 40800, 40801, 40802, 40803, 40804, 40805, 40806, 40807, 40808, 40809, 40810, 40811, 40812, 40813, 40814, 40815, 40816, 40817, 40819, 40820, 40821, 40822, 40823, 40824, 40825, 40826, 40827, 40828, 40829, 40830, 40831, 40832, 40833, 40834, 40835, 40836, 40837, 40838, 40839, 40840, 40841, 40842, 40843, 40844, 40845, 40846, 40847, 40848, 40849, 40850, 40851, 40853, 40854, 40855, 40856, 40858, 40859, 40860, 40861, 40862, 40865, 40870, 40871, 40872, 40873, 40875, 40877, 40878, 40883, 40884, 40886, 40887, 40888, 40889, 40891, 40892, 40894, 40895, 40897, 40901, 40902, 40903, 40904, 40905, 40906, 40907, 40908, 40909, 40912, 40913, 40921, 40952, 40953, 40954, 40955, 40959, 40960, 40964, 40965, 40967, 40969, 40972, 40973, 40989, 40990, 40991, 40995, 41008, 41010, 41014, 41015, 41016, 41023, 41024, 41025, 41037, 41038, 41048, 41049, 41050, 41051, 41052, 41053, 41068, 41069, 41071, 41072, 41073, 41075, 41076, 41077, 41078, 41079, 41086, 41096, 41099, 41100, 41110, 41119, 41122, 41175, 41193, 41194, 41195, 41198]

amueller commented 6 years ago

Some of them are clearly iris: https://www.openml.org/d/40954 https://www.openml.org/d/40789 So activating them automatically seems like a bad idea

janvanrijn commented 6 years ago

In many cases this is impossible to spot for humans. I mean, iris can be spotted easily, but other datasets are a bit more though. Especially this is something that (sh/c)ould be automated.

amueller commented 6 years ago

yes

amueller commented 6 years ago

how would you test this, though? compare the shape with all known datasets? The name? The feature names?

janvanrijn commented 6 years ago

A good improvement over the current state would be a set of simple meta-features, like

number of observations
number of features
number of classes
majority class size
minority class size
ClassEntropy

Most others attributes are highly instable (e.g., missing values, mean skewness of numeric atts) and depend on the exact encoding how a user uploaded the dataset. Usually, class attributes don't have missing values so encoding is not really an issue. Number of Classes, majority class size and minority class size are also not perfectly stable, as they rely on a correctly specified target attribute.

amueller commented 6 years ago

I would also look at description ;)

amueller commented 6 years ago

But the question is: should we only activate once we have a better quality bot? Or activate now and later deactivate?

janvanrijn commented 6 years ago

I would also look at description ;)

These are more often than not missing (mind that many descriptions are there as an effort from us for getting the OpenML100 in order).

should we only activate once we have a better quality bot?

I don't really know what is the status of alternatives meta-feature engines. I have been hearing about alternative engines for a while, but never saw any code, questions or emails yet.

Or activate now and later deactivate?

I think the biggest unanswered question is: Based on what criterion should a dataset be (de)activated. This is something that we hardly addressed and in particular never documented.

I would propose the following definition:

duplicate: only when the meta-features are exactly the same. So having a different encoding (e.g., by human imputed missing values) could legally constitute a new version of the dataset.
activated once we know that arff parses and it is no direct duplicate
deactivated once an error is spotted in the dataset (does not parse or does not have right target / ignore attributes)

amueller commented 6 years ago

Btw, for in preparation @mfeurer's bot gives

[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] Intermediate results checking 220 datasets:
[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] Error ErrorCodes.NO_ERROR (206 datasets): 473, 1024, 1047, 1057, 1095, 1456, 4133, 40600, 40630, 40631, 40641, 40642, 40643, 40644, 40733, 40735, 40736, 40738, 40739, 40740, 40741, 40742, 40743, 40744, 40749, 40754, 40755, 40756, 40759, 40760, 40761, 40762, 40763, 40764, 40765, 40766, 40767, 40768, 40769, 40770, 40771, 40772, 40773, 40774, 40775, 40776, 40777, 40778, 40779, 40780, 40781, 40782, 40783, 40784, 40785, 40786, 40787, 40788, 40789, 40790, 40791, 40792, 40793, 40794, 40795, 40796, 40797, 40798, 40799, 40800, 40801, 40802, 40803, 40804, 40805, 40806, 40807, 40808, 40809, 40810, 40811, 40812, 40813, 40814, 40815, 40816, 40817, 40819, 40820, 40821, 40822, 40823, 40824, 40825, 40826, 40827, 40828, 40829, 40830, 40831, 40832, 40833, 40834, 40835, 40836, 40837, 40838, 40839, 40840, 40841, 40842, 40843, 40844, 40845, 40846, 40847, 40848, 40849, 40850, 40851, 40853, 40854, 40855, 40856, 40858, 40859, 40860, 40861, 40862, 40870, 40871, 40872, 40873, 40875, 40877, 40878, 40883, 40884, 40886, 40892, 40894, 40895, 40897, 40901, 40902, 40903, 40904, 40905, 40906, 40907, 40908, 40909, 40912, 40913, 40921, 40952, 40953, 40954, 40955, 40959, 40960, 40964, 40965, 40972, 40973, 40989, 40990, 40991, 40995, 41008, 41010, 41014, 41015, 41016, 41023, 41024, 41025, 41037, 41038, 41048, 41049, 41050, 41051, 41052, 41053, 41068, 41069, 41071, 41072, 41073, 41075, 41076, 41077, 41078, 41079, 41086, 41096, 41099, 41100, 41110, 41119, 41175, 41193, 41194, 41195, 41198
[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] Error ErrorCodes.DATASET_CONTAINS_STRING_FEATURES (11 datasets): 489, 4675, 4709, 40865, 40887, 40888, 40889, 40891, 40967, 40969, 41122
[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] Error ErrorCodes.OPENML_FEATURE_DESCRIPTION_ERROR (1 datasets): 1438
[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] Error ErrorCodes.OTHER_ERROR (1 datasets): 4800
[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] Error ErrorCodes.OPENML_HASH_EXCEPTION (1 datasets): 41190
[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] Warning Warnings.POTENTIAL_ID_COLUMN (66 datasets): 473, 1024, 1047, 1057, 4133, 40630, 40631, 40644, 40740, 40741, 40743, 40744, 40755, 40767, 40768, 40771, 40772, 40779, 40780, 40791, 40792, 40793, 40794, 40796, 40797, 40798, 40799, 40806, 40807, 40808, 40811, 40812, 40813, 40814, 40820, 40821, 40822, 40823, 40824, 40825, 40826, 40828, 40829, 40835, 40845, 40846, 40847, 40848, 40849, 40851, 40862, 40872, 40878, 40960, 40973, 41037, 41038, 41068, 41069, 41071, 41072, 41073, 41075, 41076, 41119, 41198
[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] Warning Warnings.POTENTIAL_DTYPE_ERROR (1 datasets): 4133
[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] Warning Warnings.POTENTIALLY_STREAM_DATA (11 datasets): 40759, 40760, 40779, 40780, 40781, 40782, 40960, 40972, 41024, 41099, 41100
[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] Warning Warnings.POTENTIALLY_ARTIFICIAL (15 datasets): 40765, 40766, 40775, 40776, 40777, 40778, 40779, 40780, 40783, 40784, 40892, 40897, 40959, 41024, 41096

These are all the "in preparation" classification datsets. This doesn't take duplicates into account and the ID columns have many false positives.

Re your definition: when should two datasets be versions of another? I feel adult and adult-census should be versions of the same data, but they are not.

janvanrijn commented 6 years ago

I feel adult and adult-census should be versions of the same data, but they are not.

I completely agree. This requires two additions: 1) ability to annotate this in the db (I would be in favor of adding this, if this turns out to be a often occurring problem. I am afraid yes) 2) a system to detect this. I am afraid that the only thing that I can think of is community based.

janvanrijn commented 6 years ago

Back to the original topic:

What to do with the datasets that are in preparation. Briefly talked with @amueller about it. He mentioned that most qsar datasets are parsable by scikit-learn. The proposal is to activate all datasets that are parsable by PHP (somewhat equivalent to scikit-learn) and weka automatically, and deactivate the other ones.

WDYT?

joaquinvanschoren commented 6 years ago

Yes, sounds good to me.

For the ones that do not pass the php test, can we store the error message? At some point I need to fix those (or not, but I need to check).

About deactivation: the semantics of this are as follows right?

active: automatically verified
in_preparation: not verified yet
deactivated: error found (with error message)

In that case, it would make sense to deactivate them. I would also be in favor of using these terms in the frontend instead of active/deactivated?

amueller commented 6 years ago

Sorry which terms instead of which terms? Does the number on the front page include deactivated datasets?

joaquinvanschoren commented 6 years ago

I prefer ‘verified’ over ‘active’. Front page currently includes all datasets (also deactivated). When we have checked the current in_progress datasets, I will update that to only the verified (active) datasets. On Tue, 30 Oct 2018 at 00:53, Andreas Mueller notifications@github.com wrote:

Sorry which terms instead of which terms? Does the number on the front page include deactivated datasets?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openml/OpenML/issues/794#issuecomment-434110463, or mute the thread https://github.com/notifications/unsubscribe-auth/ABpQVxGMZiZHu8-KPgeucwz4_-rwEh_wks5up4bggaJpZM4W8tvz .

-- Thank you, Joaquin

joaquinvanschoren commented 5 years ago

@janvanrijn: Seems that we are all on the same page. Shall we proceed?

I remember that we had a discussion about adding a status 'error' (I can't find that issue though). This may be more clear, but on the other hand using 'deactivated' with an error message is equally good.

It this is done we could also close: https://github.com/openml/OpenML/issues/680

rth commented 5 years ago

What's is the process to ask for review of a dataset labeled as "in preparation"?

Trying to use freMTPL2sev and freMTPL2freq in a scikit-learn example. Loading by ID works, but we get a warning about the dataset not being active. Or should we disable this warning in the scikit-learn OpenML fetcher?

amueller commented 5 years ago

@rth new uploads should be approved automatically. Previous uploads are basically stalled. It was decided (somewhere?) not to run the automatic approval on queued datasets. So the easiest way to get an approval for a dataset is uploading it again. This seems a strange solution to me, but I think that's what @janvanrijn recommended.

janvanrijn commented 5 years ago

@rth new uploads should be approved automatically. Previous uploads are basically stalled

Correct.

It was decided (somewhere?) not to run the automatic approval on queued datasets. So the easiest way to get an approval for a dataset is uploading it again. This seems a strange solution to me, but I think that's what @janvanrijn recommended.

That sounds weird indeed, must be some miscommunication. Can you give me the id's of the stalled datasets?

amueller commented 5 years ago

473, 1024, 1057, 1095, 1231, 1243, 1244, 1456, 1576, 1947, 4133, 4536, 4539, 6333, 6334, 6335, 6336, 23389, 23411, 23417, 23418, 23419, 23425, 23428, 23455, 23466, 23485, 23490, 23500, 23501, 23502, 23503, 23504, 23505, 23506, 23507, 23510, 23511, 35983, 36354, 40362, 40471, 40500, 40501, 40508, 40510, 40521, 40533, 40534, 40598, 40599, 40600, 40630, 40631, 40641, 40642, 40643, 40644, 40716, 40717, 40718, 40719, 40720, 40721, 40722, 40723, 40724, 40725, 40729, 40730, 40731, 40733, 40735, 40736, 40737, 40738, 40739, 40740, 40741, 40742, 40743, 40744, 40746, 40749, 40750, 40751, 40752, 40754, 40755, 40756, 40757, 40758, 40759, 40760, 40761, 40762, 40763, 40764, 40765, 40766, 40767, 40768, 40769, 40770, 40771, 40772, 40773, 40774, 40775, 40776, 40777, 40778, 40779, 40780, 40781, 40782, 40783, 40784, 40785, 40786, 40787, 40788, 40791, 40792, 40793, 40794, 40795, 40796, 40797, 40798, 40799, 40800, 40801, 40802, 40803, 40804, 40805, 40806, 40807, 40808, 40809, 40810, 40811, 40812, 40813, 40814, 40815, 40816, 40817, 40818, 40819, 40820, 40821, 40822, 40823, 40824, 40825, 40826, 40827, 40828, 40829, 40830, 40831, 40832, 40833, 40834, 40835, 40836, 40837, 40838, 40839, 40840, 40841, 40842, 40843, 40844, 40845, 40846, 40847, 40848, 40849, 40850, 40851, 40853, 40854, 40855, 40856, 40858, 40859, 40860, 40861, 40862, 40865, 40870, 40871, 40872, 40873, 40875, 40876, 40877, 40878, 40883, 40884, 40886, 40887, 40888, 40889, 40891, 40892, 40894, 40895, 40897, 40901, 40902, 40903, 40904, 40905, 40906, 40907, 40908, 40909, 40912, 40913, 40917, 40921, 40925, 40952, 40953, 40954, 40955, 40957, 40958, 40959, 40960, 40964, 40965, 40967, 40968, 40969, 40972, 40973, 40989, 40990, 40991, 40995, 41008, 41010, 41012, 41014, 41015, 41016, 41018, 41019, 41023, 41024, 41025, 41029, 41030, 41031, 41032, 41037, 41038, 41041, 41042, 41043, 41048, 41049, 41050, 41051, 41052, 41053, 41060, 41063, 41064, 41066, 41068, 41069, 41071, 41072, 41073, 41074, 41075, 41076, 41077, 41078, 41079, 41086, 41088, 41089, 41090, 41092, 41093, 41094, 41095, 41096, 41099, 41100, 41102, 41110, 41111, 41113, 41114, 41115, 41116, 41117, 41118, 41119, 41120, 41121, 41122, 41123, 41171, 41174, 41175, 41177, 41190, 41191, 41193, 41194, 41195, 41198, 41199, 41204, 41205, 41206, 41207, 41208, 41210, 41211, 41212, 41214, 41215, 41216, 41217, 41218, 41219, 41220, 41221, 41222, 41223, 41224, 41225, 41226, 41227, 41229, 41230, 41231, 41232, 41233, 41234, 41235, 41236, 41237, 41238, 41239, 41240, 41241, 41243, 41244, 41245, 41246, 41247, 41248, 41249, 41250, 41251, 41255, 41259, 41260, 41261, 41262, 41263, 41264, 41266, 41267, 41271, 41275, 41278, 41283, 41287, 41289, 41290, 41291, 41307, 41308, 41309, 41311, 41312, 41313, 41317, 41318, 41319, 41320, 41321, 41322, 41323, 41324, 41325, 41326, 41327, 41328, 41329, 41330, 41331, 41332, 41333, 41334, 41335, 41336, 41337, 41338, 41339, 41340, 41341, 41342, 41343, 41344, 41345, 41346, 41347, 41348, 41349, 41350, 41351, 41352, 41353, 41354, 41355, 41356, 41357, 41358, 41359, 41392, 41393, 41394, 41395, 41396, 41397, 41398, 41399, 41400, 41401, 41402, 41403, 41404, 41405, 41406, 41407, 41408, 41409, 41411, 41415, 41416, 41417, 41418, 41419, 41420, 41421, 41424, 41425, 41426, 41427, 41428, 41430, 41434, 41435, 41436, 41437, 41438, 41439, 41440, 41441, 41442, 41443, 41444, 41445, 41446, 41447, 41448, 41450, 41454, 41455, 41457, 41458, 41480, 41481, 41493, 41503, 41505, 41512, 41513, 41520, 41541, 41560, 41578, 41584, 41585, 41605, 41606, 41607, 41608, 41609, 41610, 41611, 41612, 41613, 41614, 41615, 41616, 41617, 41618, 41619, 41620, 41621, 41622, 41623, 41624, 41625, 41626, 41627, 41628, 41629, 41630, 41631, 41632, 41633, 41634, 41635, 41636, 41637, 41638, 41639, 41640, 41641, 41642, 41643, 41644, 41645, 41646, 41647, 41648, 41649, 41650, 41651, 41652, 41653, 41654, 41655, 41656, 41657, 41658, 41659, 41660, 41673, 41676, 41683, 41686, 41688, 41908, 41979, 41984, 41985, 41987, 42073, 42075, 42077

openml / OpenML

Most datasets in preparation? #794

476 is not a PR, it was a list of issues in active datasets. I'm not sure which script you're referring to.

476 has a little script in the comments that you could modify, e.g.: