amueller opened this issue 6 years ago
Indeed 'active' means verified in the sense that it was checked for a few things, e.g. whether it could be parsed. This is meant to be automated, but it's still manual right now.
Last week Jan implemented an API call that allows users to activate their own datasets. We could put this online now. The question is whether we want to implement some simple tests in the backend before the dataset is actually activated. @janvanrijn: how do you see this?
Checking whether it could be parsed can easily be automated. So if the status of the 17k datasets is "we haven't tried parsing them" then we really should not advertise having 20k datasets and I feel it's questionable to write this in a grant or paper.
So we should settle on the semantics and what the default behavior of the API and website are. Just listing all datasets by default would make a very big difference in user experience, both for the uploader and the user. But that clearly doesn't make sense if we're not even sure we can parse the data.
We could even make all the currently active datasets "verified" and then do some more rudimentary testing (like parsing) to make all the rest active.
I don't think this is a question of implementing API calls. It's a question of what we want the semantics to be and what we want the user experience to be, and then clearly communicating it. Implementing it is the easy part.
Question: does the openml fetcher only fetch active datasets? If it fetches any dataset given the ID, then users can really access all of them; we just chose to list only active datasets in the frontend by default (I believe @amueller asked for this :)). You can switch off that filter to see all datasets in the frontend as well.
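For reference, the Python API exposes the same filter explicitly. A minimal sketch, assuming the `status` values accepted by openml-python (`'active'` as the default and `'all'` to lift the filter are assumptions worth double-checking against the docs):

```python
import openml

# By default only active datasets are listed (assumed default); passing a
# status widens the view, mirroring the frontend filter described above.
active = openml.datasets.list_datasets()
in_prep = openml.datasets.list_datasets(status='in_preparation')
everything = openml.datasets.list_datasets(status='all')
```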
Also about 15,000 of those are drug discovery datasets. IIRC, Matthias asked to not make them active yet because some of them have classes with zero instances, which was an issue at the time. Not sure if this is still the case? If this is no longer a problem, I could activate all of them. Otherwise I first have to upload new versions for a subset of them, which may take me a while.
That doesn't clarify what "active" means, though. And yes, I'm all for not listing datasets that we cannot parse.
@amueller:
To move things ahead, and automate activation (which would make me really happy :)), how about the following strategy:
Current work in writing this script is here: https://github.com/openml/OpenML/issues/476
It already has a simple script that lists datasets and tries to parse them using the Python API. If the Python API is extended to also support the /data/status API, we can easily use it to activate all other good datasets.
The semantics here are defined as: 'Can I load it with the Python API?' Ideally we make this more general and also do a check with the R API, but one thing at a time.
Typical errors found this way (that would block activation):
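To make the activation step concrete, here is a minimal sketch of what such a loop could look like. The endpoint path and parameter names are assumptions based on the /data/status API mentioned above, not a confirmed interface:

```python
import requests

import openml

API_BASE = "https://www.openml.org/api/v1/xml"  # assumed base URL
API_KEY = "YOUR_OPENML_API_KEY"

def activate_dataset(dataset_id):
    # Hypothetical call to the /data/status endpoint discussed above;
    # check the server docs for the real path and parameters.
    response = requests.post(
        API_BASE + "/data/status/update",
        data={"data_id": dataset_id, "status": "active", "api_key": API_KEY},
    )
    response.raise_for_status()

# Activate every 'in_preparation' dataset that the Python API can load.
for did in openml.datasets.list_datasets(status='in_preparation'):
    try:
        openml.datasets.get_dataset(did)  # the parse check
    except Exception:
        continue  # leave unparsable datasets alone
    activate_dataset(did)
```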
Sounds great. Though @janvanrijn just said most of the non-active datasets contain invalid ARFF?
(sorry misclick)
I have to check, but indeed those QSAR datasets have some issues, see: https://github.com/openml/openml-python/issues/310
I'd like to check how many datasets (in general) are valid and not yet active.
```python
import sys

from openml import datasets

# Page through all datasets that are still 'in_preparation'.
offset = 0
size = 10000
dids = []
while size == 10000:
    res = datasets.list_datasets(offset=offset, size=size, status='in_preparation')
    dids.extend(res.keys())
    size = len(res)
    offset += len(res)

# Try to download and parse each dataset; record the ones that fail.
openml_datasets = []
error_ids = []
for dataset_id in dids:
    try:
        openml_datasets.append(datasets.get_dataset(dataset_id))
        print(dataset_id, end=' ')
    except Exception:
        print("Unexpected error on dataset", dataset_id, ":", sys.exc_info()[0])
        error_ids.append(dataset_id)
```
This script seems to be happy with many datasets that are currently inactive. However, this is quite spammy, will try to come up with something better.
We made a recent change in the arff parser which allows parsing such files. Not sure if that's good, though...
You can also check out the scripts in https://github.com/openml/openml-serverdata-quality-bot which also try to download all datasets in order to check them.
@mfeurer are we doing anything with that bot / results yet? This seems great!
Nope, they only provide output that has to be manually interpreted.
These datasets are not parsable in Python: [1438, 4800, 41190]
These are all parsable in Python:
[473, 489, 1024, 1047, 1057, 1095, 1456, 4133, 4675, 4709, 40600, 40630, 40631, 40641, 40642, 40643, 40644, 40733, 40735, 40736, 40738, 40739, 40740, 40741, 40742, 40743, 40744, 40749, 40754, 40755, 40756, 40759, 40760, 40761, 40762, 40763, 40764, 40765, 40766, 40767, 40768, 40769, 40770, 40771, 40772, 40773, 40774, 40775, 40776, 40777, 40778, 40779, 40780, 40781, 40782, 40783, 40784, 40785, 40786, 40787, 40788, 40789, 40790, 40791, 40792, 40793, 40794, 40795, 40796, 40797, 40798, 40799, 40800, 40801, 40802, 40803, 40804, 40805, 40806, 40807, 40808, 40809, 40810, 40811, 40812, 40813, 40814, 40815, 40816, 40817, 40819, 40820, 40821, 40822, 40823, 40824, 40825, 40826, 40827, 40828, 40829, 40830, 40831, 40832, 40833, 40834, 40835, 40836, 40837, 40838, 40839, 40840, 40841, 40842, 40843, 40844, 40845, 40846, 40847, 40848, 40849, 40850, 40851, 40853, 40854, 40855, 40856, 40858, 40859, 40860, 40861, 40862, 40865, 40870, 40871, 40872, 40873, 40875, 40877, 40878, 40883, 40884, 40886, 40887, 40888, 40889, 40891, 40892, 40894, 40895, 40897, 40901, 40902, 40903, 40904, 40905, 40906, 40907, 40908, 40909, 40912, 40913, 40921, 40952, 40953, 40954, 40955, 40959, 40960, 40964, 40965, 40967, 40969, 40972, 40973, 40989, 40990, 40991, 40995, 41008, 41010, 41014, 41015, 41016, 41023, 41024, 41025, 41037, 41038, 41048, 41049, 41050, 41051, 41052, 41053, 41068, 41069, 41071, 41072, 41073, 41075, 41076, 41077, 41078, 41079, 41086, 41096, 41099, 41100, 41110, 41119, 41122, 41175, 41193, 41194, 41195, 41198]
Some of them are clearly iris: https://www.openml.org/d/40954, https://www.openml.org/d/40789. So activating them automatically seems like a bad idea.
In many cases this is impossible to spot for humans. I mean, iris can be spotted easily, but other datasets are a lot tougher. This especially is something that should (or at least could) be automated.
yes
How would you test this, though? Compare the shape with all known datasets? The name? The feature names?
A good improvement over the current state would be a set of simple meta-features, like:
Most other attributes are highly unstable (e.g., missing values, mean skewness of numeric attributes) and depend on exactly how a user encoded the dataset on upload. Usually class attributes don't have missing values, so encoding is not really an issue there. Number of classes, majority class size, and minority class size are also not perfectly stable, as they rely on a correctly specified target attribute.
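As a rough sketch of how stable meta-features could be turned into an automated duplicate screen (the fingerprint chosen here is an illustrative assumption, not the exact list referred to above, and it assumes `list_datasets` returns a dict of quality dicts as in older openml-python versions):

```python
from collections import defaultdict

import openml

# Group datasets by a fingerprint of (assumed) stable meta-features.
groups = defaultdict(list)
for did, meta in openml.datasets.list_datasets(status='all').items():
    key = (
        meta.get("NumberOfInstances"),
        meta.get("NumberOfFeatures"),
        meta.get("NumberOfClasses"),
    )
    groups[key].append(did)

# Any fingerprint shared by several IDs is a candidate duplicate set
# that a human (or the quality bot) could review.
candidates = {key: dids for key, dids in groups.items() if len(dids) > 1}
```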
I would also look at description ;)
But the question is: should we only activate once we have a better quality bot? Or activate now and later deactivate?
> I would also look at description ;)
These are more often than not missing (mind that many of the existing descriptions are there because of our effort to get the OpenML100 in order).
> should we only activate once we have a better quality bot?
I don't really know what the status of alternative meta-feature engines is. I have been hearing about alternative engines for a while, but have never seen any code, questions, or emails yet.
> Or activate now and later deactivate?
I think the biggest unanswered question is: based on what criterion should a dataset be (de)activated? This is something that we have hardly addressed and in particular never documented.
I would propose the following definition:
Btw, for the 'in preparation' datasets @mfeurer's bot gives:
```
[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] Intermediate results checking 220 datasets:
[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] Error ErrorCodes.NO_ERROR (206 datasets): 473, 1024, 1047, 1057, 1095, 1456, 4133, 40600, 40630, 40631, 40641, 40642, 40643, 40644, 40733, 40735, 40736, 40738, 40739, 40740, 40741, 40742, 40743, 40744, 40749, 40754, 40755, 40756, 40759, 40760, 40761, 40762, 40763, 40764, 40765, 40766, 40767, 40768, 40769, 40770, 40771, 40772, 40773, 40774, 40775, 40776, 40777, 40778, 40779, 40780, 40781, 40782, 40783, 40784, 40785, 40786, 40787, 40788, 40789, 40790, 40791, 40792, 40793, 40794, 40795, 40796, 40797, 40798, 40799, 40800, 40801, 40802, 40803, 40804, 40805, 40806, 40807, 40808, 40809, 40810, 40811, 40812, 40813, 40814, 40815, 40816, 40817, 40819, 40820, 40821, 40822, 40823, 40824, 40825, 40826, 40827, 40828, 40829, 40830, 40831, 40832, 40833, 40834, 40835, 40836, 40837, 40838, 40839, 40840, 40841, 40842, 40843, 40844, 40845, 40846, 40847, 40848, 40849, 40850, 40851, 40853, 40854, 40855, 40856, 40858, 40859, 40860, 40861, 40862, 40870, 40871, 40872, 40873, 40875, 40877, 40878, 40883, 40884, 40886, 40892, 40894, 40895, 40897, 40901, 40902, 40903, 40904, 40905, 40906, 40907, 40908, 40909, 40912, 40913, 40921, 40952, 40953, 40954, 40955, 40959, 40960, 40964, 40965, 40972, 40973, 40989, 40990, 40991, 40995, 41008, 41010, 41014, 41015, 41016, 41023, 41024, 41025, 41037, 41038, 41048, 41049, 41050, 41051, 41052, 41053, 41068, 41069, 41071, 41072, 41073, 41075, 41076, 41077, 41078, 41079, 41086, 41096, 41099, 41100, 41110, 41119, 41175, 41193, 41194, 41195, 41198
[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] Error ErrorCodes.DATASET_CONTAINS_STRING_FEATURES (11 datasets): 489, 4675, 4709, 40865, 40887, 40888, 40889, 40891, 40967, 40969, 41122
[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] Error ErrorCodes.OPENML_FEATURE_DESCRIPTION_ERROR (1 datasets): 1438
[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] Error ErrorCodes.OTHER_ERROR (1 datasets): 4800
[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] Error ErrorCodes.OPENML_HASH_EXCEPTION (1 datasets): 41190
[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] Warning Warnings.POTENTIAL_ID_COLUMN (66 datasets): 473, 1024, 1047, 1057, 4133, 40630, 40631, 40644, 40740, 40741, 40743, 40744, 40755, 40767, 40768, 40771, 40772, 40779, 40780, 40791, 40792, 40793, 40794, 40796, 40797, 40798, 40799, 40806, 40807, 40808, 40811, 40812, 40813, 40814, 40820, 40821, 40822, 40823, 40824, 40825, 40826, 40828, 40829, 40835, 40845, 40846, 40847, 40848, 40849, 40851, 40862, 40872, 40878, 40960, 40973, 41037, 41038, 41068, 41069, 41071, 41072, 41073, 41075, 41076, 41119, 41198
[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] Warning Warnings.POTENTIAL_DTYPE_ERROR (1 datasets): 4133
[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] Warning Warnings.POTENTIALLY_STREAM_DATA (11 datasets): 40759, 40760, 40779, 40780, 40781, 40782, 40960, 40972, 41024, 41099, 41100
[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] Warning Warnings.POTENTIALLY_ARTIFICIAL (15 datasets): 40765, 40766, 40775, 40776, 40777, 40778, 40779, 40780, 40783, 40784, 40892, 40897, 40959, 41024, 41096
```
These are all the "in preparation" classification datasets. This doesn't take duplicates into account, and the ID-column check has many false positives.
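If we ever want to feed the bot output into an activation script rather than interpret it manually, extracting the NO_ERROR IDs is straightforward. A small sketch (the log line is abbreviated here for the example):

```python
import re

# The NO_ERROR line from the bot log above, abbreviated for the example.
log_line = (
    "[INFO] [14:09:18:openml_server_quality_bot.datasets.datasets] "
    "Error ErrorCodes.NO_ERROR (206 datasets): 473, 1024, 1047"
)

# Everything after the final colon is the comma-separated ID list.
ok_ids = [int(x) for x in re.findall(r"\d+", log_line.rsplit(":", 1)[-1])]
print(ok_ids)  # [473, 1024, 1047]
```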
Re your definition: when should two datasets be versions of one another? I feel `adult` and `adult-census` should be versions of the same data, but they are not.
> I feel `adult` and `adult-census` should be versions of the same data, but they are not.
I completely agree. This requires two additions: 1) the ability to annotate this in the db (I would be in favor of adding this if it turns out to be an often-occurring problem; I am afraid it is), and 2) a system to detect this. I am afraid the only thing I can think of is community-based detection.
Back to the original topic:
What to do with the datasets that are in preparation: I briefly talked with @amueller about it. He mentioned that most QSAR datasets are parsable by scikit-learn. The proposal is to automatically activate all datasets that are parsable by PHP (roughly equivalent to scikit-learn) and by Weka, and deactivate the other ones.
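As pseudocode, the rule would be something like the sketch below; both parser checks are hypothetical stand-ins for the actual server-side PHP and Weka checks:

```python
def parses_with_php(dataset):
    """Hypothetical stand-in for the server-side PHP parse check."""
    raise NotImplementedError

def parses_with_weka(dataset):
    """Hypothetical stand-in for the Weka parse check."""
    raise NotImplementedError

def decide_status(dataset):
    # Activate only if both reference parsers accept the file;
    # otherwise deactivate (ideally storing the parser's error message).
    if parses_with_php(dataset) and parses_with_weka(dataset):
        return "active"
    return "deactivated"
```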
WDYT?
Yes, sounds good to me.
For the ones that do not pass the PHP test, can we store the error message? At some point I need to fix those (or not, but I need to check).
About deactivation: the semantics of this are as follows, right?
In that case, it would make sense to deactivate them. I would also be in favor of using these terms in the frontend instead of active/deactivated?
Sorry which terms instead of which terms? Does the number on the front page include deactivated datasets?
I prefer 'verified' over 'active'. The front page currently includes all datasets (also deactivated ones). When we have checked the current in_preparation datasets, I will update it to count only the verified (active) datasets.
@janvanrijn: Seems that we are all on the same page. Shall we proceed?
I remember that we had a discussion about adding a status 'error' (I can't find that issue, though). This may be clearer, but on the other hand using 'deactivated' together with an error message is equally good.
If this is done, we could also close: https://github.com/openml/OpenML/issues/680
What is the process to ask for review of a dataset labeled as "in preparation"?
Trying to use freMTPL2sev and freMTPL2freq in a scikit-learn example. Loading by ID works, but we get a warning about the dataset not being active. Or should we disable this warning in the scikit-learn OpenML fetcher?
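For context, a minimal sketch of the scikit-learn side. The data IDs used here (41214 for freMTPL2freq, 41215 for freMTPL2sev) are assumptions, though both do appear in the stalled-ID list given below:

```python
from sklearn.datasets import fetch_openml

# Loading by ID works even though the datasets are not 'active', but
# scikit-learn emits a warning about the dataset's status.
freq = fetch_openml(data_id=41214)  # freMTPL2freq (ID assumed)
sev = fetch_openml(data_id=41215)   # freMTPL2sev (ID assumed)
```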
@rth new uploads should be approved automatically. Previous uploads are basically stalled. It was decided (somewhere?) not to run the automatic approval on queued datasets. So the easiest way to get an approval for a dataset is uploading it again. This seems a strange solution to me, but I think that's what @janvanrijn recommended.
> @rth new uploads should be approved automatically. Previous uploads are basically stalled.
Correct.
> It was decided (somewhere?) not to run the automatic approval on queued datasets. So the easiest way to get an approval for a dataset is uploading it again. This seems a strange solution to me, but I think that's what @janvanrijn recommended.
That sounds weird indeed, must be some miscommunication. Can you give me the IDs of the stalled datasets?
473, 1024, 1057, 1095, 1231, 1243, 1244, 1456, 1576, 1947, 4133, 4536, 4539, 6333, 6334, 6335, 6336, 23389, 23411, 23417, 23418, 23419, 23425, 23428, 23455, 23466, 23485, 23490, 23500, 23501, 23502, 23503, 23504, 23505, 23506, 23507, 23510, 23511, 35983, 36354, 40362, 40471, 40500, 40501, 40508, 40510, 40521, 40533, 40534, 40598, 40599, 40600, 40630, 40631, 40641, 40642, 40643, 40644, 40716, 40717, 40718, 40719, 40720, 40721, 40722, 40723, 40724, 40725, 40729, 40730, 40731, 40733, 40735, 40736, 40737, 40738, 40739, 40740, 40741, 40742, 40743, 40744, 40746, 40749, 40750, 40751, 40752, 40754, 40755, 40756, 40757, 40758, 40759, 40760, 40761, 40762, 40763, 40764, 40765, 40766, 40767, 40768, 40769, 40770, 40771, 40772, 40773, 40774, 40775, 40776, 40777, 40778, 40779, 40780, 40781, 40782, 40783, 40784, 40785, 40786, 40787, 40788, 40791, 40792, 40793, 40794, 40795, 40796, 40797, 40798, 40799, 40800, 40801, 40802, 40803, 40804, 40805, 40806, 40807, 40808, 40809, 40810, 40811, 40812, 40813, 40814, 40815, 40816, 40817, 40818, 40819, 40820, 40821, 40822, 40823, 40824, 40825, 40826, 40827, 40828, 40829, 40830, 40831, 40832, 40833, 40834, 40835, 40836, 40837, 40838, 40839, 40840, 40841, 40842, 40843, 40844, 40845, 40846, 40847, 40848, 40849, 40850, 40851, 40853, 40854, 40855, 40856, 40858, 40859, 40860, 40861, 40862, 40865, 40870, 40871, 40872, 40873, 40875, 40876, 40877, 40878, 40883, 40884, 40886, 40887, 40888, 40889, 40891, 40892, 40894, 40895, 40897, 40901, 40902, 40903, 40904, 40905, 40906, 40907, 40908, 40909, 40912, 40913, 40917, 40921, 40925, 40952, 40953, 40954, 40955, 40957, 40958, 40959, 40960, 40964, 40965, 40967, 40968, 40969, 40972, 40973, 40989, 40990, 40991, 40995, 41008, 41010, 41012, 41014, 41015, 41016, 41018, 41019, 41023, 41024, 41025, 41029, 41030, 41031, 41032, 41037, 41038, 41041, 41042, 41043, 41048, 41049, 41050, 41051, 41052, 41053, 41060, 41063, 41064, 41066, 41068, 41069, 41071, 41072, 41073, 41074, 41075, 41076, 41077, 41078, 41079, 41086, 41088, 41089, 41090, 41092, 41093, 41094, 41095, 41096, 41099, 41100, 41102, 41110, 41111, 41113, 41114, 41115, 41116, 41117, 41118, 41119, 41120, 41121, 41122, 41123, 41171, 41174, 41175, 41177, 41190, 41191, 41193, 41194, 41195, 41198, 41199, 41204, 41205, 41206, 41207, 41208, 41210, 41211, 41212, 41214, 41215, 41216, 41217, 41218, 41219, 41220, 41221, 41222, 41223, 41224, 41225, 41226, 41227, 41229, 41230, 41231, 41232, 41233, 41234, 41235, 41236, 41237, 41238, 41239, 41240, 41241, 41243, 41244, 41245, 41246, 41247, 41248, 41249, 41250, 41251, 41255, 41259, 41260, 41261, 41262, 41263, 41264, 41266, 41267, 41271, 41275, 41278, 41283, 41287, 41289, 41290, 41291, 41307, 41308, 41309, 41311, 41312, 41313, 41317, 41318, 41319, 41320, 41321, 41322, 41323, 41324, 41325, 41326, 41327, 41328, 41329, 41330, 41331, 41332, 41333, 41334, 41335, 41336, 41337, 41338, 41339, 41340, 41341, 41342, 41343, 41344, 41345, 41346, 41347, 41348, 41349, 41350, 41351, 41352, 41353, 41354, 41355, 41356, 41357, 41358, 41359, 41392, 41393, 41394, 41395, 41396, 41397, 41398, 41399, 41400, 41401, 41402, 41403, 41404, 41405, 41406, 41407, 41408, 41409, 41411, 41415, 41416, 41417, 41418, 41419, 41420, 41421, 41424, 41425, 41426, 41427, 41428, 41430, 41434, 41435, 41436, 41437, 41438, 41439, 41440, 41441, 41442, 41443, 41444, 41445, 41446, 41447, 41448, 41450, 41454, 41455, 41457, 41458, 41480, 41481, 41493, 41503, 41505, 41512, 41513, 41520, 41541, 41560, 41578, 41584, 41585, 41605, 41606, 41607, 41608, 41609, 41610, 41611, 41612, 41613, 41614, 41615, 41616, 41617, 
41618, 41619, 41620, 41621, 41622, 41623, 41624, 41625, 41626, 41627, 41628, 41629, 41630, 41631, 41632, 41633, 41634, 41635, 41636, 41637, 41638, 41639, 41640, 41641, 41642, 41643, 41644, 41645, 41646, 41647, 41648, 41649, 41650, 41651, 41652, 41653, 41654, 41655, 41656, 41657, 41658, 41659, 41660, 41673, 41676, 41683, 41686, 41688, 41908, 41979, 41984, 41985, 41987, 42073, 42075, 42077
I just realized that of the 20,000 datasets we advertise, only about 2,500 are available; the rest are in preparation. It's unclear what that means, given that they are mostly months- or years-old uploads.
If the semantics of "available" is "verified by a human (@joaquinvanschoren)", we should call it verified and also list unverified datasets by default.
This also seems to suggest that most people who uploaded something to OpenML never saw their dataset show up on the site, which is not great.
So we should do at least one of two things: a) activate or decline most datasets, or b) rephrase/rename what "active" means.
Right now, saying that OpenML hosts 20,000 datasets feels like stretching the truth (even though all the functionality might be available for "in preparation" datasets; I'm not sure).
cc @berndbischl @janvanrijn @joaquinvanschoren @mfeurer