openml / OpenML

Open Machine Learning
https://openml.org
BSD 3-Clause "New" or "Revised" License

wrong number.of.instances info provided by evaluation/list? #402

Open giuseppec opened 7 years ago

giuseppec commented 7 years ago

The run https://www.openml.org/api/v1/xml/evaluation/list/run/1845870 tells me that the task with id 3057 and the data set with id 337 were used. However, the listing reports 115 instances, although the data set with id 337 (ARFF file https://www.openml.org/data/download/52240/SPECTF.ARFF) has 349 instances. How can this happen?

Related to https://github.com/openml/openml-r/issues/299

joaquinvanschoren commented 7 years ago

"the listing shows me that the number of instances is 115"

Which listing is this? I don't think the evaluation listing returns this. If I ask for the qualities of the dataset, I get the correct number (349): https://www.openml.org/api/v1/xml/data/qualities/337

giuseppec commented 7 years ago

https://www.openml.org/api/v1/xml/evaluation/list/run/1845870 contains a field number_of_instances with the value 115:

<oml:run_id>1845870</oml:run_id>
<oml:task_id>3057</oml:task_id>
<oml:setup_id>28846</oml:setup_id>
<oml:flow_id>5434</oml:flow_id>
<oml:flow_name>mlr.classif.develpartykit.ctree(1)</oml:flow_name>
<oml:data_name>SPECTF</oml:data_name>
<oml:function>number_of_instances</oml:function>
<oml:upload_time>2016-12-13 08:51:55</oml:upload_time>
<oml:value>115</oml:value>
<oml:array_data>[33,82]</oml:array_data>

janvanrijn commented 7 years ago

It's what the evaluation engine thinks it evaluated, i.e., the size of the test set: in this case, 115 instances.
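
One way to check this reading against the fragment quoted above is to parse it and compare `value` with the entries of `array_data`. The following is only a sketch: the wrapper element `oml:evaluation` and the namespace URI are assumptions (the quoted fragment shows neither), and treating `array_data` as a breakdown of the total is a guess based on the numbers adding up.

```python
import json
import xml.etree.ElementTree as ET

# Assumption: namespace URI and wrapper element are illustrative;
# the fragment quoted in this thread shows neither.
OML_NS = {"oml": "http://openml.org/openml"}

SAMPLE = """<oml:evaluation xmlns:oml="http://openml.org/openml">
  <oml:run_id>1845870</oml:run_id>
  <oml:task_id>3057</oml:task_id>
  <oml:function>number_of_instances</oml:function>
  <oml:value>115</oml:value>
  <oml:array_data>[33,82]</oml:array_data>
</oml:evaluation>"""


def evaluated_instances(xml_text):
    """Return (total, parts) read from one evaluation element."""
    root = ET.fromstring(xml_text)
    total = int(float(root.findtext("oml:value", namespaces=OML_NS)))
    # array_data is a JSON-style list; here it is assumed to split the total.
    parts = json.loads(root.findtext("oml:array_data", namespaces=OML_NS))
    return total, parts


total, parts = evaluated_instances(SAMPLE)
print(total, parts, sum(parts) == total)
```

For the values in this thread, the two entries of `array_data` add up to the reported 115, which fits the test-set interpretation rather than the full data set size of 349.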

giuseppec commented 7 years ago

Wait. Then this is at least misleading, because users do not expect a figure that differs from the data set size, right? It also does not seem to be consistent: e.g., https://www.openml.org/api/v1/xml/evaluation/list/run/1845871 shows the number_of_instances of the full data set.

janvanrijn commented 7 years ago

Probably we should document it better.

Actually, I don't expect any bugs there, as it calls Weka code directly. Maybe the difference is holdout vs. repeated full CV?
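
To illustrate why the estimation procedure could explain the two different figures, here is a rough sketch of how many instances get evaluated per repeat under each scheme. The 33% holdout fraction and the 10 folds are assumptions chosen for illustration; the thread does not state which procedure task 3057 or the other run's task uses, and this is not the evaluation engine's code.

```python
def holdout_test_size(n_instances, test_fraction):
    """Size of a single holdout test set (rounded to whole instances)."""
    return round(n_instances * test_fraction)


def cv_fold_sizes(n_instances, folds):
    """Test-fold sizes for one repeat of k-fold cross-validation."""
    base, extra = divmod(n_instances, folds)
    # The first `extra` folds receive one additional instance.
    return [base + (1 if i < extra else 0) for i in range(folds)]


n = 349  # instances in data set 337 (SPECTF), as stated in this thread

# Hypothetical settings, not taken from the task definitions.
holdout_evaluated = holdout_test_size(n, 0.33)
cv_evaluated = sum(cv_fold_sizes(n, 10))

print(holdout_evaluated, cv_evaluated)
```

Under these assumed settings, a holdout run evaluates only the test portion, while one repeat of cross-validation evaluates every instance once, so the second figure equals the full data set size.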
