openml / OpenML

Open Machine Learning
https://openml.org
BSD 3-Clause "New" or "Revised" License

Possible test server dataset description schema error #721

Open ArlindKadra opened 6 years ago

ArlindKadra commented 6 years ago
<xs:simpleType name="casual_string1024">
  <!-- Subset on xs:string. Highly restricted form of string. URL-friendly -->
  <xs:restriction base="xs:string">
    <xs:pattern value="([a-zA-Z0-9_\-\.(),])+"/>
    <xs:maxLength value="1024"/>
    <xs:minLength value="1"/>
  </xs:restriction>
</xs:simpleType>
<xs:simpleType name="casual_string128">
  <!-- Subset on xs:string. Highly restricted form of string. URL-friendly -->
  <xs:restriction base="xs:string">
    <xs:pattern value="([a-zA-Z0-9_\-\.(),])+"/>
    <xs:maxLength value="128"/>
    <xs:minLength value="1"/>
  </xs:restriction>
</xs:simpleType>
<xs:simpleType name="casual_string64">
  <!-- Subset on xs:string. Highly restricted form of string. URL-friendly -->
  <xs:restriction base="xs:string">
    <xs:pattern value="([a-zA-Z0-9_\-\.(),])+"/>
    <xs:maxLength value="64"/>
    <xs:minLength value="1"/>
  </xs:restriction>
</xs:simpleType>

Should the ( and ) be preceded by \, since they are metacharacters used for grouping? I am getting a validation error while trying to upload a dataset from scikit-learn. For example, the license value 'BSD (from Scikit-learn)' does not pass.
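The question can be checked quickly outside the server. The sketch below uses Python's `re` module as a stand-in for the XSD pattern engine. That is an assumption: the two regex dialects differ in general, although they treat this particular pattern the same way. XSD patterns must match the whole value, so `re.fullmatch` is used. The helper name `is_casual` is made up for illustration.

```python
import re

# Pattern copied from the casual_string types quoted in the issue.
CASUAL = re.compile(r"([a-zA-Z0-9_\-\.(),])+")

def is_casual(value: str) -> bool:
    # XSD patterns are implicitly anchored at both ends, hence fullmatch.
    return CASUAL.fullmatch(value) is not None

# Inside a character class, ( and ) are literals, so unescaped parens are accepted.
print(is_casual("BSD(from_Scikit-learn)"))   # True

# The reported license value is rejected.
print(is_casual("BSD (from Scikit-learn)"))  # False

# List the distinct characters of the reported value that fall outside the class.
rejected = sorted({c for c in "BSD (from Scikit-learn)" if not is_casual(c)})
print(rejected)                              # [' ']
```

Under that assumption, the parentheses are not the cause: the only characters the pattern rejects in the reported value are the spaces.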

ArlindKadra commented 6 years ago

@janvanrijn

janvanrijn commented 6 years ago

Thanks for reporting.

I updated several of the XSD schemas, including the one you mentioned. It should be fixed now. Let me know if any other problems pop up.

Fast-forward
 .../pages/api_new/v1/xsd/openml.data.features.xsd  | 20 ++++-
 .../pages/api_new/v1/xsd/openml.data.qualities.xsd |  9 ++-
 .../pages/api_new/v1/xsd/openml.data.upload.xsd    | 17 +++--
 .../v1/xsd/openml.implementation.upload.xsd        | 30 ++++++--
 .../pages/api_new/v1/xsd/openml.run.trace.xsd      |  9 ++-
 .../pages/api_new/v1/xsd/openml.run.upload.xsd     |  2 +-
 .../api_new/v1/xsd/openml.task.types.search.xsd    | 87 ++++++++++++----------
ArlindKadra commented 6 years ago

Hey @janvanrijn, unfortunately it is still not working for the dataset that I am trying to upload. https://github.com/openml/OpenML/blob/7a1e4cfb96d58c5d20b4438e1c1102024dfd442b/openml_OS/views/pages/api_new/v1/xsd/openml.data.upload.xsd#L24

I think the license element validation is failing because the value contains spaces. So maybe we should add \s in the regex pattern.
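A minimal sketch of that proposal, assuming Python's `re` approximates the XSD engine for this pattern. The modified pattern is hypothetical, not the one that ended up in the schema, and `accepts` is an illustrative helper name.

```python
import re

# Hypothetical variant of the casual_string pattern with \s added to the class.
WITH_SPACE = re.compile(r"([a-zA-Z0-9_\-\.(),\s])+")

def accepts(value: str) -> bool:
    # fullmatch mirrors the implicit anchoring of XSD patterns.
    return WITH_SPACE.fullmatch(value) is not None

print(accepts("BSD (from Scikit-learn)"))  # True
print(accepts("BSD\t(tabbed)"))            # True: \s also admits tabs and newlines
```

Note that `\s` admits tabs and line breaks as well as spaces. If only the plain space should be allowed, adding a literal space to the character class would be the narrower change.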

Also the description element validation is failing. https://github.com/openml/OpenML/blob/7a1e4cfb96d58c5d20b4438e1c1102024dfd442b/openml_OS/views/pages/api_new/v1/xsd/openml.data.upload.xsd#L17

The description contains these characters: =, :, -, ^, /, ". Maybe we should use a different encoding?
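One way to probe the encoding question is to check whether the listed characters are plain ASCII. The sketch below does only that. It says nothing about which type the description element actually uses in the schema, and the variable names are illustrative.

```python
# Characters listed in the comment above.
chars = '=:-^/"'

# All of them are 7-bit ASCII code points.
all_ascii = all(ord(c) < 128 for c in chars)
print(all_ascii)  # True

# ASCII characters encode to identical bytes in UTF-8 and Latin-1.
same_bytes = chars.encode("utf-8") == chars.encode("latin-1")
print(same_bytes)  # True
```

Because the bytes are identical in both encodings, switching the encoding alone would probably not change the validation result for these characters. This suggests, without proving it, that the restriction comes from the pattern on the element's type.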

janvanrijn commented 6 years ago

We can make the license field basic Latin 64. Do you think the description problem can be fixed with a different encoding?

ArlindKadra commented 6 years ago

For the license field, we can do whatever you think is best. For the description problem, on second thought, I have to look into it more, as the characters might already be contained in the set.

janvanrijn commented 6 years ago

Should be better now?

ArlindKadra commented 6 years ago

Hey @janvanrijn, the license element is OK now. However, the description is failing because it is longer than the maximum length of 1024. The description contains 5023 characters, many of them whitespace; the number of non-whitespace characters is 3504. How should we deal with this one? Should we just increase the limit? P.S. It's the BreastCancer dataset from scikit-learn.
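Counts like these can be reproduced with a few lines of Python. The function `length_report` and the sample string are hypothetical; the real description text is not part of this thread.

```python
def length_report(text: str, max_length: int = 1024) -> dict:
    # Count every character, then only those that are not whitespace.
    non_ws = sum(1 for ch in text if not ch.isspace())
    return {
        "total": len(text),
        "non_whitespace": non_ws,
        # maxLength on an xs:string-based type counts all characters.
        "fits": len(text) <= max_length,
    }

sample = "Breast  Cancer\n\tdataset"  # stand-in for the real description
report = length_report(sample)
print(report)
```

For a type based on `xs:string`, whitespace is preserved and counts toward `maxLength`. With the figures reported above, the description would exceed 1024 even if all whitespace were stripped, so trimming alone would not help.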

janvanrijn commented 6 years ago

I extended it.