openml / OpenML

Open Machine Learning
https://openml.org
BSD 3-Clause "New" or "Revised" License
664 stars, 90 forks

Q: valid symbols for dataset names #522

Open amueller opened 6 years ago

amueller commented 6 years ago

What are the requirements for dataset names? And is that documented somewhere?

janvanrijn commented 6 years ago

For all entities, the rules for uploading are specified in the XSD documents: https://github.com/openml/website/tree/master/openml_OS/views/pages/api_new/v1/xsd

The file names follow the old naming convention; the one you are looking for is openml.data.upload.xsd.

amueller commented 6 years ago

So _\-\.(), are allowed? Are they tested? And is there a link in the docs to the XSD files? ;)

janvanrijn commented 6 years ago

Note that the backslashes are escape characters; they cannot occur in names. For XSD checking we use an external library, and I assume this particular function is pretty reliable.

It could be that there are some legacy datasets that adhere to an older specification of the XSD schema.

Not linked in the docs (yet). Let's keep this issue open until that is done.
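To make the point about escape characters concrete, here is a minimal Python sketch of a client-side name check. It assumes the allowed set is ASCII alphanumerics plus the symbols quoted above (`_ - . ( ) ,`); the authoritative pattern is whatever openml.data.upload.xsd specifies, which may differ. The names `ASSUMED_NAME_PATTERN` and `is_valid_dataset_name` are hypothetical and not part of any OpenML API.

```python
import re

# Assumption: ASCII alphanumerics plus the symbols quoted in this thread.
# Inside the character class, "\-", "\.", "\(" and "\)" are escapes for the
# literal characters; the backslash itself is NOT part of the allowed set.
ASSUMED_NAME_PATTERN = re.compile(r"[A-Za-z0-9_\-\.\(\),]+")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if `name` is non-empty and uses only the assumed allowed characters."""
    return ASSUMED_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_dataset_name("iris_(v1.0)"))  # only assumed-allowed symbols
print(is_valid_dataset_name(r"iris\v1"))     # contains a literal backslash
```

Using `fullmatch` rather than `search` matters here: the whole name has to consist of allowed characters, not just some part of it.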

amueller commented 6 years ago

The question was less about whether the XSD is valid than whether the website and APIs would be able to handle other symbols.

janvanrijn commented 6 years ago

I think I don't really get your question. Can you rephrase?

In my personal opinion, restricting these fields as much as is practically possible is good practice. The APIs should (in theory) be able to handle a wider set of symbols. Is there any character (or set of characters) that you need in particular?

amueller commented 6 years ago

No, I'm just wondering whether it makes sense to support special characters in the sklearn openml loader.

For tags, for example, it was possible to create tags with apostrophes in them, but not to delete them. I could imagine similar edge cases happening with names. I agree it would be good to restrict them. So far I haven't found any datasets that have any of these symbols in their name, so my question was whether in practice we expect the API / interfaces to handle them.

janvanrijn commented 6 years ago

The possibility to add special characters in tags is a bug (my bad). This is the only function that I am aware of that does not rely on XML (and thus also not on XSD) schemas, which is an additional source of potential errors.

I will make a pass over all API functions that actually insert something in the database and see how they handle input, to make sure that there is not another issue like this one.

(Potentially related to this issue: https://github.com/openml/openml-python/issues/378?)

From a developer's POV, the ideal situation would be that the sklearn loader writes an error into this issue tracker when an unexpected symbol is found.

amueller commented 6 years ago

OK, but "unexpected" means outside of that schema, not merely non-alphanumeric. Alright.

janvanrijn commented 6 years ago

> unexpected is outside of that schema, not alphanumeric.

Exactly.
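
The distinction agreed on above ("unexpected" means outside the schema, not "non-alphanumeric") could look roughly like this on the loader side. This is only a sketch under stated assumptions: the character set is the one quoted earlier in this thread rather than the pattern read from the XSD, the tracker URL is inferred from the repository name, and `unexpected_symbols` and `check_dataset_name` are hypothetical helpers, not functions of scikit-learn or openml-python. It emits a warning that points to the tracker instead of posting to it automatically.

```python
import re
import warnings
from typing import List

# Assumption: the set quoted in this thread, not the pattern from the XSD itself.
ASSUMED_ALLOWED = re.compile(r"[A-Za-z0-9_\-\.\(\),]")
# Assumption: tracker URL inferred from the repository name openml/OpenML.
ISSUE_TRACKER = "https://github.com/openml/OpenML/issues"


def unexpected_symbols(name: str) -> List[str]:
    """Return distinct characters outside the assumed set, in order of first appearance."""
    found: List[str] = []
    for ch in name:
        if ASSUMED_ALLOWED.fullmatch(ch) is None and ch not in found:
            found.append(ch)
    return found


def check_dataset_name(name: str) -> None:
    """Warn (rather than fail) when a name contains symbols outside the assumed set."""
    bad = unexpected_symbols(name)
    if bad:
        warnings.warn(
            f"Dataset name {name!r} contains unexpected symbols {bad}; "
            f"please report this at {ISSUE_TRACKER}",
            UserWarning,
        )


check_dataset_name("a_b-(1.0)")  # non-alphanumeric symbols, but inside the set: silent
check_dataset_name("it's")       # apostrophe is outside the set: emits a UserWarning
```

Note that under this check a letter such as "é" is flagged even though it is alphanumeric, while "_" passes even though it is not, which is exactly the difference between "outside the schema" and "not alphanumeric".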