Open amueller opened 6 years ago
For all entities, the rules for uploading are specified in the XSD documents: https://github.com/openml/website/tree/master/openml_OS/views/pages/api_new/v1/xsd
These follow the old naming convention, but you are looking for openml.data.upload.xsd
So _\-\.(), are allowed? Are they tested? And is there a link in the docs to the XSD files? ;)
Note that the backslashes are escape symbols; they cannot occur in names. For XSD checking we use an external library, so I assume this particular function to be pretty reliable.
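To illustrate the escaping point, here is a minimal Python sketch. It assumes the name restriction is a character class of alphanumerics plus the symbols quoted above; the actual pattern in openml.data.upload.xsd may differ, and `NAME_PATTERN` and `is_valid_name` are hypothetical names used only for this example.

```python
import re

# ASSUMPTION: this character class is modeled on the symbols quoted in this
# thread (alphanumerics plus _ - . ( ) ,). It is not copied from the XSD.
# The backslashes only escape characters inside the class; a literal
# backslash is therefore NOT an allowed character.
NAME_PATTERN = re.compile(r"[a-zA-Z0-9_\-\.\(\),]+")


def is_valid_name(name: str) -> bool:
    """Return True if every character of `name` is in the assumed class."""
    return NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["iris_v1.0", "a-b(c),d", "my\\data", "bob's data"]:
        print(repr(candidate), is_valid_name(candidate))
```

Under this assumed pattern, a name containing a literal backslash or an apostrophe is rejected, while the listed punctuation is accepted.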
It could be that there are some legacy datasets that adhere to an older specification of the XSD schema.
Not linked in the docs (yet). Let's keep this issue open until that is done.
The question was less about whether the XSD is valid than whether the website and APIs would be able to handle other symbols.
I think I don't really get your question. Can you rephrase?
In my personal opinion, restricting these fields as much as is practically possible is good practice. The APIs should (in theory) be able to handle a wider set of symbols. Is there any (set of) character(s) that you need in particular?
No, I'm just wondering whether it makes sense to support special characters in the sklearn openml loader.
For the tags for example, it was possible to create tags with apostrophes in them, but not to delete them. I could imagine similar edge cases happen with names. I agree it would be good to restrict them. Right now I didn't find any datasets that had any of these symbols in their name, so my question was whether in practice we expect the API / interfaces to handle these.
The possibility to add special characters in tags is a bug (my bad..). This is the only function that I am aware of that does not rely on XML (and thus also not on XSD) schemas, which is an additional source of potential errors.
I will make a pass over all api functions that actually insert something in the db and see how they handle input, to make sure that there is not another of these issues.
(Potentially related to this issue: https://github.com/openml/openml-python/issues/378)?
From a developer's POV, the ideal situation would be that the sklearn loader writes an error into this issue tracker when an unexpected symbol is found.
ok, but unexpected is outside of that schema, not alphanumeric. Alright.
unexpected is outside of that schema, not alphanumeric.
Exactly.
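A rough sketch of what such a check in a loader could look like, in Python. The allowed set is an assumption based on the symbols quoted earlier in this thread, not the actual XSD pattern, and `unexpected_characters` and `check_name` are hypothetical helpers, not part of sklearn or openml-python. The sketch only emits a warning locally; posting to an issue tracker is left out.

```python
import re
import warnings

# ASSUMPTION: anything outside alphanumerics and _ - . ( ) , counts as
# "unexpected". The real schema may allow a different set.
_UNEXPECTED = re.compile(r"[^a-zA-Z0-9_\-\.\(\),]")


def unexpected_characters(name):
    """Return the sorted, de-duplicated characters outside the assumed set."""
    return sorted(set(_UNEXPECTED.findall(name)))


def check_name(name):
    """Warn if `name` has unexpected characters; return True if it is clean."""
    bad = unexpected_characters(name)
    if bad:
        warnings.warn(
            f"Name {name!r} contains characters outside the assumed "
            f"schema: {bad}. Please consider reporting this."
        )
    return not bad
```

Warning instead of raising keeps a loader usable for legacy datasets that adhere to an older schema while still surfacing the unexpected symbol.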
What are the requirements for dataset names? And is that documented somewhere?