
Uploading of implementations with neither source nor binary file #22

Closed: berndbischl closed this issue 11 years ago

berndbischl commented 11 years ago

While thinking about uploading our first experiments, I noticed that sometimes I may not want to upload either a source file or a binary file.

This mainly concerns applying "standard methods" from libraries. E.g., when I apply the libsvm implementation in the R package e1071, I only need to know the package name and the version number. Uploading the package itself (in binary or source form) makes no sense; it is already hosted on the official CRAN package server.

I could upload a very short piece of code that uses this package and produces the desired predictions. Actually, there are a few more subtle questions involved here, and it might be easier to discuss them briefly on Skype; I would like to hear your opinions on this.

The question basically is how much we want to enable users who download implementations to rerun the experiments in a convenient fashion.

ltorgo commented 11 years ago

A few comments here. To me, what one uploads from R is a source file, even if it contains a simple script calling a standard function in some R package. Uploading the package itself makes no sense, I agree.

In my opinion this is no different from the other tools. For instance, a KNIME node is for me equivalent to an R function. You do not upload these functions, just as you do not upload KNIME nodes. What you upload are KNIME workflows, which are basically sequences of KNIME nodes. So what you upload in R is the same thing: a sequence of R function calls, i.e. a script. If a script is simply a single function call, one may wonder whether it makes sense to share it at all, so I would say that "interesting" scripts worth sharing will typically do a bit more than call an out-of-the-box algorithm that is already available in some R package.

Luis


joaquinvanschoren commented 11 years ago

I have been thinking about this as well. Here is what I propose:

1) Currently, uploading a dataset or implementation requires a file (a source, binary, or dataset file). This does not always make sense. Sometimes the code or dataset is hosted elsewhere (CRAN, SourceForge, ...), the dataset is too big to store, the dataset comes from a web service, or the user only wants to host the implementation/dataset on her own server. In all those cases it should be possible to just provide a URL. The server should then check that it points to a valid resource and compute a checksum so it can detect when the version changes.

2) When your run uses a standard library method (e.g. libsvm from package X), you upload it as an implementation with a name, version, description, and a URL. That URL could, for instance, point to a file on CRAN.

3) The more I think about it, the more the implementation name-version combos seem like a bad idea. Maybe implementations should just have a numerical id (1, 2, 3), like the datasets, with a separate name and version. You then check whether an implementation exists either by giving the name and version separately or by giving the id. Storing implementations under a simple id also helps with fool-proofing: sometimes a method's version stays unchanged while the library changes, which may actually produce different results, so a new implementation id should be created if the library version changes. Also, a user may have changed the code without changing the version number; we could catch that by computing a checksum on the source/binary. If the checksum changes, that should also produce a new implementation id (see the sketch after this list).

4) The call you make to start a procedure is not something we currently store, but I think it is valuable information for novice users who are less familiar with the libraries/tools used, or for when we want to automatically rerun experiments (not a current requirement, but good to store the information for). My proposal is NOT to upload a new implementation for that, but to add a new optional attribute to the run upload, say 'start_command', stating how you start the procedure. It should probably not be required, as I can imagine situations where there is no call (e.g. tools without a CLI).
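To make the checksum idea in point 3 concrete, here is a minimal R sketch, assuming a hypothetical in-memory registry standing in for the server's implementation table; the function names and the id-assignment logic are invented for illustration and are not the actual OpenML server code.

```r
library(tools)  # for md5sum()

# 'registry' stands in for the server's implementation table:
# a named list mapping "name_version" to list(id, checksum).
next_id <- function(registry) {
  if (length(registry) == 0L) return(1L)
  max(vapply(registry, function(x) x$id, integer(1L))) + 1L
}

find_or_create_implementation <- function(name, version, source_file, registry) {
  checksum <- unname(md5sum(source_file))   # MD5 of the uploaded source/binary
  key      <- paste(name, version, sep = "_")
  known    <- registry[[key]]
  if (!is.null(known) && identical(known$checksum, checksum)) {
    # same name/version and unchanged code: reuse the existing id
    return(list(id = known$id, registry = registry))
  }
  # new name/version, or same name/version with changed code: issue a fresh id
  # (a real server would keep the old entry instead of overwriting it here)
  id <- next_id(registry)
  registry[[key]] <- list(id = id, checksum = checksum)
  list(id = id, registry = registry)
}
```

Runs would then reference the returned numeric id rather than a name/version string.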

This implies the following changes:

What do you think? Does this solve our current/future problems?

Cheers, Joaquin

joaquinvanschoren commented 11 years ago

Any comments? Do you think we should move in this direction?

berndbischl commented 11 years ago

Hi,

Apologies for the delay regarding this important issue.

I mainly agree with what you write, especially about linking to publicly available material on other servers. As a side note: you already uploaded things to OpenML that are simply available as part of WEKA, right? Just asking, not a criticism.

But here is my main point, and I want to discuss this properly before we start changing things:

What is actually the formal idea of the thing we call an "implementation"? (For now, think about our current classification/regression tasks.)

Here are three options:

a) It can be anything, as long as it gives other users an idea of what happened in the experiment.

b) It is a machine learning algorithm. This is then an "object" that has a training and a prediction method.

c) It is a workflow that takes an OpenML task, resamples it, and produces the OpenML output. So it is like a function: task_id, parameters --> resampled run result

The documentation seems to be torn between b) and c): "An implementation can be a single algorithm or a composed workflow."
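For concreteness, here is a rough R sketch of the difference between b) and c), using svm() from the e1071 package as the learner. downloadOpenMLTask() and resampleWithTask() are placeholder names invented for this sketch; no such functions exist in OpenML under these names.

```r
library(e1071)

# Option b): the implementation is a learner "object" with a training and a
# prediction method (here simply a list of two functions wrapping e1071::svm).
svm_learner <- list(
  train   = function(formula, data, ...) svm(formula, data = data, ...),
  predict = function(model, newdata)     predict(model, newdata = newdata)
)

# Option c): the implementation is a workflow mapping (task_id, parameters) to
# a resampled run result, delegating the actual resampling to helper machinery.
svm_workflow <- function(task_id, ...) {
  task <- downloadOpenMLTask(task_id)        # placeholder: fetch data, target, folds
  resampleWithTask(task, svm_learner, ...)   # placeholder: run the learner over the folds
}
```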

joaquinvanschoren commented 11 years ago

As a side note: you already uploaded things to OpenML that are simply available as part of WEKA, right? Just asking, not a criticism.

Yes, I just registered the algorithm used and gave the correct version of the weka jar as the binary. That way the results are linked to the algorithm name, and people can download that version of weka to repeat the experiment.

What is actually the formal idea of the thing we call an "implementation"?

I see your point. I would say the implementation is the piece of software that produces the output, but separate from the evaluation procedure. Does that make sense? As such, it should not be the OpenML workflow/script itself. It should be named after the procedure that actually produces the output, e.g. 'R.libsvm', not 'openmlworkflow123'.

I should be able to run the implementation with the same evaluation procedure and get the same result, but I should also be able to run the same implementation in a different evaluation procedure.

If you believe we should also store the OpenML-wrapper, I would store that separately?

That, at least, is how I have done it until now. What do you think?

Cheers, Joaquin


berndbischl commented 11 years ago

Yes, I just registered the algorithm used and gave the correct version of the weka jar as the binary. That way the results are linked to the algorithm name, and people can download that version of weka to repeat the experiment.

But later on we would encourage people to simply provide a URL to that WEKA release on the official WEKA page, right?

I see your point. I would say the implementation is the piece of software that produces the output, but separate from the evaluation procedure. Does that make sense? As such, it should not be the OpenML workflow/script itself. It should be named after the procedure that actually produces the output, e.g. 'R.libsvm', not 'openmlworkflow123'.

I should be able to run the implementation with the same evaluation procedure and get the same result, but I should also be able to run the same implementation in a different evaluation procedure.

The second paragraph helped a bit, but I am still unsure what I should produce in R, especially w.r.t. your last sentence above. Could we discuss this briefly on Skype next week? I assume that would work a lot better and have less potential for misunderstandings.

Also: could you please send me some kind of screenshot of a workflow in WEKA that you would upload?

If you believe we should also store the OpenML-wrapper, I would store that separately?

This is closely connected to the points above; let's postpone it for now.

joaquinvanschoren commented 11 years ago

Sorry about the slow reply, many deadlines :/.

I guess a Skype call is a good idea. Do you want to do it this afternoon? Otherwise Wednesday or Thursday is also ok for me. I guess this is just between you, me and Jan? If others want to join, let me know.

Jan, what exactly are you doing in WEKA/RapidMiner? Do you upload the whole workflow (including OpenML operators) or only the workflow/algorithm that actually produces the predictions given train/test data?

Cheers, Joaquin

joaquinvanschoren commented 11 years ago

Sorry, I assume Luis would also be interested?

Cheers, Joaquin

berndbischl commented 11 years ago

Sorry, I hurt my back a bit over the last few days. I would prefer to Skype next week, preferably late in the afternoon / evening. Monday till Wednesday are good, Thursday and Friday are not.

joaquinvanschoren commented 11 years ago

Get well soon! Is Monday at 17h CET ok?

Cheers, Joaquin


berndbischl commented 11 years ago

Get well soon! Is Monday at 17h CET ok?

Yes. Noted.

joaquinvanschoren commented 11 years ago

Bernd and I discussed this issue on Skype today, and I think it is important that we all at least think about this briefly.

We need to clearly state what we expect to be uploaded as an implementation. It must be easy for developers to upload what they have developed, but it should also be easy for people who discover a nice implementation on OpenML to download and use it. There should be as little second-guessing as possible.

The basic signature of an implementation (= algorithm, script, workflow, ...) could simply be the following:

implementation(openmltask, parameters...) -> expected outputs

Here, 'openmltask' is a language-specific object that represents an OpenML task. We can provide helper functions (for R, Java, Python,...) or workflow operators that create such a task object given a task_id by talking to the OpenML API and downloading the necessary data. Thus something like: createOpenMLTask(task_id) -> openmltask.

How we build that openmltask and how we send the results back is thus NOT part of the uploaded implementation. For workflows, this means uploading the subworkflow between the 'import OpenML task' and 'export OpenML result' operators. For R this means uploading the function that consumes the openmltask and produces the required output. Does this seem feasible and practical?

Note that this allows you to create 'custom' openmltasks that do NOT belong to a task_id, by writing your own function/operator, e.g. createOpenMLTask(input1, input2,...) -> openmltask

This can be useful when you have proprietary data that you don't want to upload but still want to run experiments in the same fashion (e.g. with the same cross-validation folds) as other OpenML experiments. Or maybe you want to experiment with new task types. You won't be able to upload the results of these experiments, though, until the dataset or task type is added to OpenML and the corresponding task_ids are generated. Still, it allows you to experiment freely beyond the tasks that we provide.

You can also provide a list of parameters that belong to your implementation. Say you have a new SVM implementation; it might look like this:

mySVMWorkflow(openmltask, C, sigma)

As such, you can run the same implementation with many different parameter settings (as long as you externalise them). The parameter values must be sent along when you upload your run. If you decide to externalise/add a new parameter, it should be handled as a new implementation since the signature has changed.
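As a purely illustrative R sketch of this signature, here is what such a workflow could look like with svm() from the e1071 package underneath. The structure of 'openmltask' (data, formula, folds) and the helper createOpenMLTask() are assumptions taken from the proposal above, not existing OpenML functions.

```r
library(e1071)

# Proposed shape: implementation(openmltask, parameters...) -> expected outputs.
# C and sigma are externalised parameters; their values are reported with the
# run, and changing this signature would count as a new implementation.
mySVMWorkflow <- function(openmltask, C = 1, sigma = 0.1) {
  preds <- lapply(openmltask$folds, function(fold) {
    model <- svm(openmltask$formula,
                 data  = openmltask$data[fold$train, ],
                 cost  = C,
                 gamma = sigma)   # sigma is passed as the RBF gamma here
    data.frame(row_id     = fold$test,
               prediction = predict(model, openmltask$data[fold$test, ]))
  })
  do.call(rbind, preds)   # the expected outputs, ready to be uploaded as a run
}

# Usage with the hypothetical helper: build the task object from a task_id,
# run the workflow, then send the predictions and parameter values to OpenML.
# task   <- createOpenMLTask(task_id)
# output <- mySVMWorkflow(task, C = 10, sigma = 0.05)
```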

When you want to repeat an OpenML experiment, you download the implementation and the helper functions/operators, and you start it again on the task_id (and parameter setting) in question. Or you can run it with other parameters or on your 'custom' task.

I think we should still provide a way to indicate when you are simply wrapping a library function (e.g. WEKA's J48). Also, when you upload a workflow it should be clear which environment you need to run it in. Should we add a list of dependencies together with the uploaded implementation, with URLs where you can download them?

Please let us hear what you think. We should decide upon this fairly quickly.

Thanks, Joaquin

PS. For now, this is a recommendation (a best practice); we won't be blocking implementations that look different just yet.

berndbischl commented 11 years ago

I think we should still provide a way to indicate when you are simply wrapping a library function (e.g. WEKA's J48).

Should we have a special attribute in the implementation XML that distinguishes between "custom workflows" and "library algorithms"?

Also, when you upload a workflow it should be clear which environment you need to run it in. Should we add a list of dependencies together with the uploaded implementation, with URLs where you can download them?

I thought we had this already? An element "dependencies" in the XSD for implementations?

dominikkirchhoff commented 11 years ago

I have a question: what if I'm a lazy user and I want to run (the newest version of) a certain standard method? Do I have to 'upload' it to check whether it's already there and get an id, or will there be a way to ask the server for all ids for a given name?

The question is: Can I get the results of the SQL query

'SELECT id FROM implementation WHERE name = "classif.rpart"'

without going to the website and typing it manually?

janvanrijn commented 11 years ago

Yes, you can. There is a query API (mostly used by the frontend) that accepts any query and returns a JSON answer. The parameter q is the query (preferably URL-encoded).

In your case this would be http://www.openml.org/api_query/?q=SELECT%20id%20FROM%20implementation%20WHERE%20name%20=%20%22classif.rpart%22

Please let me know if you are interested in something more robust, integrated in the current API.
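For example, here is a minimal R sketch of that call using the query quoted above; it assumes the jsonlite package, and the exact shape of the returned JSON is an assumption.

```r
library(jsonlite)  # fromJSON() can read directly from a URL

query  <- 'SELECT id FROM implementation WHERE name = "classif.rpart"'
url    <- paste0("http://www.openml.org/api_query/?q=", URLencode(query, reserved = TRUE))
answer <- fromJSON(url)   # parsed JSON answer from the query API
print(answer)             # inspect the returned implementation id(s)
```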

berndbischl commented 11 years ago

I think we should really start defining what we formally mean by an implementation quite soon. Currently I see these scenarios:

- a library-provided algorithm that the user simply applied (maybe he changed some parameters)
- a combination of library-provided pieces that the user chose to put together, e.g. a feature filter followed by an SVM
- a library-provided piece, extended with code from the user, e.g. he wrote his own preprocessing, then applied an SVM
- a completely self-written, custom algorithm

Note that these are just some things I came up with in a few seconds of brainstorming; it is not supposed to be a formal ontology.

Also, we need to define what we really mean by the version number of an algorithm. All of these definitions, and possibly other things related to this issue, need to be documented in one place for both developers and uploaders.

joaquinvanschoren commented 11 years ago

I prefer one simple definition, without many different scenarios. It needs to be easily explained to users. I remember we went back and forth between different options, but did we actually settle on a final solution? In any case, I've given it some thought, and here is a proposal:

Uploading

The cleanest solution would be to just upload the code that actually produces the results. For R, that means the script that reads the task, does whatever it wants, and then uploads the results. For RM/KNIME, it is the workflow that includes the operators/nodes for reading the task and uploading the result. The WEKA Experimenter is a bit special; probably the best solution is to upload a Java wrapper that starts the experiment (even if it was originally run from the GUI).

In all cases, the task id is an input, next to other inputs (parameter settings). The implementation description should include (I believe this is covered already):

A run is a file with results, linked to the implementation_id and task_id (both returned by the server) and to any additional parameter settings passed to the implementation.

Downloading

The user who downloads the implementation should be able to easily run it on a new task. In the simplest case, she just inputs a new task_id and runs it. If there are other parameters, it should be clear from the implementation description how to set them.

Searching

Here is what I was most worried about during previous discussions: how can I search for the experiments with, e.g., libsvm, and how do I compare libsvm with other algorithms? I must also know which versions of libsvm are used. The current way of doing it (tagging each implementation with a general name for the learning algorithm) is probably untenable.

The simplest thing to do would be to just 'dump' the workflow/script to text and index that. If I then search for 'libsvm', I will find all implementations, i.e. scripts/workflows (I still struggle with the term 'implementations' :)), that somehow mention libsvm. It is then up to the user to decide which implementations to select.

Additionally, I do see some benefit in using a more structured description for implementations:

We can then build a more powerful structured search and better 'present' the implementation to the user online.

Caveat: in theory, you could write an implementation that takes an algorithm name as an input parameter. Not sure how to make that searchable.

Versioning

About the version numbers: what is the question exactly? I believe it is best if the user can choose a name and version number during upload, but we always keep a checksum on the server so that no two different implementations are uploaded with the same name/version. On the server, we store implementations based on a unique numeric id, just like datasets. This id is referenced when you upload runs, and we offer an API call to get an id given a name/version combo.

Does that sound like a clear description? Maybe I slightly differ from the current specs.

Let me know.

Cheers, Joaquin


berndbischl commented 11 years ago

This thread has become increasingly complicated. Let's keep it open but discuss it very soon on Skype. Very soon = some time before Christmas.

joaquinvanschoren commented 11 years ago

I agree, this thread has gone off-topic, so I will close it and open a new one with the conclusion of our last Skype call.