paris-saclay-cds / ramp-workflow

Toolkit for building predictive workflows on top of pydata (pandas, scikit-learn, pytorch, keras, etc.).
https://paris-saclay-cds.github.io/ramp-docs/
BSD 3-Clause "New" or "Revised" License

Collecting feature requests around a developmental feature for RAMP #250

Open kegl opened 3 years ago

kegl commented 3 years ago

When RAMP is used for developing models for a problem, we may want to tag certain versions of a submission, and even problem.py, together with the scores. One idea is to use git tags. For example, after running ramp-test ... --save-output, one could run another script that git-adds problem.py, the submission files, and the scores in training_output/fold_<i>, then commits and tags with a user-defined tag (plus maybe a prefix indicating that it is a scoring tag, so later we may automatically search for all such tags).
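For illustration, a minimal sketch of what such a tagging script could look like (the script itself, the fold directories under the submission's training_output and the scores.csv files are assumptions about the --save-output layout, not an existing RAMP command):

import subprocess
import sys
from glob import glob

def tag_scored_submission(submission, tag, prefix='score'):
    # Stage problem.py, the submission files and the saved fold scores
    # (assumed to live under submissions/<submission>/training_output/fold_<i>).
    files = ['problem.py']
    files += glob(f'submissions/{submission}/*.py')
    files += glob(f'submissions/{submission}/training_output/fold_*/scores.csv')
    subprocess.run(['git', 'add', *files], check=True)
    subprocess.run(
        ['git', 'commit', '-m', f'score {submission} ({tag})'], check=True)
    # The prefix makes it easy to list all scoring tags later:
    # git tag --list 'score/*'
    subprocess.run(['git', 'tag', f'{prefix}/{tag}'], check=True)

if __name__ == '__main__':
    tag_scored_submission(*sys.argv[1:3])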

zhangJianfeng commented 3 years ago
  1. When loading the data in RAMP, it seems the training data is read twice. When the data is big, this is a bit slow.
  2. Is it possible to parallelize the CV process?
gabriel-hurtado commented 3 years ago

Adding one feature that would be useful, at least to me: it would be great to have the ability to import more code from elsewhere in a submission, allowing multiple submissions to share some code. Right now it can be done by creating a library and importing it, which is a bit tedious. @albertcthomas mentioned this could perhaps be done in a similar way to how pytest does it: they have a conftest.py file for code that you want to reuse across different test modules.


albertcthomas commented 3 years ago

@albertcthomas mentioned this could perhaps be done in a similar way to how pytest does it: they have a conftest.py file for code that you want to reuse across different test modules.

Well, it is more like "this makes me think of conftest.py, which can be used to share fixtures", but I don't know what happens when you run pytest and I am not sure the comparison goes very far :). As written in the pytest docs: "The next example puts the fixture function into a separate conftest.py file so that tests from multiple test modules in the directory can access the fixture function". This feature is discussed in issue #181.
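For reference, a minimal sketch of the pytest mechanism being discussed (the file names follow the pytest convention; the fixture content is made up):

# conftest.py -- fixtures defined here are visible to all test modules
# in the directory, without any explicit import.
import pytest

@pytest.fixture
def sample_data():
    return [1, 2, 3]

# test_example.py -- uses the shared fixture simply by naming it
def test_sum(sample_data):
    assert sum(sample_data) == 6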

illyyne commented 3 years ago

  1. I find that the data-reading step takes too much time: it is slower than reading the data without RAMP.
  2. It would be great if the mean score were also saved along with the bagged one.
  3. Propose a LaTeX syntax for the results.
  4. When the output is saved, it would be better to also save the experiment conditions (data label, tested hyperparameters, etc.) and keep everything somewhere, either locally or in the cloud, so it can be checked later.

LudoHackathon commented 3 years ago

Here are some features that could help:

LudoHackathon commented 3 years ago

From my (little) experience with RAMP, what made people a bit reluctant to use it was that it was too high level. Meaning that we don't see the classical sequential process we are used to seeing in an ML script (load data, instantiate model, train it, test it). As an example, Keras (not the same purpose as RAMP) embedded some parts of the script to minimize the main script but kept the overall spirit of the classical script, making it as understandable as the original one. Using ramp-test on the command line may make RAMP more obscure to new users. Maybe having a small script (like the one already in the documentation, for example) giving the user a more pythonic way to play with it, without having to use ramp-test as a command line, could make machine learners more willing to use it.
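For illustration, the kind of "classical" sequential script being referred to (a generic scikit-learn sketch, not RAMP code):

# The sequential steps users expect to see spelled out:
# load data, instantiate a model, train it, test it.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))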

agramfort commented 3 years ago

I have heard this many times too. Debugging is a pain, etc. To fix this, I now stick to RAMP kits where you need to return a sklearn estimator that implements fit and predict, so you can replace ramp-test with sklearn's cross_val_score and just use your favorite environment to inspect / debug / run (vscode, notebook, google colab, etc.)
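A sketch of what this looks like in practice, assuming a kit whose submissions define a get_estimator() returning a scikit-learn estimator and whose problem.py exposes get_train_data() (run from the kit's root directory; the submission path is just an example):

import importlib.util
from sklearn.model_selection import cross_val_score
import problem  # the kit's problem.py

# Load the submission's estimator file by path (the path is hypothetical).
spec = importlib.util.spec_from_file_location(
    'estimator', 'submissions/starting_kit/estimator.py')
estimator_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(estimator_module)

X_train, y_train = problem.get_train_data()
scores = cross_val_score(estimator_module.get_estimator(), X_train, y_train, cv=5)
print(scores.mean(), scores.std())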

kegl commented 3 years ago

Calling ramp-test from a notebook is as simple as

from rampwf.utils import assert_submission
assert_submission(submission='starting_kit')

This page https://paris-saclay-cds.github.io/ramp-docs/ramp-workflow/advanced/scoring.html now contains two code snippets that you can use to call lower-level elements of the workflow and emulate a simple train/test and cross-validation loop. @LudoHackathon do you have a suggestion for what else would be useful? E.g. an example notebook in the library?
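For convenience, here is roughly what those lower-level calls look like when put together (a sketch in the spirit of that doc page; the exact Predictions and score-type interfaces can differ between kits):

import problem  # the kit's problem.py; run from the kit's root directory

X_train, y_train = problem.get_train_data()
X_test, y_test = problem.get_test_data()

trained_workflow = problem.workflow.train_submission(
    'submissions/starting_kit', X_train, y_train)
y_pred = problem.workflow.test_submission(trained_workflow, X_test)

score_type = problem.score_types[0]
score = score_type.score_function(
    problem.Predictions(y_true=y_test), problem.Predictions(y_pred=y_pred))
print(score_type.name, score)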

agramfort commented 3 years ago

the doc says:

trained_workflow = problem.workflow.train_submission('submissions/starting_kit', X_train, y_train)

after all these years I did not know this :'(

this should be explained in the kits to save some pain to students

albertcthomas commented 3 years ago

this should be explained in the kits to save some pain to students

wasn't this the purpose of the "Working in the notebook" section of the old titanic notebook starting kit?

kegl commented 3 years ago

Yes, @albertcthomas is right, but the snippet in the doc is cleaner now. I'm doing this decomposition in every kit now, see for example line 36 here https://github.com/ramp-kits/optical_network_modelling/blob/master/optical_network_modelling_starting_kit.ipynb. This snippet is even simpler than the one in the doc but less general: it only works when the Predictions class does nothing with the input numpy array, which is the case most of the time (regression and classification). Feel free to reuse it.

albertcthomas commented 3 years ago

This page https://paris-saclay-cds.github.io/ramp-docs/ramp-workflow/advanced/scoring.html now contains two code snippets that you can use to call lower-level elements of the workflow and emulate a simple train/test and cross-validation loop. @LudoHackathon do you have a suggestion for what else would be useful? E.g. an example notebook in the library?

The page does a good job of showing how you can call the different elements (and thus play with them, make plots, ...).

  1. For better visibility we might clearly say that there is a command-line interface based on ramp-test and a way of calling the needed functions easily in a python script (or notebook). Of course we could add an example showing the python script interface.

  2. More importantly, maybe think of what can break when you go from one interface to the other. For instance, imports from other modules located in the current working directory. This still forces us/the students to work with submission files. I think that using the "scikit-learn kits" eases the transfer of your scikit-learn estimator from your exploratory python script/notebook to a submission file and makes sure that this works in most cases. I let @agramfort confirm this :)

  3. Instead of

    from rampwf.utils import assert_submission
    assert_submission(submission='starting_kit')

    we could have something like

    from rampwf import ramp_test
    ramp_test(submission='starting_kit')
  4. Debugging is a pain etc.

For debugging with the command line, I have to say that I rely a lot on adding a breakpoint where I want to enter the debugger. However, this cannot be done post-mortem, unlike %debug in IPython or Jupyter. For this we could have a --pdb or --trace flag as in pytest. But it's true that it's easier to try things and play with your models/pipelines when not using the command line.
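To make the breakpoint workaround concrete, a small sketch of a submission file with a breakpoint in it (the file path and estimator are just an example):

# e.g. in submissions/starting_kit/estimator.py
from sklearn.linear_model import LogisticRegression

def get_estimator():
    breakpoint()  # ramp-test stops here and drops into pdb when training
    return LogisticRegression()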

albertcthomas commented 3 years ago

use your favorite env to inspect / debug / run (vscode, notebook, google colab etc.)

giving the user a more pythonic way to play with it, without having to use ramp-test as a command line

This is an important point. Two or three years ago I was rarely using the command line and I always preferred staying in a python environment. Users should be able to use their favorite tool to play with their models, and we should make sure that at the end it will work when calling ramp-test on the command line.

kegl commented 3 years ago
  1. OK
  2. no comment
  3. OK. In fact we may put the python call in focus and tell them to use the command-line ramp-test as a final unit test, the same way one would use pytest. I think the cleanest way would be to have ramp_test defined in https://github.com/paris-saclay-cds/ramp-workflow/blob/advanced/rampwf/utils/cli/testing.py and have main just call ramp_test with the exact same signature; this way it's certain that the two calls do the same thing (see the sketch after this list).
  4. I prefer not adding the command line feature if everything can be done from the python call.
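A sketch of the refactoring proposed in 3., with a simplified signature (the real CLI has more options, so the argument names here are only illustrative):

import click

def ramp_test(submission='starting_kit', ramp_kit_dir='.', save_output=False):
    """Train, test and score one submission; importable from python."""
    ...  # the body currently living in the CLI's main() would move here

@click.command()
@click.option('--submission', default='starting_kit')
@click.option('--ramp-kit-dir', default='.')
@click.option('--save-output', is_flag=True)
def main(submission, ramp_kit_dir, save_output):
    # The command line becomes a thin wrapper forwarding the same signature,
    # so the two entry points cannot diverge.
    ramp_test(submission=submission, ramp_kit_dir=ramp_kit_dir,
              save_output=save_output)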
albertcthomas commented 3 years ago
3. I prefer not adding the command line feature if everything can be done from the python call.

is this for 4. and --pdb?

agramfort commented 3 years ago

doing:

import imp
feature_extractor = imp.load_source('', 'submissions/starting_kit/feature_extractor.py')
fe = feature_extractor.FeatureExtractor()
classifier = imp.load_source('', 'submissions/starting_kit/classifier.py')
clf = classifier.Classifier()

This is, to me, too complex and should be avoided. We have a way suggested by @kegl based on the rampwf function.

Now I agree with @albertcthomas: leaving the notebook to edit python files is a bit error prone.

What I have shown to students is to use the %%file magic to write a cell to a file on disk.
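For example, in a notebook cell (%%file is IPython's alias of %%writefile; the file path and content are just an example):

%%file submissions/starting_kit/estimator.py
from sklearn.linear_model import LogisticRegression

def get_estimator():
    return LogisticRegression()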

Anyway, I think we should show in each notebook what the easy way is.

The ramp-test command is an easy way for us to know that it works on their systems, but not the most agile way when they need to come up with their own solution.

kegl commented 3 years ago

import imp
feature_extractor = imp.load_source('', 'submissions/starting_kit/feature_extractor.py')
fe = feature_extractor.FeatureExtractor()
classifier = imp.load_source('', 'submissions/starting_kit/classifier.py')
clf = classifier.Classifier()

This is, to me, too complex and should be avoided. We have a way suggested by @kegl based on the rampwf function.

I'm not sure what you mean here. We're using import_module_from_source now.

agramfort commented 3 years ago

I copied these lines from the titanic starting kit, which is used to get students started on RAMP.

kegl commented 3 years ago
3. I prefer not adding the command line feature if everything can be done from the python call.

is this for 4. and --pdb?

yes

gabriel-hurtado commented 3 years ago

Another feature that would be nice to have: an option to separate what is saved from what is printed to the console. This would allow saving extensive metrics without flooding the terminal.

kegl commented 3 years ago

Partial fit for models where e.g. the number of trees or the number of epochs is a hyperparameter. This would mainly be a feature used by hyperopt (killing trainings early), but maybe also useful as a CLI parameter.
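To illustrate the kind of partial fit meant here, a generic scikit-learn sketch (not an existing RAMP API) where the number of trees grows incrementally, so a hyperopt loop could score after each step and kill the training early:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), random_state=0)
clf = RandomForestClassifier(warm_start=True, random_state=0)
for n in (10, 20, 40, 80):
    clf.n_estimators = n
    clf.fit(X_train, y_train)  # with warm_start, only the new trees are fitted
    print(n, clf.score(X_test, y_test))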

kegl commented 3 years ago

Standardized LaTeX tables computed from saved scores. Probably two steps: first, collect all scores (of selected submissions and data labels) into a well-designed pandas table; then, a set of tools to create LaTeX tables, scores with confidence intervals, and also paired tests. I especially like the plots and score presentation in https://link.springer.com/article/10.1007/s10994-018-5724-2.
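A minimal sketch of the first step with pandas (the scores.csv layout under training_output is an assumption about what --save-output writes):

from pathlib import Path
import pandas as pd

frames = []
for path in Path('submissions').glob('*/training_output/fold_*/scores.csv'):
    df = pd.read_csv(path)
    df['submission'] = path.parts[1]  # submission name
    df['fold'] = path.parts[3]        # fold_<i>
    frames.append(df)

scores = pd.concat(frames, ignore_index=True)
# One row per submission, averaged over folds; a LaTeX table comes for free.
summary = scores.groupby('submission').mean(numeric_only=True)
print(summary.to_latex(float_format='%.3f'))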

albertcthomas commented 3 years ago

When RAMP is used for developing models for a problem, we may want to tag certain versions of a submission, and even problem.py, together with the scores. One idea is to use git tags. For example, after running ramp-test ... --save-output, one could run another script that git-adds problem.py, the submission files, and the scores in training_output/fold_<i>, then commits and tags with a user-defined tag (plus maybe a prefix indicating that it is a scoring tag, so later we may automatically search for all such tags).

It would be great to have a look at MLflow, which @agramfort pointed out to me. There are some parts that we could use, for instance the tracking one.
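For reference, the tracking part of MLflow looks like this (generic MLflow usage with made-up parameter and metric names, not a RAMP integration):

import mlflow

with mlflow.start_run(run_name='starting_kit'):
    mlflow.log_param('submission', 'starting_kit')
    mlflow.log_param('data_label', 'full')   # experiment conditions
    mlflow.log_metric('valid_rmse', 0.123)   # per-fold or bagged scores
    mlflow.log_artifact('problem.py')        # keep the exact setup around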

martin1tab commented 3 years ago
  1. When loading the data in RAMP, it seems the training data is read twice. When the data is big, this is a bit slow.
  2. Is it possible to parallelize the CV process?

  1. Yes, the training data is read twice for the moment, since X_train, y_train, X_test, y_test = assert_data(ramp_kit_dir, ramp_data_dir, data_label) is called twice in the testing.py module. The same issue appears with the 'problem' variable, which is loaded 5 times. It is possible to fix this by making the testing module object oriented; attributes corresponding to each of these variables (X_train, X_test, ...) could then be created and we would not need to repeat calls to some functions. But do we agree to add more object-oriented code? (A sketch of a lighter alternative is below.)
  2. Yes, it is.
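A sketch of what 1. could look like without a full object-oriented rewrite, simply memoising the expensive call (an illustration, not the actual testing.py; the import path is inferred from the comment above):

from functools import lru_cache

from rampwf.utils.testing import assert_data  # the call quoted above

@lru_cache(maxsize=None)
def get_data(ramp_kit_dir, ramp_data_dir, data_label=None):
    # Repeated calls with the same arguments return the cached tuple
    # instead of re-reading the data from disk.
    return assert_data(ramp_kit_dir, ramp_data_dir, data_label)

# X_train, y_train, X_test, y_test = get_data('.', '.', None)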