mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

gsoc-visualization #289

Closed zmjones closed 9 years ago

zmjones commented 9 years ago

In the near future (before May 5) I plan on refactoring plotROCRCurves to use plotROC which uses ggplot2 instead of base graphics. This package offers some extra functionality (compared to what is available now) which I'll document. I also hope to get at least one other smallish feature done by then.

One option would be extending plotLearnerPrediction to cases with 3 features. I think the two obvious things to do here are to use one of the 3D plotting packages (I think plot3Drgl is nice), and to use facetting for the third feature; I'd definitely like to do the latter. With a discrete feature this is easy, but it might be nice to add the ability to discretize one of the continuous features as well. We could also plot 4 features by using the background color. In general it would be possible to layer on additional features in this way, but it seems to have diminishing returns in terms of interpretability after 2 or 3 features.

Another thing I could possibly do is add an option to performance that lets you apply a measure to a subset of the feature space. I find this very useful for exploring the fit of learners, especially with data that is structured in some way. I haven't looked at the code for performance yet, so I don't have an idea how much work that would entail. One problem I can see is that if some of the cells of the grouping are small, the variance might be quite large. I am not sure whether that is out of the scope of the project. Is this something others would like to have?

When I get back (around May 16-17) I would like to finish up any residual work from the above first. I'd like to talk to Julia/Lars/Bernd about what I do next. I've had my nose in the EDA related functionality lately and so my inclination is to start working on that first. Alternatively I could start work on producing interactive versions of the existing plotting functionality.

I have found some papers recently that I think are worth prioritizing above the minor things in my proposal (dependent data resampling methods and standard error estimates for random forests and other ensemble methods). In particular Hooker 2012 and Drummond and Holte 2006.

larskotthoff commented 9 years ago

Absolutely. Also keep in mind the ggvis stuff; in particular whether using ggvis would make it significantly easier/harder to provide customisation points.

berndbischl commented 9 years ago

Regarding importing ggplot2 vs. depending on it: IIRC we did this the first time, and there was a problem and I had to change it to pass CHECK.

berndbischl commented 9 years ago

Would it make sense to return the ggplot object from the functions to allow the plot to be customised in the usual ggplot fashion?

We already do that for exactly that reason?

berndbischl commented 9 years ago

Further comments on the additional arguments like linesize / pointsize and so on:

I am open to any kind of discussion to streamline the function signatures and remove "clutter", but

a) The plot needs to look good in its defaults for "normal use cases"

b) Stuff like pointsize must be easily changeable, as this is what happens most of the time.

This is why I included it in the signature currently.

zmjones commented 9 years ago

I think the ggplot2 defaults do a pretty good job of (a), and (b) is easily changeable without having an argument to the function. So if we call geom_point() and the user wants geom_point(size = 50), the user calls our function and adds + geom_point(size = 50), which will draw over the layer from the call to our function. The only time this doesn't work is when the layer uses data not available to the user, but we can just make it so that is never the case.
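Zach's point can be sketched like this (function and data names are illustrative, not mlr's actual code):

```r
library(ggplot2)

# Hypothetical plotting function in the style mlr uses: build the plot
# and return the ggplot object so the caller can customise it further.
plotExample = function(df) {
  ggplot(df, aes(x = x, y = y)) + geom_point()
}

p = plotExample(data.frame(x = 1:5, y = (1:5)^2))
# The user changes the point size by adding another layer on top;
# no pointsize argument in the signature is needed:
p = p + geom_point(size = 3)
```

The added `geom_point(size = 3)` layer is drawn over the default one, so visually it takes over without any extra customisation hooks in the function itself.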

zmjones commented 9 years ago

I would like to modify getFilterValues so that it can take multiple methods, which can in turn be visualized with plotFilterValues. I've modified the function so it can do that but I am not sure of the best way to structure the returned object. I was thinking that it should be the same as it is now except instead of method and data, data is a named list of data.frames with the names corresponding to the methods.
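A minimal sketch of the proposed return structure (field names and values are made up for illustration): the object stays as it is now, except `data` becomes a named list of data.frames keyed by filter method.

```r
# Hypothetical FilterValues-style object for two filter methods.
fv = list(
  task = "iris-example",
  data = list(
    information.gain = data.frame(name = c("Sepal.Length", "Sepal.Width"),
                                  value = c(0.45, 0.27)),
    chi.squared = data.frame(name = c("Sepal.Length", "Sepal.Width"),
                             value = c(0.62, 0.41))
  )
)
names(fv$data)  # the filter methods that were computed
```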

schiffner commented 9 years ago

Thanks. Structuring the return object as you suggest sounds reasonable to me.

zmjones commented 9 years ago

Ok did that and issued a PR. It required only minor downstream changes. Are there any objections to doing a similar thing for performance/plotPerfVsThresh before I move on to plotLearnerPrediction? If anyone has ideas for other things that would benefit from plotting (esp. interactive) I'm all ears.

schiffner commented 9 years ago

Thanks for the PR. I am going to have a look now.

What exactly are your plans for performance/plotPerfVsThresh?

zmjones commented 9 years ago

I was just going to change it so that you could pass a list of prediction objects to performance and then plot them using facetting with the ggplot2 version of plotPerfVsThresh and the Shiny interactive stuff for the ggvis version.

larskotthoff commented 9 years ago

I'm wondering whether this is something that should be implemented in the data generation function. It seems to me that the same functionality could be achieved by leaving the data generation functions as they are and modifying the plot functions to take a list of data frames in addition to a single one. This would make the implementation of the data generation functions less complex and allow for more flexibility in the plots (i.e. you don't have to rerun the data generation if you want to change the list of things to show).

zmjones commented 9 years ago

At least in the case of getFilterValues the implementation isn't very complex I don't think. Maybe we could do it both ways? Anyhow I'll think more about this.

zmjones commented 9 years ago

So I definitely see your point but am still not sure. I thought that maybe in the future filter choice might be a tuning parameter, and then you'd want it set up this way. It would be fairly easy to do it your way. If you think this should definitely be the way we go, just let me know and I'll cancel the PR and rework it a bit.

larskotthoff commented 9 years ago

It sounds like changing this later wouldn't be too much work, so I'm happy to merge the PR.

berndbischl commented 9 years ago

Hang on I just discussed this with Zach. IMHO we should have a single data.frame instead of a list for multiple selected filter methods.

schiffner commented 9 years ago

But you are ok with extending getFilterValues for a vector of filter methods, Bernd?

Here are my 2 cents concerning performance/plotThreshVsPerf:

Since we decided to separate plotting from generating the data to be plotted, and to make things more consistent than they are now, I would suggest splitting plotThreshVsPerf into a data generation function (generateThreshVsPerfData) and a plotting function.

generateThreshVsPerfData could then be extended to take a (list of) Prediction, (list of) ResampleResult, BenchmarkResult as input, just like generateROCRCurvesData.

I tend to leave performance as it is.
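One way the extended input handling could look, as a hypothetical sketch (not mlr's real implementation, and the function name is invented): accept a single object or a list of objects and normalise to a list before computing the threshold-vs-performance data.

```r
# Normalise input so that a single Prediction or a list of them is always
# processed as a list; the real code would dispatch on Prediction,
# ResampleResult and BenchmarkResult classes.
asObjectList = function(obj) {
  if (!is.list(obj) || inherits(obj, "Prediction"))
    obj = list(obj)
  obj
}

p = structure(list(), class = "Prediction")  # stand-in for a real Prediction
length(asObjectList(p))          # a single object becomes a one-element list
length(asObjectList(list(p, p))) # a list passes through unchanged
```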

larskotthoff commented 9 years ago

@berndbischl What's the argument for having everything in a single data frame instead of a list?

berndbischl commented 9 years ago

But you are ok with extending getFilterValues for a vector of filter methods, Bernd?

I guess so.

berndbischl commented 9 years ago

What's the argument for having everything in a single data frame instead of a list?

Much easier to use and less redundant. 90% of the time you would merge the dfs in the list anyway. Now you also need to check the order of the features, or re-sort them, and so on.
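Bernd's argument in code form (with made-up data): with a list return value the caller typically ends up doing the merge themselves, so returning one long data.frame with a method column saves that step.

```r
# What the caller would otherwise have to do with a per-method list:
per.method = list(
  information.gain = data.frame(name = c("a", "b"), value = c(0.4, 0.2)),
  chi.squared     = data.frame(name = c("b", "a"), value = c(0.5, 0.3))
)
# Bind into one data.frame, tagging each row with its method.
merged = do.call(rbind, Map(function(m, d) cbind(d, method = m),
                            names(per.method), per.method))
# Note the features also arrive in different orders across methods,
# which the caller would have to re-sort as well.
```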

zmjones commented 9 years ago

Hopefully this evening I will have time to fix a deficiency with the functionality I've added. In all the plots that map data (learners, measures) to aesthetics the user needs to specify which pieces of data get mapped to what. So for example, with plotThreshVsPerf, currently, measures get mapped to color and learners get facetted (or made interactive w/ ggvis). When the scales of the measures differ a lot it would be better to instead facet on the measures and color the learners.

zmjones commented 9 years ago

#324 should fix this. I think the only plots where this makes sense are plotThreshVsPerf and plotLearningCurve, and their ggvis versions.

zmjones commented 9 years ago

I've updated the tutorial with most of the stuff I've done over the past few weeks and am now working on generatePartialPredictionData, which is just a basic partial dependence function that takes a WrappedModel as input.

I plan on removing all of the computation from plotLearnerPrediction. I think that if someone wants to plot a learner trained on a few features they should pass a prediction object to plotLearnerPrediction.

berndbischl commented 9 years ago

I plan on removing all of the computation from plotLearnerPrediction. I think that if someone wants to plot a learner trained on a few features they should pass a prediction object to plotLearnerPrediction.

Probably true. For the very lazy you could still allow passing a learner, so we don't break stuff too much.

zmjones commented 9 years ago

Yea sure I could do that.

larskotthoff commented 9 years ago

Or provide a wrapper function that does both which we could keep :)

berndbischl commented 9 years ago

We should discuss the interface here before stuff gets implemented please.

larskotthoff commented 9 years ago

Sure, what time is good for a meeting this week?

berndbischl commented 9 years ago

Now to be done with it?

zmjones commented 9 years ago

https://gist.github.com/zmjones/63d21d0308752755a3ae

zmjones commented 9 years ago

So I don't know why this didn't occur to me yesterday, but here is the problem with an argument fun that summarizes the distribution at each point in the prediction grid: unless the set of possible functions is restricted so that I can write down what needs to be done in the plot with each piece, you are left with some arbitrary output and no way to plot it automatically. Maybe I should require that the function output a measure of central tendency plus an upper and lower bound, with the option to use just the measure of central tendency?

berndbischl commented 9 years ago

Yes. Let's not make it more difficult than it has to be. So either one function, or three.

zmjones commented 9 years ago

ok, and the default is quantile(x, c(.025, .5, .975))

berndbischl commented 9 years ago

so: extract = list(location = mean, lower = function(x) quantile(x, 0.5), upper = max) or extract = list(location = median) could be user inputs

berndbischl commented 9 years ago

and choose a better name than extract

berndbischl commented 9 years ago

Well, you also don't need to bury the args in a list. Just have three well-named args instead.

berndbischl commented 9 years ago

Arghhh :) OK, you want to pass just a simple function that either returns 1 or 3 values. Your solution is best :)

zmjones commented 9 years ago

:)

So I don't know how informative bounds constructed in this way are. Unless the feature is very important, any summary of the distribution of predictions at each step is going to be somewhat close to the distribution of the predictions without the manipulation (setting the feature to some value). The bounds on the examples I have are all really big.

zmjones commented 9 years ago

Maybe it would be better to have the user pass an arbitrary location measure, and then for some learners, where we can, provide an se argument, e.g. for random forests.

larskotthoff commented 9 years ago

I'm fine with restricting the function to return 3 numbers (lower bound, location, upper bound). You're right that the bounds won't be meaningful in all cases, but then the user doesn't have to use this functionality (i.e. if the provided function returns only a single value, use that as the location without bounds).
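The convention Lars describes can be sketched as follows (function name and structure are illustrative, not the actual implementation):

```r
# `fun` summarises the predictions at one grid point and returns either a
# single location measure or three values (lower, location, upper).
summarizePoint = function(preds, fun) {
  out = unname(fun(preds))
  if (length(out) == 1L)
    c(location = out)                                  # no bounds requested
  else
    c(lower = out[1], location = out[2], upper = out[3])
}

x = rnorm(100)
summarizePoint(x, median)                                       # location only
summarizePoint(x, function(p) quantile(p, c(.025, .5, .975)))   # with bounds
```

If the provided function returns a single value, the plot would simply show the location without a ribbon for the bounds.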

zmjones commented 9 years ago

ok. can you think of an example where it would be informative?

berndbischl commented 9 years ago

Isn't it informative in your examples? If what you say is true, that the bounds are really loose, then the plot is kinda misleading?

berndbischl commented 9 years ago

Well, maybe not misleading.

zmjones commented 9 years ago

No, it is definitely not informative.

ex

berndbischl commented 9 years ago

hmm lol ok.

zmjones commented 9 years ago

That is from bh.task with the mg feature.

zmjones commented 9 years ago

And those are the .025 and .975 quantiles; the median is the black line.

berndbischl commented 9 years ago

OK, how about this: let the user just provide a function returning one value, a measure of location, for now. The rest we do when we see the need and understand what it means statistically.

Sorry, I had a bad idea I guess.

zmjones commented 9 years ago

Well, I agreed with it, so I can't fault you for that! This just didn't occur to me. I do think it is nice to have some sort of prediction variance estimation if possible. I know it is down the priority list, but I am going to at least implement the RF technique I know of, and there was a paper on this for SVMs that someone posted on my proposal.

berndbischl commented 9 years ago

Having predictive variance info (not just with respect to the plot here, but in general) is pretty high on my personal list, as a general option in the learner.

larskotthoff commented 9 years ago

Hang on, I don't really see the problem here -- are you saying that we should drop this feature because it doesn't work out of the box for a few examples? I don't think that's a valid reason; it's not saying that it will never be useful. Even in this case it may become more informative if you choose tighter bounds.

Is it difficult to make the implementation of the bounds feature work with everything else? That would be the only reason for me to drop it.