mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

gsoc-visualization #289

Closed zmjones closed 9 years ago

zmjones commented 9 years ago

In the near future (before May 5) I plan on refactoring plotROCRCurves to use plotROC which uses ggplot2 instead of base graphics. This package offers some extra functionality (compared to what is available now) which I'll document. I also hope to get at least one other smallish feature done by then.

One option would be extending plotLearnerPrediction to cases with 3 features. I think there are two obvious things to do here. One is to use one of the 3D plotting packages (I think plot3Drgl is nice). The other, which I'd definitely like to do, is to use facetting for the third feature. With a discrete feature this is easy but it might be nice to add the ability to discretize one of the features as well. We could also plot 4 features by using the background color. In general it would be possible to layer on additional features in this way but it seems to have diminishing returns in terms of interpretability after 2 or 3 features.
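As a rough illustration of the discretization idea, a continuous third feature can be cut into quantile-based bins and then used as a facetting variable. This is only a sketch: `discretizeFeature` and its `n.bins` argument are hypothetical names, not part of mlr, and iris stands in for a real task.

```r
# Discretize a continuous feature into quantile-based bins so that it can
# serve as a facetting variable. `discretizeFeature` is a hypothetical helper.
discretizeFeature = function(x, n.bins = 3) {
  # Quantile breaks give roughly equal-sized groups
  breaks = quantile(x, probs = seq(0, 1, length.out = n.bins + 1))
  cut(x, breaks = unique(breaks), include.lowest = TRUE)
}

d = iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Species")]
d$facet = discretizeFeature(d$Petal.Length, n.bins = 3)
print(table(d$facet))

# With ggplot2 installed, the binned feature could then be passed to facet_wrap:
# library(ggplot2)
# ggplot(d, aes(Sepal.Length, Sepal.Width, colour = Species)) +
#   geom_point() + facet_wrap(~ facet)
```

Quantile bins keep the panels similarly populated; equal-width bins would be an equally valid choice when the scale of the feature matters more than the group sizes.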

Another thing I could possibly do is to add an option to performance that lets you apply a measure to a subset of the feature space. I find this very useful for exploring the fit of learners, especially with data that is structured in some way. I haven't looked at the code for performance yet so I don't have an idea how much work that would entail. One problem I can see is that if some of the cells of the grouping are small the variance might be quite large. I am not sure whether that is out of the scope of the project. Is this something others would like to have?
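To make the per-subset idea concrete, here is a minimal base R sketch. It does not use mlr's performance function; `performanceBySubset`, `acc`, and the toy vectors are all hypothetical. Reporting the cell size next to each value is one simple way to flag the small-cell variance problem mentioned above.

```r
# Apply a measure within each cell of a grouping variable.
# `truth`, `response`, and `group` are hypothetical vectors of equal length.
performanceBySubset = function(truth, response, group, measure) {
  cells = split(seq_along(truth), group)
  data.frame(
    group = names(cells),
    n = vapply(cells, length, integer(1)),  # cell size, to judge reliability
    value = vapply(cells, function(i) measure(truth[i], response[i]), numeric(1)),
    row.names = NULL
  )
}

# Accuracy as an example measure
acc = function(truth, response) mean(truth == response)

truth    = c("a", "a", "b", "b", "a", "b")
response = c("a", "b", "b", "b", "a", "a")
group    = c("g1", "g1", "g1", "g2", "g2", "g2")
res = performanceBySubset(truth, response, group, acc)
print(res)
```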

When I get back (around May 16-17) I would like to finish up any residual work from the above first. I'd like to talk to Julia/Lars/Bernd about what I do next. I've had my nose in the EDA related functionality lately and so my inclination is to start working on that first. Alternatively I could start work on producing interactive versions of the existing plotting functionality.

I have found some papers recently that I think are worth prioritizing above the minor things in my proposal (dependent data resampling methods and standard error estimates for random forests and other ensemble methods). In particular Hooker 2012 and Drummond and Holte 2006.

larskotthoff commented 9 years ago

For the multi-dimensional case of plotting learner predictions it may be worth having a look at multi-dimensional scaling.
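A minimal sketch of what this could look like, using classical MDS from base R (`stats::cmdscale`) on the iris features. How the projected coordinates would be wired into plotLearnerPrediction is not settled in this thread, so the example only produces the 2D coordinates.

```r
# Classical multi-dimensional scaling: project the numeric features to 2D.
X = scale(iris[, 1:4])                 # standardize so no feature dominates
coords = cmdscale(dist(X), k = 2)      # n x 2 matrix of projected points
proj = data.frame(dim1 = coords[, 1], dim2 = coords[, 2], class = iris$Species)
print(head(proj))

# The projected points could then be drawn with class (or prediction) as colour:
# plot(proj$dim1, proj$dim2, col = proj$class)
```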

aydindemircioglu commented 9 years ago

is mds still "hip"? t-sne is nice, though probably a bit harsh in running times, unless one uses a strange approximation.. (see https://github.com/oreillymedia/t-SNE-tutorial for the basic stuff)

larskotthoff commented 9 years ago

Don't know, but if it's easy to integrate I don't see a reason not to try it :)

aydindemircioglu commented 9 years ago

at least the O(N^2) version is very easy to integrate, as there is a package for that: http://cran.r-project.org/web/packages/tsne/index.html. there is also the barnes-hut approximation: https://github.com/jkrijthe/Rtsne. applying it should be as easy as mds, so it could be worthwhile to plug it in and see if it gives any better results than mds.

zmjones commented 9 years ago

Ok I had not thought of that. I'll add that to my to-do/reading list.

zmjones commented 9 years ago

I opened #290 with the plotROC addition. I figured that since this is a small discrete feature it made sense to issue a PR instead of having you all look at it in my fork. Is that all right for any "nearly" completed feature?

I'm also curious how I should do it if I've written some code but would like feedback on it. Just a link to the file in the forked repository? I feel as though diffs are somewhat helpful though.

larskotthoff commented 9 years ago

Thanks Zach, the pull request should be fine for this. For feedback on code you could open an issue in your forked repository.

zmjones commented 9 years ago

I think #290 broke a few important things and should be reverted. It doesn't handle output from ROCR::performance where meas1 != "tpr" or meas2 != "fpr".

I think that using plotROC was the wrong choice. Instead I am working on a ggplot2 implementation of ROCR::plot. I should have at least run everything through the tutorial section before issuing the PR. Sorry!

larskotthoff commented 9 years ago

Ok, I've reverted it. Could you please add unit tests for the things that were broken to make sure that it doesn't happen again?

zmjones commented 9 years ago

Yes I will do that.

zmjones commented 9 years ago

The plotROC thing was a bust. I misread part of their documentation which said they could handle input from ROCR::performance when it could only handle the case when meas1 = "tpr" and meas2 = "fpr". I have since begun working on a ggplot2 implementation of ROCR::plot. My progress is here. It works but has some kinks that need to be fixed, tests written, docs added, etc. The functionality is a subset of what is available in ROCR::plot though. I haven't implemented downsampling, or the intervals they give for averaged curves. I did add the ability to do threshold/vertical/horizontal averaging though, and verified it worked on the code in the tutorial. What are your recommendations for how to continue with this? I am headed to Seattle now for my climbing trip (will be in Seattle a bit before I really am out of contact) and so will have only a little time to work on this, but I can do so.

I was wondering about plotLearnerPrediction. Right now it fits the specified learner with k < 3 predictors. Would it make more sense to instead be displaying partial dependence? I can see that the current behavior is useful for pedagogical reasons, but less so for practical data analysis.

larskotthoff commented 9 years ago

Given that you want to use ggvis at some point, does it make sense to implement this from scratch with ggplot2? It sounds like finishing the ggplot2 implementation would be quite a bit more work, so it may make sense to go straight to ggvis and use this as a test case to figure out how to implement things with ggvis in general.

Regarding plotLearnerPrediction, what do you mean by "partial dependence"?

zmjones commented 9 years ago

I am not sure. I don't think it would take me that long. It took me a bit longer to do this first bit because of finals. You are probably right about making ggvis more of the focus though. I will iron out the last few bugs, write a more extensive set of tests, and then update the tutorial. In this case do you think it would make sense to allow the use of ROCR::plot optionally?

I just mean training the learner using all of the features and then taking the expectation of the prediction function with respect to the features not in the plot. So to estimate that you set one feature to a set of values, create a synthetic dataset where that feature is set to each value, predict, and then average over the synthetic data to get a prediction for each value of the feature. A better description is in ESL on pg. 369. I talked with Bernd about it a bit and it seemed like it would be a nice thing to have in mlr since it works with any supervised learner.
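The procedure described here can be sketched in a few lines of base R. This is an illustration only, not mlr code: `partialDependence` and its arguments are hypothetical names, and a linear model on mtcars stands in for an arbitrary supervised learner.

```r
# Partial dependence of a fitted model on one feature.
# `predict.fun` is a hypothetical hook; any function taking (model, newdata)
# and returning numeric predictions would do.
partialDependence = function(model, data, feature, grid, predict.fun = predict) {
  vapply(grid, function(v) {
    synthetic = data
    synthetic[[feature]] = v                       # fix the feature at one grid value
    mean(predict.fun(model, newdata = synthetic))  # average over the other features
  }, numeric(1))
}

fit = lm(mpg ~ wt + hp + disp, data = mtcars)      # trained on all features
grid = seq(min(mtcars$wt), max(mtcars$wt), length.out = 5)
pd = partialDependence(fit, mtcars, "wt", grid)
print(data.frame(wt = grid, pd = pd))
```

For a purely additive linear model the resulting curve is a straight line whose slope is the coefficient of the feature, which makes this a convenient sanity check before trying the same function on a non-linear learner.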

larskotthoff commented 9 years ago

Ok, if you think finishing the ggplot2 implementation won't be too much work, then go ahead. I'm not sure how much sense it would make to use ROCR::plot -- would you make it available through the same interface? What if somebody tries to use it for things it doesn't support?

Plotting predictions for fixed values of some features sounds interesting, but I would suggest that you first think about the general structure of plotLearnerPrediction and how this feature would integrate with the overall architecture. There needs to be a consistent way of specifying what to plot with regards to subsets of features/feature values. In particular, do all combinations make sense (at least potentially), i.e. would you want to specify the features and feature values to show (or the methods for getting features and feature values to show) as separate arguments, or would you want to specify a method for plotting (e.g. projection into 2 dimensional space, showing only the first two features, ...) and then any arguments to that?

berndbischl commented 9 years ago

Hi

1) Regarding the plotROCCurves in ggplot2. I like that and would like to see it finished. But @studerus started this here https://github.com/berndbischl/mlr/blob/master/todo-files/getROCCoords.R This is much more flexible for users who would like to get at the data behind the plot. What do you think? Should this be used so we are then also completely independent of any ROCR code?

2) The partial dependence plot would be nice to have next. I just wonder where it should be included. And whether we should make plotLearnerPrediction a "monster". Note that this method also trains the model right now, which we don't want for the pD-plot. You would need to pass a model. Maybe it is best to have a clearly named, new method?

larskotthoff commented 9 years ago

Zach, I've just had a call with Bernd about some fundamental/architectural issues. It would be good if we could have a call with you about this at some point -- when would you be available?

We've also discussed the ggvis vs. ggplot issue and it would be really helpful if you could have a look at the existing plotting functions (plot*) that use ggplot and see to what extent it would be possible to implement the same thing with ggvis.

schiffner commented 9 years ago

About the ROC curves:

In this case do you think it would make sense to allow the use of ROCR::plot optionally?

I'm not sure, do you refer to the possibility to use asROCRPrediction -> ROCR::performance -> ROCR::plot (way 2 in the tutorial)? I personally would keep this, as long as it is not a burden to maintain asROCRPrediction.

When we have the call and the fundamental stuff is clear, we should also make a list of what "nice-to-have features" plotROCRCurves should support. (For example isometrics might be nice or drawing points in ROC space for non-probabilistic classifiers.)

zmjones commented 9 years ago

Hello @berndbischl @larskotthoff and @schiffner. I am in a hostel in Seattle right now and am not sure the wifi is good enough for a video call but I could talk now if you like.

@berndbischl I looked at the implementation you linked to and it has the same issue as plotROC (the addition I made and then had to ask to have reverted). It looks as though it only handles the special case in which meas1 = "tpr" and meas2 = "fpr". I think it makes a lot of sense to use ROCR::performance, transform the output of that into a uniformly formatted dataframe (which I've done), and then output either the plot or the data. Of course since the ggplot objects have the data in them this is already complete, though we could make it an explicit option.

larskotthoff commented 9 years ago

@zmjones After you're back from your trip is fine as well -- I think now is a bit too late for the people in Germany. If you're planning on making it over to Vancouver I could provide you with sufficient bandwidth for a call :)

zmjones commented 9 years ago

oh are you in Vancouver? I didn't realize that! We are actually going to be in BC in a few days, but up in the mountains. We wanted to come to Vancouver but couldn't fit in the schedule. I must say the weather here is pretty awesome so far. Running on ~3 hours of sleep but it is very nice so far.

larskotthoff commented 9 years ago

Yes, I'm in Vancouver so if you do happen to make it over here, give me a shout.

berndbischl commented 9 years ago

@zmjones OK, I get the point about using ROCR to calculate the performances / coords. So we do it in 2 steps, a) get the coords from code from ROCR b) plot it with ggplot2

I guess then your code / approach is fine. I would only separate the two steps into 2 functions, at least as a possibility. So make the generation of the data.frame an exported helper function. And then call this in the true plotting function. Does that make sense?
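One way the two-step split could look is sketched below. This is not the code from the pull requests in this thread: `perfToDataFrame` and `plotPerf` are hypothetical names, and the example assumes the ROCR package (and, for the plotting step only, ggplot2) is installed. It relies on the `x.values`, `y.values`, `alpha.values`, `x.name`, and `y.name` slots of a ROCR performance object.

```r
library(ROCR)

# Step a) hypothetical exported helper: flatten a ROCR performance object
# (two measures) into one data.frame, whatever the two measures are.
perfToDataFrame = function(perf) {
  runs = lapply(seq_along(perf@x.values), function(i) {
    data.frame(
      run = i,                            # index of the resampling run
      x = perf@x.values[[i]],
      y = perf@y.values[[i]],
      cutoff = perf@alpha.values[[i]]
    )
  })
  out = do.call(rbind, runs)
  attr(out, "x.name") = perf@x.name       # keep measure names for axis labels
  attr(out, "y.name") = perf@y.name
  out
}

# Step b) hypothetical plotting function that only consumes the helper's output
plotPerf = function(df) {
  ggplot2::ggplot(df, ggplot2::aes(x = x, y = y, group = run)) +
    ggplot2::geom_path() +
    ggplot2::labs(x = attr(df, "x.name"), y = attr(df, "y.name"))
}

data(ROCR.simple)
pred = prediction(ROCR.simple$predictions, ROCR.simple$labels)
df = perfToDataFrame(performance(pred, "prec", "rec"))
print(head(df))
# plotPerf(df)  # requires ggplot2
```

Because the helper never assumes "tpr" and "fpr", the same path works for precision/recall or any other measure pair, and users who only want the coordinates can stop after step a).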

zmjones commented 9 years ago

I am in the process of traveling home as of tomorrow morning. Not sure if I'll make it back then or on Saturday (flying stand-by). Maybe we could check in via Google Hangouts on Saturday? I was planning on refactoring, testing, and then documenting/changing the tutorial for plotROCRCurves first.

larskotthoff commented 9 years ago

Saturday works for me, not sure about the folks in Europe.

berndbischl commented 9 years ago

I would join (maybe) briefly as my gf visits that day. Please plan without me but tell me the time so I can join briefly if possible.

larskotthoff commented 9 years ago

Well, I would prefer the afternoon, which would be too late for @berndbischl -- do we need to discuss anything this urgently (i.e. before next week)? I guess it's clear what's to do for plotROCRCurves?

schiffner commented 9 years ago

If there is nothing urgent I would prefer next week. (I will not be at home much this weekend. If you want to do a hangout, just plan without me, tell me the time and I will try to join.)

zmjones commented 9 years ago

Ok how about Monday then? I can get up early or stay up late as needed. I expect I can finish up that work on plotROCRCurves the day I get back so we can just talk about other things.

larskotthoff commented 9 years ago

Monday is good for me.

schiffner commented 9 years ago

Monday is ok for me, too.

zmjones commented 9 years ago

Any particular time preferences?

schiffner commented 9 years ago

Any time is ok for me.

larskotthoff commented 9 years ago

How about 10am PST/7pm German time?

zmjones commented 9 years ago

that is fine with me!

schiffner commented 9 years ago

Fine with me, too.

zmjones commented 9 years ago

Opened #307 and #308 for plotROCRCurves changes.

berndbischl commented 9 years ago

Sorry I was cooking indian food for too long and forgot :(

larskotthoff commented 9 years ago

No problem -- the quick summary is that Zach has finished the implementation of plotROCRCurves for now (and it's merged into master) and will look at converting some of the simpler visualizations to ggvis now.

berndbischl commented 9 years ago

ok that's perfect and what I wanted too

zmjones commented 9 years ago

So I've rewritten a few of the plot functions to use ggvis now. There are a few issues I thought it would be worth asking about.

There are not currently easy ways to do some basic stuff (e.g., geom_vline, titles), but you can work around these without too much trouble (so far anyhow).

facetting is not implemented and nothing similar (e.g., embedded plots) is currently available. apparently the latter will be at some point. That would limit (somewhat) the functionality currently in plotLearningCurve and potentially, some of the things I'd like to add to plotLearnerPrediction. I don't think that is a deal-breaker necessarily though.

ggvis makes heavy use of non-standard evaluation which gives notes (uninitialized global variables) with check. there are ways to hack around it (e.g. utils::globalVariables or initializing them to NULL), but nothing as nice as aes_string that I am aware of. I am still reading docs and so on, so I may be wrong about this.

edit: i think this can be dealt with now. things are a bit more verbose but it is a clean solution.

some of the interactive functionality is a bit strange (probably because i am not familiar with vega). you cannot, for example, attach tooltips to a line/path (e.g., for a roc curve, print the cutoff for any part of the line). what instead you have to do is insert points and attach tooltips to them (i hacked around this by adding points and making them totally transparent).

zmjones commented 9 years ago

I also have a Git/Travis question. To get my fork to be built by Travis I had to modify .travis.yml (replacing the core devs' emails with mine). From what I can tell to get a particular branch to build my email has to be in there (e.g., I can't just do it for master and get other branches to build). So when I go to issue a PR this change is in there. Is there an easy way for the merger to discard this commit? Another way to avoid this problem?

larskotthoff commented 9 years ago

I'm assuming that you're generating SVGs for the interactive things with ggvis. What you're trying to do isn't possible in SVG (at least not without significant additional hackery), so it's not a limitation of Vega as such. SVG operates on a DOM (very similar to HTML) and a line would correspond to a single element, so it can have only a single "action" associated with it. What you've done with the transparent points sounds ok to me; an alternative would be to break the line into separate individual lines (that connect and have the same colour so you can't actually see that they're separate) which then could have separate tooltips etc.
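The segment-splitting alternative amounts to a small data transformation, sketched here in base R. `pathToSegments` and the toy ROC coordinates are hypothetical; how the resulting rows are handed to ggvis is left open.

```r
# Turn a path of n points into n - 1 segments, each carrying its own label,
# so every segment can be drawn (and given a tooltip) as a separate element.
pathToSegments = function(x, y, label) {
  n = length(x)
  data.frame(
    x = x[-n], y = y[-n],          # segment start
    xend = x[-1], yend = y[-1],    # segment end = next point on the path
    label = label[-n]              # label taken from the segment's start point
  )
}

# Hypothetical ROC coordinates with their cutoffs
fpr = c(0, 0.1, 0.4, 1)
tpr = c(0, 0.6, 0.9, 1)
cutoff = c(1, 0.8, 0.5, 0)
segs = pathToSegments(fpr, tpr, cutoff)
print(segs)
```

Each row shares its end point with the next row's start point, so drawing all rows in the same colour reproduces the original curve.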

What is your general impression of Vega? Did you try generating PDFs or embedding in knitr documents?

Regarding Travis: Unfortunately there's no "clean" way to do this that I can see because Travis is quite inflexible when it comes to configuration. You could have a special "travis" branch in your repo that is forked from master (or whatever you're developing in) and has only the change of email address commit. Then you can rebase this branch after you've committed things and tell Travis to build only this one. That does involve the additional step of having to rebase every time you make a change you want to be checked, but maybe you could script this...

zmjones commented 9 years ago

Ok on SVG. The line segments might be better, I'll try that out.

I like the whole system pretty well so far, especially considering how new it is. I haven't used vega directly yet but plan on going through their tutorial soon. I will make sure everything works with knitr and the tutorial before I push anything. That will probably require some rewriting (e.g., plotLearningCurves).

I've now rewritten all of the plot methods (i could find anyhow) except for plotLearnerPrediction, which I'm about to start on.

The most substantial changes so far are to plotLearningCurve since I can't do the facetting. I am toying now with just mapping one or the other to another aesthetic, but I think it makes the graph too hard to read pretty quickly (as the number of learners or measures increases). Maybe I could do this mapping but warn or error when there are too many? Alternatively I could just require either multiple learners or multiple measures but not both. I assume this will only be a problem until embedded plots are available.

Makes sense on Travis. Although I was a bit clumsy in doing it what you suggested was what I was already trying to do. Not ideal though, and perhaps worth scripting.

When I am done with this are we going to merge all this into a ggvis branch in this repo? Or will it sit in my fork for a while?

larskotthoff commented 9 years ago

Do you think the facetting is something which could be done interactively? I.e. the user chooses what to show. Is that something that would be (easily) possible with ggvis? Also, what effort would you estimate for reimplementing what you've done so far (including background reading etc)?

I'm happy to merge into a ggvis branch in this repo, but I guess it doesn't really matter where it lives as long as it doesn't go into master. It may make more sense to keep it in your copy of the repo for now to avoid you having to prepare lots of pull requests.

zmjones commented 9 years ago

Ah that is a good idea. I guess I am still thinking statically! I think it is possible but we'll have to explicitly import shiny, and I will need to write a small shiny app to do it. That shouldn't be hard I don't think (1 day probably).

I would say the effort required is moderate, but I suspect that will go down. They have some nice vignettes but it is very much geared towards interactive use and it is still really new (so not many stackoverflow answers). also the plots we have are not very complicated.

zmjones commented 9 years ago

so i just finished a basic shiny app for plotLearningCurve that lets you map learners or measures to color, and has a sidebar that allows you to pick the other (learners or measures) interactively. here is a gist with the function. i haven't pushed it to my fork yet. if you all like this, then i'll go back and add this sort of thing to all of the functions. you can also output a static ggvis plot, from which you can extract the raw data.

zmjones commented 9 years ago

and here is the only code i've tested it with. but should give an idea of how it works.

library(mlr)
library(shiny)
r = generateLearningCurve(list("classif.rpart", "classif.knn"),
  task = sonar.task, percs = seq(0.2, 1, by = 0.2),
  measures = list(tpr, fpr, fn, fp),
  resampling = makeResampleDesc(method = "Subsample", iters = 5),
  show.info = FALSE)
plotLearningCurve(r, interactive = TRUE)

larskotthoff commented 9 years ago

Where does generateLearningCurve() come from? It might be good (especially for the interactive stuff) if you could set up some way of producing complete examples (e.g. by including example code and output with the gist).

zmjones commented 9 years ago

generateLearningCurve is unmodified; it is the version in mlr, in generateLearningCurve.R (which is also where plotLearningCurve is). I updated the gist with the little bit of example code (which is just from the example section of the docs for generateLearningCurve). Is that what you meant? I don't think I can include interactive output (other than screenshots) without having a server to use.

larskotthoff commented 9 years ago

Ah, apologies -- I was loading the wrong mlr version...

The example looks really nice. Is there any way to make the legend clickable so that you can select which lines to show?

For plotting multiple learners/measures statically, do you think it would make sense to simply generate all possible plots? I have no feeling for how fast ggvis is with PDF output, so I guess this may become a performance issue?

For the interactive stuff, something like http://bl.ocks.org/ would be really cool -- that may work out of the box already if you can generate the necessary files.