mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

gsoc-visualization #289

Closed zmjones closed 9 years ago

zmjones commented 9 years ago

In the near future (before May 5) I plan on refactoring plotROCRCurves to use plotROC, which uses ggplot2 instead of base graphics. This package offers some extra functionality (compared to what is available now), which I'll document. I also hope to get at least one other smallish feature done by then.

One option would be extending plotLearnerPrediction to cases with 3 features. I think the two obvious things to do here are to use one of the 3D plotting packages (I think plot3Drgl is nice) and to use faceting for the third feature. With a discrete feature this is easy, but it might be nice to add the ability to discretize a continuous feature as well. We could also plot 4 features by using the background color. In general it would be possible to layer on additional features this way, but it seems to have diminishing returns in terms of interpretability after 2 or 3 features.
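
Something like this is what I mean by the faceting option (a plain ggplot2 sketch on iris, not the real plotLearnerPrediction; the binned column name is made up):

```r
# Sketch: discretize a continuous third feature with cut() and facet
# the 2D scatter over its bins. plotLearnerPrediction would additionally
# draw the prediction surface per panel.
library(ggplot2)

iris2 = iris
# bin the third feature into three intervals (illustrative choice)
iris2$Petal.Width.bin = cut(iris2$Petal.Width, breaks = 3)

p = ggplot(iris2, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point() +
  facet_wrap(~ Petal.Width.bin)
print(p)
```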

Another thing I could possibly do is to add an option to performance that lets you apply a measure to a subset of the feature space. I find this very useful for exploring the fit of learners, especially with data that is structured in some way. I haven't looked at the code for performance yet, so I don't know how much work that would entail. One problem I can see is that if some of the cells of the grouping are small, the variance might be quite large. I am not sure whether that is out of the scope of the project. Is this something others would like to have?
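
To make the idea concrete, here is a base-R sketch of what I mean (not mlr's performance(); the grouping variable and measure are just illustrative):

```r
# Compute a measure within subsets of the feature space by splitting
# the prediction indices on a grouping variable.
fit = lm(mpg ~ wt + hp, data = mtcars)
pred = predict(fit, mtcars)
mse = function(truth, response) mean((truth - response)^2)

# per-group MSE, grouping on number of cylinders; small cells would
# give noisy estimates, which is the variance concern mentioned above
by_group = tapply(seq_len(nrow(mtcars)), mtcars$cyl,
  function(idx) mse(mtcars$mpg[idx], pred[idx]))
print(by_group)
```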

When I get back (around May 16-17) I would like to finish up any residual work from the above first. I'd like to talk to Julia/Lars/Bernd about what I do next. I've had my nose in the EDA related functionality lately and so my inclination is to start working on that first. Alternatively I could start work on producing interactive versions of the existing plotting functionality.

I have found some papers recently that I think are worth prioritizing above the minor things in my proposal (dependent data resampling methods and standard error estimates for random forests and other ensemble methods). In particular Hooker 2012 and Drummond and Holte 2006.

larskotthoff commented 9 years ago

In the absolute worst case, set up a Linux virtual machine with R and everything else in it.

mllg commented 9 years ago

@zmjones can you provide a traceback()?

berndbischl commented 9 years ago

Get on a hangout later today, we will show you how to find the error. This cannot be hard. But we have the useR tutorial now :)

jakob-r commented 9 years ago

@zmjones As @mllg mentioned: call traceback() directly after the error occurs. I think it might be something trivial, such as an issue in your namespace. Are you sure you haven't used your own mlr fork :wink:?

zmjones commented 9 years ago

Turns out it was my .Rprofile. Specifically, one of these:

options(parallelMap.default.cpus = 8,
        parallelMap.default.mode = "multicore",
        parallelMap.default.show.info = FALSE)

zmjones commented 9 years ago

@mllg and @jakob-r that is how I eventually remembered this. The traceback took me to doResampleIteration, I figured it was some problem with parallelMap, and then it occurred to me that I have an .Rprofile which sometimes ruins my life!

zmjones commented 9 years ago

@larskotthoff I actually have an Ubuntu VM that I tested this in late last night as well. Definitely a good sanity check.

schiffner commented 9 years ago

Glad to hear that it's working now!

schiffner commented 9 years ago

By the way, thank you very much for the tutorial section on partial dependence plots!!! Looks great.

zmjones commented 9 years ago

@schiffner Yep. Let me know if you want me to make any changes. I am going to work on the tutorial some more today.

schiffner commented 9 years ago

Thanks. I just read the partial dependence section again. I really like it and have nothing to add. I already did some proofreading this morning and changed only minor stuff. I also added the new section to the pages configuration in mkdocs.yml.

If you want to add a new section today please also put it into mkdocs.yml. That wasn't made very clear in the mlr-tutorial README. I reworked it a bit this morning to clarify what one needs to do to edit a section, add a section, what is done by Travis and so on. Please let me know if anything is missing.

What are your plans today for the tutorial, and in general? I found a reference to a section plot.md?

zmjones commented 9 years ago

Yea I missed that and was going to ask about it. Thanks for the clarification.

So Bernd asked me late last week to add a short plot page which referenced all the plots and talked briefly about the usage pattern. I think he might have wanted me to split out plotvsthresh, so I was going to look to see if that made sense. What do you think?

schiffner commented 9 years ago

I think the plot page is a good idea.

About the extra section on plotThreshVsPerf: I'm not sure. What are your/Bernd's plans for this section?

I will open up an issue in the mlr-tutorial repo regarding restructuring.

zmjones commented 9 years ago

Oh ok, didn't know that. I will wait on it then. Will check out the new issue. Thanks!

schiffner commented 9 years ago

Ok, I'm on it. When you are working on the tutorial today, just put everything in Advanced.

zmjones commented 9 years ago

Ok will do.

larskotthoff commented 9 years ago

What's our general opinion on where the visualization stuff should go? I tend towards including relevant plots on pages that describe the concepts rather than having a separate visualization page. Does anything not fit this model?

zmjones commented 9 years ago

I don't disagree with that, and made the same point to Bernd. He thought there should still be a short separate page. I was thinking it would be more like an index than a full page. This all depends on @schiffner's reorganization plans though.

schiffner commented 9 years ago

I don't know if there is a general opinion. I opened an issue about the tutorial structure here: https://github.com/mlr-org/mlr-tutorial/issues/6

I guess it depends on how complex the plots are. For example: We have plotLearnerPrediction on the Predict page, which is fine with me, but I wouldn't want to have partial dependence plots on this page, because it's more complicated. I'm not opposed to redundancy, i.e. mentioning plots in two places, if it helps the reader.

Concerning the page about the plots: I think that it can't hurt to make the generate-data / plot-data usage pattern clear and show one or two examples.

berndbischl commented 9 years ago

My opinion is this, and this is what I discussed with Zach:

There are a couple of different kinds of plots, and ways to talk about them in the tutorial.

a) Plots that are pretty general and stand on their own. They get their own tutorial page. Like partial dependence plots.

b) Either smaller stuff and less important plots, or plots that really make sense in connection with other mlr API stuff, like plotThreshVsPerf.

This is also our current structure and I don't want to (or better: Julia should not have to) change everything.

What I told Zach to do, so all plots can be found most easily, is to build one extra page, that simply names and references all plots in the tutorial.

berndbischl commented 9 years ago

On the plotVsThresh stuff: If Julia is already working on this and has some unpushed stuff, Zach should wait, but could iterate over it briefly later?

schiffner commented 9 years ago

@berndbischl: Thanks for the clarification.
Thanks to @zmjones there is now a section about mlr's plots in the tutorial: http://mlr-org.github.io/mlr-tutorial/devel/html/visualization/index.html
About the plotVsThresh section: Are you ok with the contents I listed above? What were your plans? Will try to finish/push this soon.

zmjones commented 9 years ago

I have been working on the generic permutation importance today.

I just made this another filter.

You would pass a learner, a task, a measure (only one I think would be best), a contrast function, an aggregation function, and the number of permutations to conduct. Do you all think that is too much control or too little?

Related to this is local feature importance, which I didn't really think fit here. This is just the contrast between permuted/unpermuted predictions by observation. For some EDA work I have done I've used a smoother to get an idea of where in the distribution of the target feature the feature was most important. I thought that was a somewhat neat insight and could be generic as well. Maybe generateLocalImportance could be a thing I do later.
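
For clarity, the core loop I have in mind is roughly this (a standalone sketch with base R, not the actual filter; the function name and the MSE contrast are just illustrative):

```r
# Permutation importance sketch: permute one feature, re-predict,
# contrast the measure against the unpermuted baseline, aggregate
# over permutations.
set.seed(1)
fit = lm(mpg ~ wt + hp, data = mtcars)
mse = function(y, yhat) mean((y - yhat)^2)
baseline = mse(mtcars$mpg, predict(fit, mtcars))

permImportance = function(feature, nperm = 10) {
  contrasts = replicate(nperm, {
    d = mtcars
    d[[feature]] = sample(d[[feature]])          # permute one feature
    mse(mtcars$mpg, predict(fit, d)) - baseline  # contrast function
  })
  mean(contrasts)                                # aggregation function
}

sapply(c("wt", "hp"), permImportance)
```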

@larskotthoff sorry for lagging behind so much on testing of the plots. :/

larskotthoff commented 9 years ago

@zmjones Sounds good.

zmjones commented 9 years ago

I talked to Hadley the other day on Twitter and I think that saving to SVG and then checking the XML is the only feasible idea for testing that ggplot objects draw correctly. I wonder if this shouldn't be a separate package though. I don't know enough about the SVG spec to do it yet but am reading now.

larskotthoff commented 9 years ago

I think it'll be fairly simple. The main point of the test would be to verify the presence and attributes of some elements, e.g. for 10 data points there should be 10 circles. You could approach this purely from an XML point of view (ignoring that it is also SVG), as the point is not to validate the produced SVG. I'm not really familiar with XML processing in R, but finding all elements of a certain type and counting them shouldn't be more than 1-2 lines of code.

Do you have something in mind you want to test that would require more complex operations?

zmjones commented 9 years ago

Yea that would be simple I agree. I was thinking more like verifying the positions of things. But now that I think about it the simple version would probably be enough to catch almost all of the things that might go wrong.

larskotthoff commented 9 years ago

I don't think we should even try to verify the position of things, at least not in absolute terms. They depend on the size of the canvas and a test on a small canvas may "look" fine, but break for larger sizes. Also, this is highly subjective. I would only attempt to verify correctness in the sense that all data that should be represented in the plot is represented and similar high-level things.

zmjones commented 9 years ago

Ok well that part of svg I understand well enough to start on.

mllg commented 9 years ago

Hi Zach,

I think you are mistaking assert for stopifnot. Here is an example from your code:

 checkmate::assert(interaction %in% obj$features)

This does not throw an informative error message. The assert function is used to combine multiple checkSomething() functions; see the apparently imperfect documentation.

Besides that: Nice code! Keep the good work up!

Michel

zmjones commented 9 years ago

Yeah, you are right, I confused those. I take it I should just use stopifnot in that case then?

berndbischl commented 9 years ago

No, we don't use stopifnot; we have assertChoice. That is what you want to do here, right? If a fitting assert* really does not exist, use an if plus a readable error message. But you really should not have to do this often.
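
To illustrate the difference (a small sketch; the values are made up):

```r
# assertChoice checks a scalar against a set of allowed values and
# fails with a message that names the allowed choices.
library(checkmate)

pt = "prob"
assertChoice(pt, c("response", "prob"))     # passes silently

# assert() by contrast combines several check*() alternatives and
# passes if any one of them succeeds.
x = 1:3
assert(checkNumeric(x), checkCharacter(x))  # passes: x is numeric

# assertChoice("se", c("response", "prob")) would stop with an
# informative error listing the permitted values.
```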

zmjones commented 9 years ago

I am in the process of fixing all of this but am not sure the best way to check this one.

In generateThreshVsPerfData.BenchmarkResult I want to check that the elements of getBMRPredictions all have predict.type = "prob".

Something like the following?

assert(all(sapply(extractSubList(obj, "predict.type"), function(x) assertChoice(x, "prob"))))

or assertSetEqual(extractSubList(obj, "predict.type"), rep("prob", length(obj)))

cc @mllg
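
A third option might be checkmate's assertSubset, which checks every element of a vector against an allowed set and produces a readable message on failure (sketch with a stand-in vector, since I don't have a BenchmarkResult handy here):

```r
# assertSubset(x, choices) asserts that all elements of x are
# contained in choices; a vectorized alternative to assertChoice.
library(checkmate)

pts = c("prob", "prob", "prob")  # stand-in for extractSubList(obj, "predict.type")
assertSubset(pts, "prob")        # passes; would stop if any element differed
```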

zmjones commented 9 years ago

I am a bit lost on the SVG testing. I read through the spec, but the SVG files generated by grDevices::svg (via ggsave) don't seem to fit with it. In particular there are lots of references to particular glyphs for which I cannot seem to find a definition. I'm also having trouble distinguishing (in the XML) which are plot elements we want to test and which are background elements (e.g. the plot grid).

berndbischl commented 9 years ago

Don't let this slow down your progress too much. If this is hard to do, test what you can with normal testthat (e.g. that the code at least runs). For the rest, look at the plots manually. I guess you have to do that anyway to a certain extent.

zmjones commented 9 years ago

Ok. I still agree with @larskotthoff that this would be nice to have, as there are instances when the plot draws but is not correct. Right now I test the generation functions well enough, but only that the plot functions generate a plot that does render.

berndbischl commented 9 years ago

I 100% agree that this would be very nice to have....

larskotthoff commented 9 years ago

Hmm, I'll have a look when I get a chance and try to come up with something.

larskotthoff commented 9 years ago

Ok, here's what I'm thinking of:

library(ggvis)
library(XML)

p = mtcars %>% ggvis(~wt, ~mpg) %>% layer_points()
export_svg(p, "/tmp/test.svg")

doc = xmlParse("/tmp/test.svg")
print(nrow(mtcars))
print(length(getNodeSet(doc, "//svg:g[@class='type-symbol']/svg:path", "svg")))

This is checking that there's a symbol (which are unfortunately generic path and not circle elements) for each row in the original data frame. "//svg:g[@class='type-symbol']/svg:path" is an XPath expression that selects all path elements directly underneath a g element that has a class of type-symbol (which is how vega designates the data layer). Figuring out the specific structure to check for each type of plot may be a bit of work first, but should be straightforward by just looking at the generated SVG in a text editor.

zmjones commented 9 years ago

Ok that is what I was trying to do. I am still not following how to do this. I haven't even tried it with ggvis yet, just ggplot2. I didn't see any obvious correspondence between the XML and the plot (in the ggplot2 svg file), though obviously there is one.

Can you point me to where in the Vega docs this is written?

larskotthoff commented 9 years ago

Right, the ggplot2 SVG looks much uglier. Looks like there's probably a correspondence between the glyphs and the elements to plot.

I didn't even have a look at the Vega docs (and I don't think that this is documented). Just have a look at the generated SVG -- if you load it in a browser and then right click -> "Inspect element" it will show you what part of the source corresponds to each element.

zmjones commented 9 years ago

Ah duh. I should have known to do that. I am not sure of the utility of testing a ggvis plot like that though, as most are embedded in Shiny apps. I looked at RSelenium which you pointed out to me, but (another apparent duh on my part) we'd need two R processes to do any testing of the app, since the R process running the app is locked. I can do that locally of course but don't know if that is possible on Travis, or how to trigger that automatically from within R whilst running the tests.

I think if I could find out what the glyphs refer to in the ggplot2 svg the xml solution might work.

One thing I haven't done though is to test the data element of the plot object (for ggplot2). I can't imagine a situation in which the plot prints without errors/warnings and no error is detectable by looking at this unless we happen across a bug in ggplot2 itself. Do you agree with this? What sort of bugs could the xml solution catch but wouldn't generate errors otherwise?

larskotthoff commented 9 years ago

The aim would be to catch incorrect settings for the plot parameters, e.g. fill/line colour being set to the wrong thing. This wouldn't generate any errors or warnings, but an incorrect plot. It hopefully won't be too hard to check if there are things with n different colours in the generated plot. Another thing would be labels. Essentially anything that the user can't change themselves and that would cause the plot to be broken in some way if it's missing.
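
For instance, counting distinct fill colours could look roughly like this (a sketch on a tiny hand-written SVG; the XPath details for real device output will differ):

```r
# Parse an SVG as XML and count distinct fill colours; a test could
# assert this equals the number of groups in the plotted data.
library(XML)

svg = '<svg xmlns="http://www.w3.org/2000/svg">
  <circle r="1" fill="#FF0000"/><circle r="1" fill="#FF0000"/>
  <circle r="1" fill="#0000FF"/></svg>'
doc = xmlParse(svg, asText = TRUE)
ns = c(svg = "http://www.w3.org/2000/svg")
fills = xpathSApply(doc, "//svg:circle", xmlGetAttr, "fill", namespaces = ns)
length(unique(fills))  # two distinct colours here
```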

For the actual testing I would get the plot part out of the shiny app and check that (if that's possible without too much hassle).

zmjones commented 9 years ago

Ok, makes sense. I can't seem to write SVG to file from the running app yet, but I think it is possible. I guess if I can do that then we can skip the whole RSelenium thing and just test that each set of inputs gives the output we want.

larskotthoff commented 9 years ago

Yes, I think that would be preferable. We can talk on hangout about this later today or tomorrow.

zmjones commented 9 years ago

Ok cool. I am feeling pretty terrible now so how about tomorrow?

larskotthoff commented 9 years ago

Sure, I'm pretty much free tomorrow. Just let me know on hangouts.

zmjones commented 9 years ago

Hey @larskotthoff I can do a hangout anytime now.

zmjones commented 9 years ago

I have tests done now for plotPartialPrediction. Is it ok if I put XML in suggests but not explicitly load it in the base context and instead reference it with "::"? Or should I load it in base and not use "::"?

berndbischl commented 9 years ago

1) First of all, always use :: if the package is not in Depends or Imports. Even then you might consider using ::.

2) We really don't want to depend on XML.

3) We cannot avoid XML in Suggests, right? Otherwise we don't pass R CMD check?