mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

gsoc-visualization #289

Closed zmjones closed 8 years ago

zmjones commented 9 years ago

In the near future (before May 5) I plan on refactoring plotROCRCurves to use plotROC which uses ggplot2 instead of base graphics. This package offers some extra functionality (compared to what is available now) which I'll document. I also hope to get at least one other smallish feature done by then.

One option would be extending plotLearnerPrediction to cases with 3 features. I think there are two obvious things to do here. The first is to use one of the 3D plotting packages (I think plot3Drgl is nice). The second, which I'd definitely like to do, is to use facetting for the third feature. With a discrete feature this is easy, but it might be nice to add the ability to discretize one of the features as well. We could also plot 4 features by using the background color. In general it would be possible to layer on additional features in this way, but it seems to have diminishing returns in terms of interpretability after 2 or 3 features.
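A minimal sketch of the discretization step, assuming quantile-based bins are acceptable. `discretizeFeature` is a hypothetical helper for illustration, not an existing mlr function; the resulting factor could then be used as the facetting variable.

```r
# Hypothetical helper (not part of mlr): bin a continuous feature so that it
# can serve as a facetting variable.
discretizeFeature = function(x, n.bins = 3L) {
  # Quantile-based breaks give panels with roughly equal numbers of points.
  probs = seq(0, 1, length.out = n.bins + 1L)
  breaks = unique(quantile(x, probs = probs, names = FALSE))
  cut(x, breaks = breaks, include.lowest = TRUE)
}

facet.var = discretizeFeature(iris$Petal.Width, n.bins = 3L)
table(facet.var)
```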

Another thing I could possibly do is to add an option to performance that lets you apply a measure to a subset of the feature space. I find this very useful for exploring the fit of learners, especially with data that is structured in some way. I haven't looked at the code for performance yet, so I don't have an idea of how much work that would entail. One problem I can see is that if some of the cells of the grouping are small, the variance might be quite large. I am not sure whether that is out of the scope of the project. Is this something others would like to have?
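To make the idea concrete, here is a rough sketch in base R that does not use mlr's performance at all. `performanceBySubset` and the toy `mmce` are hypothetical names chosen for illustration. The cell size `n` is reported next to each value so that small, high-variance cells are easy to spot.

```r
# Hypothetical helper: apply a measure within each level of a grouping variable.
performanceBySubset = function(truth, response, group, measure) {
  idx = split(seq_along(truth), group)
  data.frame(
    group = names(idx),
    n = vapply(idx, length, integer(1L)),
    value = vapply(idx, function(i) measure(truth[i], response[i]), numeric(1L)),
    row.names = NULL
  )
}

# Toy measure: mean misclassification error.
mmce = function(truth, response) mean(truth != response)

truth    = factor(c("a", "a", "b", "b", "a", "b"))
response = factor(c("a", "b", "b", "b", "a", "b"))
group    = c("g1", "g1", "g1", "g2", "g2", "g2")
res = performanceBySubset(truth, response, group, mmce)
print(res)
```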

When I get back (around May 16-17) I would like to finish up any residual work from the above first. I'd like to talk to Julia/Lars/Bernd about what I do next. I've had my nose in the EDA related functionality lately and so my inclination is to start working on that first. Alternatively I could start work on producing interactive versions of the existing plotting functionality.

I have found some papers recently that I think are worth prioritizing above the minor things in my proposal (dependent data resampling methods and standard error estimates for random forests and other ensemble methods). In particular Hooker 2012 and Drummond and Holte 2006.

zmjones commented 9 years ago

Ah ok.

I am not sure about the legend. I'll start looking into that.

Yes I think producing all the plots would be possible. Not sure about the performance but I guess I'll find out!

So the big problem with the last suggestion is that D3 is all done client side. Shiny is a dynamic server-side web application, so we'd need a server running R to do that. RStudio of course has such a thing ready to go. Maybe using the free tier would be good for showing you all what I am doing? For users I think running things locally is pretty solid. I think I would use this when doing research.

zmjones commented 9 years ago

Looking at the docs I don't think it is currently possible to interact with the legend. However I think I could add more controls; e.g. have two drop down menus for learners and measures respectively. If you pick more than one learner, you can only pick one measure, and vice versa. Do you think that would be better?
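The selection rule could be checked with something like the sketch below. `checkSelection` is a hypothetical helper, independent of how ggvis or Shiny would actually wire up the menus.

```r
# Hypothetical check: several learners or several measures, but not both.
checkSelection = function(learners, measures) {
  if (length(learners) < 1L || length(measures) < 1L)
    stop("Select at least one learner and one measure.")
  if (length(learners) > 1L && length(measures) > 1L)
    stop("Select several learners or several measures, not both.")
  invisible(TRUE)
}

checkSelection(c("rpart", "lda"), "mmce")
```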

larskotthoff commented 9 years ago

What I mean is that you could put the generated HTML/JS files in a gist and then put that through bl.ocks. If you have a machine that you can run a server on, even better!

If doing more controls is easy, I would just try it and then see how we like it. That could also be an option to the function (i.e. provide a list of things you want controls for).

zmjones commented 9 years ago

That won't work for this though. To get the data reactivity the application has to be connected to a running R session. Shiny doesn't generate static JavaScript and HTML that do what shows up in the browser; that is generated dynamically by a server running R. Shiny is like Flask or Django in that sense. Maybe I could write something like a 'walker' (like frozen-flask) that would make all the possible requests, save the returned pages, and then display them statically. That would be a lot of work I think. I doubt I could do that in a timely manner. Doing it all in JavaScript directly would also be possible of course, but the startup costs of that would be high (for me anyhow).

Sounds good on the controls. I'll see how general I can make it. Perhaps then I can abstract the Shiny app into its own separate function that will work for all the visualization tasks we have.

zmjones commented 9 years ago

Related to this discussion. Many of the methods which generate the data used by the plotting functions could be expanded to take advantage of this interactivity. For example getFilterValues could take a list of methods, which could then be plotted in the same way as the example above.
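As a rough illustration of what "a list of methods" could produce, the sketch below stacks the scores of several filters in long format, one row per method and feature. The two filters and `filterValuesLong` are hypothetical stand-ins, not the mlr implementation.

```r
# Hypothetical stand-ins for filter methods; each returns one score per feature.
filters = list(
  variance = function(x, y) vapply(x, var, numeric(1L)),
  abs.cor  = function(x, y) vapply(x, function(col) abs(cor(col, y)), numeric(1L))
)

# Apply several methods and stack the results, one row per (method, feature).
filterValuesLong = function(x, y, methods) {
  res = lapply(methods, function(m) {
    scores = filters[[m]](x, y)
    data.frame(method = m, feature = names(scores), value = unname(scores))
  })
  do.call(rbind, res)
}

fv = filterValuesLong(iris[, 1:3], iris$Petal.Width, c("variance", "abs.cor"))
print(fv)
```

A long data.frame like this can be facetted or filtered by `method` without changing the plotting code when a method is added.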

larskotthoff commented 9 years ago

Hmm, this almost sounds like you'd want to extend/change the data structure used to communicate between the generation and plotting functions. Otherwise the parameters to one would depend on the parameters of the other and make things quite brittle.

Regarding the "static" interactive visualisations, I think it's worth investigating this a little bit further to see what could be done. It will be very useful to be able to do something like that, especially for the tutorial and when generating HTML reports (e.g. with knitr). I'm quite familiar with Javascript and would be happy to help.

zmjones commented 9 years ago

Ok a bit confused. Could you explain more what you mean in the first paragraph?

In the second, you are saying we should consider doing something other than ggvis that would produce something that does not require server side computation and is still interactive? Is it your feeling that this is a dealbreaker for ggvis/shiny?

larskotthoff commented 9 years ago

What I mean is that it may make sense in some cases to have more information come from the data generation functions to avoid duplication in the plotting functions. For example, to specify which things you want controls for, you need to know which things are available. You could give an explicit list to the plotting function, but that may break if the data changes and one of the things isn't available. I'm not sure to what extent it will be possible to avoid duplication here and still keep the data generation and actual plotting completely separate, but it's something to think about.
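One possible shape for this, sketched under the assumption that the generated object is a plain list: the generation step records which variables can get controls, and the plotting step asks the object rather than hard-coding names. `makePlotData` and `resolveControls` are hypothetical names.

```r
# Hypothetical: the generation step records which variables can get controls.
makePlotData = function(data, controllable) {
  stopifnot(all(controllable %in% names(data)))
  list(data = data, controllable = controllable)
}

# The plotting side validates requested controls against what the object offers.
resolveControls = function(obj, requested = NULL) {
  if (is.null(requested))
    return(obj$controllable)
  unavailable = setdiff(requested, obj$controllable)
  if (length(unavailable) > 0L)
    stop("No control available for: ", paste(unavailable, collapse = ", "))
  requested
}

obj = makePlotData(
  data.frame(learner = c("rpart", "lda"), measure = "mmce", value = c(0.2, 0.1)),
  controllable = c("learner", "measure")
)
resolveControls(obj, "learner")
```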

And yes, I'd love to see something that doesn't require a server and is still interactive to some extent (along the lines of ViperCharts, which we already have). It's not a deal breaker for me, but I think it would be very nice to have.

berndbischl commented 9 years ago

shall i go on hangout in 30 min to give you my 2 cents?

larskotthoff commented 9 years ago

The earliest I can do is tomorrow morning, but feel free to meet without me.

berndbischl commented 9 years ago

ok zach tell me if you have only a few minutes

zmjones commented 9 years ago

sorry was having a meeting with my advisor. i could do most anytime tomorrow though

berndbischl commented 9 years ago

ok, tomorrow after 17 CET then

berndbischl commented 9 years ago

Please tell me a time so I can plan a bit

zmjones commented 9 years ago

oh sorry i took that as the default time. i am at my office working and can do anytime

schiffner commented 9 years ago

I would like to hear your 2 cents, Bernd, so please include me.

zmjones commented 9 years ago

So I am in the process of separating computing from the plotting, so we can have separate ggplot/ggvis functions. So far I've been giving the output of the generating function an S3 class consisting of data.frame plus the name of the existing plotting function minus "plot"; e.g., the new function generateROCRCurves outputs a data.frame with class "ROCRCurves", which is passed to either of the plotting functions. I've added attributes that might be relevant from the call to the generation function. Does that sound alright?
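A minimal sketch of that structure, assuming toy data. `makeROCRCurves` and `checkROCRCurves` are hypothetical helpers that only mimic the shape of the output; they are not the actual generateROCRCurves.

```r
# Hypothetical stand-in for the generation step: a data.frame with an extra
# S3 class and an attribute recording part of the call.
makeROCRCurves = function(df, measures) {
  stopifnot(is.data.frame(df))
  attr(df, "measures") = measures
  class(df) = c("ROCRCurves", "data.frame")
  df
}

# A plotting function can check its input before doing any work.
checkROCRCurves = function(obj) {
  if (!inherits(obj, "ROCRCurves"))
    stop("Expected the output of the generation function.")
  invisible(obj)
}

curves = makeROCRCurves(
  data.frame(fpr = c(0, 0.5, 1), tpr = c(0, 0.8, 1)),
  measures = c("fpr", "tpr")
)
checkROCRCurves(curves)
```

Because "data.frame" stays in the class vector, the object still works with ordinary data.frame code.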

larskotthoff commented 9 years ago

What's the purpose of the class that determines the type of plot? Do you envision having an almighty plot function that takes one of these data frames and then dispatches accordingly, or is the main purpose error checking and the like?

zmjones commented 9 years ago

the latter. inside the ggplot version of plotROCRCurves for example i am using some information that you wouldn't want in the output data.frame necessarily (raw input parameters). i was still planning on having two plotting functions (not methods).

larskotthoff commented 9 years ago

Ok, sounds reasonable.

berndbischl commented 9 years ago

Sounds also good to me. So it's basically a data.frame with some extra info.

One note though: I usually do not use attributes that much. Maybe it is a question of style. But if you see that you need more than the df, maybe create a list and have the df in there cleanly as an element instead of burying large extra objects in attributes.
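A sketch of the list-based alternative, with hypothetical names (`makeROCRCurvesData`, `task.id`) chosen only for illustration:

```r
# Hypothetical constructor: the data.frame is a plain element of a classed list,
# and the extra information sits next to it instead of in attributes.
makeROCRCurvesData = function(df, measures, task.id) {
  structure(
    list(data = df, measures = measures, task.id = task.id),
    class = "ROCRCurvesData"
  )
}

obj = makeROCRCurvesData(
  data.frame(fpr = c(0, 0.5, 1), tpr = c(0, 0.8, 1)),
  measures = c("fpr", "tpr"),
  task.id = "example-task"
)
str(obj$data)
```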

zmjones commented 9 years ago

Ah ok. Most of the attributes are small, but if that is the preferred style that is not a problem. Should I write a print method for the object as well?

larskotthoff commented 9 years ago

Yes please.

berndbischl commented 9 years ago

printer: Yes, sounds good to summarize the info. Most important: Document the structure, so a reasonably intelligent person understands what kind of info is where in there.

zmjones commented 9 years ago

Ok will do. I'll be done with the separation of plotROCRCurves shortly and will push that today.

zmjones commented 9 years ago

I added PRs for the tutorial and code. This separates out the plotting and computing. I have a print method for the object output from generateLearningCurve (which just prints the first 5 rows of the data.frame). I ran all the tests, and ran check locally, as well as visually inspecting the roc page of the tutorial.
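A print method along those lines might look like the sketch below. The class name `LearningCurveData` and the list layout with a `$data` element are assumptions for illustration; the object in the PR may be structured differently.

```r
# Hypothetical class and layout: a list holding the data.frame in `$data`.
print.LearningCurveData = function(x, ...) {
  cat("LearningCurveData:\n")
  cat("Measures:", paste(x$measures, collapse = ", "), "\n")
  # Show only the first 5 rows of the underlying data.frame.
  print(head(x$data, 5L))
  invisible(x)
}

lc = structure(
  list(
    data = data.frame(
      percentage = seq(0.1, 1, by = 0.1),
      mmce = seq(0.5, 0.05, length.out = 10)
    ),
    measures = "mmce"
  ),
  class = "LearningCurveData"
)
print(lc)
```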

Sorry for not having set up travis to work with my fork yet. It is a little complicated; I will work on that tomorrow.

zmjones commented 9 years ago

I created a wiki page for the project, which describes the ggvis investigation.

larskotthoff commented 9 years ago

Thanks, that's very detailed. Could you add some brief notes on what you found with respect to how well it works with knitr?

zmjones commented 9 years ago

Yes I will do that before I head home today. I was waiting on the PRs so that I can create a new branch based off of that where I start integrating ggvis alongside rather than in place of ggplot2. That way I can add to the tutorial (it is totally broken in my ggvis branch where i was replacing ggplot2). I can work around that though if I need to wait on the PRs.

larskotthoff commented 9 years ago

I've merged both of them, so you should be good to go!

zmjones commented 9 years ago

Ok cool, working on that now.

Would you rather me have separate commits for each ggvis function? Also is the naming scheme alright? E.g. plotROCRCurves_ggvis. My plan now is to integrate all of the ones I've written, add the interactive functionality that makes sense with the data generation functions as they are, commit this, and then go back and expand the generation functions and add a bit more interaction (nothing crazy though). Then after that I'd move on to plotLearnerPrediction or something else.

berndbischl commented 9 years ago

> Would you rather me have separate commits for each ggvis function?

Don't care that much. But it is probably good the way you suggest it.

berndbischl commented 9 years ago

> Also is the naming scheme alright? E.g. plotROCRCurves_ggvis.

No, we use only CamelCase not underscores in function names. Really stick to the StyleGuide please. Adding ggvis at the end is good IMHO as command completion then shows you the 2 versions next to each other? So maybe plotROCRCurvesGGVIS?

zmjones commented 9 years ago

Ok, I thought as much, all the capitals just looked strange to me for some reason.

berndbischl commented 9 years ago

> Ok, I thought as much, all the capitals just looked strange to me for some reason.

Yes I know. But we must stay consistent. For abbreviations I have usually chosen to use ALL-CAPITALS. Well, except for Irace maybe :(

zmjones commented 9 years ago

So I know ggvis works with knitr: e.g., this, and their own docs. I haven't been able to get it to work correctly on my system. I am currently setting up an Ubuntu virtual machine to verify that the problem is my system.

zmjones commented 9 years ago

So I have figured out that (seemingly) all of the rendering of .Rmd files with ggvis plots is done using rmarkdown, specifically rmarkdown::render. This uses knitr first and then calls pandoc, which is now packaged with RStudio. I can generate html with embedded ggvis plots outside of RStudio using rmarkdown::render. The call to pandoc in render has a bunch of arguments (including mathjax, syntax highlighting, etc.). I think that one or more of these arguments is what allows render to work, whereas vanilla use of knitr to go from Rmd to md and pandoc to go from md to html does not (using knitr::knit2html doesn't work either). Looking at the raw html/markdown it is clear that the information for the ggvis plot is in there. I will add all of this to the wiki as well.

larskotthoff commented 9 years ago

Great, thanks for investigating. Could you add a complete short example along with brief instructions on how to compile it please?

zmjones commented 9 years ago

So I figured out the problem. I can now render the plots using knitr and pandoc separately or using rmarkdown::render (which does this internally). My understanding is that mkdocs does the rendering from markdown to html currently. The only thing we would have to change from the current setup is any page that has a ggvis plot in it must have a number of javascript libraries in the header, all of which are packaged with ggvis. I am not sure how the paths to the files are looked up but am trying to figure that out now. I put a small example in the wiki.
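For reference, the two routes can be sketched roughly as below. This is an outline, not the wiki example: `example.Rmd` is a placeholder file name, pandoc is assumed to be on the PATH, and the extra JavaScript header includes that ggvis needs are not shown because the thread does not list them.

```r
# Route 1: rmarkdown::render knits the document and calls pandoc internally.
renderOneStep = function(input = "example.Rmd") {
  rmarkdown::render(input, output_format = "html_document")
}

# Route 2: knit to markdown first, then convert with pandoc by hand.
renderTwoStep = function(input = "example.Rmd") {
  # knitr::knit returns the path of the markdown file it wrote.
  md = knitr::knit(input, output = sub("\\.Rmd$", ".md", input))
  html = sub("\\.md$", ".html", md)
  # --standalone asks pandoc for a complete HTML page rather than a fragment.
  system2("pandoc", c(md, "--standalone", "-o", html))
  html
}
```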

larskotthoff commented 9 years ago

Thanks for investigating -- it sounds like that would require quite a number of changes to our current build pipeline for the tutorial. I guess we don't need to worry about it for now as we're keeping the ggplot versions.

zmjones commented 9 years ago

Yes it would. I don't think it would be that hard though. I think the easiest way would be to remove the dependency on mkdocs and use rmarkdown::render which goes from Rmd to html directly, automatically includes support for mathjax, etc. I'd need to do some things to turn it into a proper website (e.g. write a nav bar, header, and footer), and maybe some other styling.

What would you prefer I do? I could integrate all the ggvis functions and just not put them in the tutorial for now, or I could do the above and then add them as well as put them in the tutorial.

larskotthoff commented 9 years ago

Let's hold off putting this into the tutorial for now. It sounds like this would be quite a bit of "busywork" that's not related to your actual project.

zmjones commented 9 years ago

Yes that is true. I'll add some info about it to the wiki for if/when we decide to do it.

zmjones commented 9 years ago

I am basically done with all the basic ggvis functionality now, and am just fixing up docs, tests, etc. Should I go ahead and issue a PR for this?

The next thing I was going to do was to modify some of the generating functions to make a little more use of interactivity/facetting.

Many of the ggplot functions have arguments for linesize, pointsize, etc. that differ from the defaults but seem a bit strange to me. I sort of see those arguments as clutter. What do you think about removing them and only keeping the minimum? After all, we have the data generating functions for when people want more control over their plots.

I was also wondering why ggplot2 is in depends rather than imports.

larskotthoff commented 9 years ago

PR: Yes please, when everything is there.

Next thing: Sounds good.

Arguments: I agree in general, and ideally we would have a way of supplying these optional additional arguments in a consistent manner across all plotting functions (i.e. use ... or bundle them into a single argument that is then "decomposed" into several before passing on to ggplot2). Could you have a look at what makes sense there please?
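A sketch of the "bundle and decompose" variant, assuming one argument list per layer. Everything here is hypothetical; `describeLayer` merely stands in for a ggplot2 layer constructor so that the example runs in base R.

```r
# Hypothetical helper: merge user-supplied settings into defaults, then call `fun`.
callWithDefaults = function(fun, defaults, user.args = list()) {
  do.call(fun, modifyList(defaults, user.args))
}

# Stand-in for a layer constructor such as a point or line geom.
describeLayer = function(size, colour) {
  sprintf("size=%s colour=%s", size, colour)
}

# One bundled argument per layer keeps identically named settings apart.
plotSketch = function(point.args = list(), line.args = list()) {
  c(
    point = callWithDefaults(describeLayer, list(size = 2, colour = "black"), point.args),
    line  = callWithDefaults(describeLayer, list(size = 1, colour = "black"), line.args)
  )
}

res = plotSketch(point.args = list(size = 4))
print(res)
```

Compared with a single `...`, per-layer lists make it explicit which layer a `size` or `colour` setting is meant for.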

Depends: @berndbischl will know more about this, but it sounds to me like ggplot2 should probably be in Imports (also see the discussion here).

zmjones commented 9 years ago

I've run into the problem of passing additional arguments to ggplot2 before. As far as I'm aware there isn't a good way to do it. There are lots of functions that form a particular plot, many of which have the same arguments, making a big ambiguous mess. Maybe there is something I don't know of though.

larskotthoff commented 9 years ago

Would it make sense to return the ggplot object from the functions to allow the plot to be customised in the usual ggplot fashion?

zmjones commented 9 years ago

yea that is my thinking, except in cases where the customization has to do with the aesthetics: mapping a variable to color, linetype, etc. if we don't give the option to do that and it is something that would be useful, then the user would have to use the data generation function and recreate the plot with the desired mapped aesthetic. if the user wants to change the pointsize of the geom_point() layer though, they can just take the output and add geom_point(size = x) and be on their way, as the layer will just be replaced. the only way i can see for that to get complicated is when layers have their own data. i think we only have 1 function that does that, and i could avoid that easily enough.

larskotthoff commented 9 years ago

I don't think it would be reasonable to have a plotting function that, if you want to change simple things like e.g. colours, requires you to essentially reimplement it to change that.

It sounds like the best solution to this may be non-obvious, so why don't you go ahead without worrying about this for now and come back to it later, when you have more experience with implementing different things and how those could be modified?

zmjones commented 9 years ago

Ok will do. Probably a good idea to think/read a bit so that we can have a uniform way to customize the plots via the functions we have in mlr.