Closed zmjones closed 8 years ago
Ah ok.
I am not sure about the legend. I'll start looking into that.
Yes I think producing all the plots would be possible. Not sure about the performance but I guess I'll find out!
So the big problem with the last suggestion is that (D3) is all done client side. This is a dynamic server-side web application so we'd have to have a server running R to do that. RStudio of course has such a thing ready to go. Maybe using the free tier would be good for showing you all what I am doing? For users I think running things locally is pretty solid. I think I would use this when doing research.
Looking at the docs I don't think it is currently possible to interact with the legend. However I think I could add more controls; e.g. have two drop down menus for learners and measures respectively. If you pick more than one learner, you can only pick one measure, and vice versa. Do you think that would be better?
What I mean is that you could put the generated HTML/JS files in a gist and then put that through bl.ocks. If you have a machine that you can run a server on, even better!
If doing more controls is easy, I would just try it and then see how we like it. That could also be an option to the function (i.e. provide a list of things you want controls for).
That won't work for this though. To get the data reactivity, the application has to be connected to a running R session. Shiny doesn't generate static JavaScript and HTML; what shows up in the browser is generated dynamically by a server running R. Shiny is like Flask or Django in that sense. Maybe I could write something like a 'walker' (like frozen-flask) that would make all the possible requests, save the returned pages, and then display them statically. That would be a lot of work, I think; I doubt I could do it in a timely manner. Doing it all in JavaScript directly would also be possible of course, but the start-up costs of that would be high (for me anyhow).
Sounds good on the controls. I'll see how general I can make it. Perhaps then I can abstract the Shiny app into its own separate function that will work for all the visualization tasks we have.
Related to this discussion: many of the methods which generate the data used by the plotting functions could be expanded to take advantage of this interactivity. For example, getFilterValues could take a list of methods, which could then be plotted in the same way as the example above.
Hmm, this almost sounds like you'd want to extend/change the data structure used to communicate between the generation and plotting functions. Otherwise the parameters to one would depend on the parameters of the other and make things quite brittle.
Regarding the "static" interactive visualisations, I think it's worth investigating this a little bit further to see what could be done. It will be very useful to be able to do something like that, especially for the tutorial and when generating HTML reports (e.g. with knitr). I'm quite familiar with Javascript and would be happy to help.
Ok a bit confused. Could you explain more what you mean in the first paragraph?
Regarding the second: are you saying we should consider doing something other than ggvis that would produce something interactive without requiring server-side computation? Is it your feeling that this is a dealbreaker for ggvis/shiny?
What I mean is that it may make sense in some cases to have more information come from the data generation functions to avoid duplication in the plotting functions. For example, to specify which things you want controls for, you need to know what things are available. You could give an explicit list to the plotting function, but that may break if the data changes and one of the things isn't available. I'm not sure to what extent it will be possible to avoid duplication here and still keep the data generation and the actual plotting completely separate, but it's something to think about.
And yes, I'd love to see something that doesn't require a server and is still interactive to some extent (along the lines of ViperCharts, which we already have). It's not a deal breaker for me, but I think it would be very nice to have.
shall i go on hangout in 30 min to give you my 2 cents?
The earliest I can do is tomorrow morning, but feel free to meet without me.
ok zach tell me if you have only a few minutes
sorry was having a meeting with my advisor. i could do most anytime tomorrow though
ok, tomorrow after 17 CET then
Please tell me a time so I can plan a bit
oh sorry i took that as the default time. i am at my office working and can do anytime
I would like to hear your 2 cents, Bernd, so please include me.
So I am in the process of separating the computing from the plotting, so we can have separate ggplot/ggvis functions. So far I've been giving the output data from the generating function an S3 class that is data.frame plus the name of the corresponding plotting function minus "plot"; e.g., the new function generateROCRCurves outputs a data.frame with class "ROCRCurves", which is passed to either of the plotting functions. I've added attributes that might be relevant from the call to the generation function. Does that sound alright?
What's the purpose of the class that determines the type of plot? Do you envision having an almighty plot function that takes one of these data frames and then dispatches accordingly, or is the main purpose error checking and the like?
The latter. Inside the ggplot version of plotROCRCurves, for example, I am using some information that you wouldn't necessarily want in the output data.frame (raw input parameters). I was still planning on having two plotting functions (not methods).
Ok, sounds reasonable.
Sounds also good to me. So it's basically a data.frame with some extra info.
One note though: I usually do not use attributes that much. Maybe it is a question of style. But if you see that you need more than the df, maybe create a list and have the df in there cleanly as an element instead of burying large extra objects in attributes.
Ah ok. Most of the attributes are small, but if that is the preferred style that is not a problem. Should I write a print method for the object as well?
Yes please.
Printer: Yes, sounds good to summarize the info. Most important: document the structure, so a reasonably intelligent person understands what kind of info is where in there.
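A minimal sketch of the structure being discussed, using the list-based layout Bernd suggested (all names and fields here are hypothetical, just to illustrate the pattern, not the actual mlr implementation):

```r
# Hypothetical generator: return a list with the data.frame cleanly as an
# element (rather than burying extra info in attributes), with an S3 class
# named after the plotting function minus "plot".
generateROCRCurves = function(df, measures) {
  obj = list(data = df, measures = measures)
  class(obj) = "ROCRCurves"
  obj
}

# Print method: summarize the extra info and show the first 5 rows.
print.ROCRCurves = function(x, ...) {
  cat("ROCRCurves object\n")
  cat("Measures:", paste(x$measures, collapse = ", "), "\n")
  print(head(x$data, 5L))
  invisible(x)
}
```

The plotting functions can then check `inherits(obj, "ROCRCurves")` for error checking without any dispatch on plot type.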
Ok will do. I'll be done with the separation of plotROCRCurves shortly and will push that today.
I added PRs for the tutorial and code. This separates out the plotting and computing. I have a print method for the object output from generateLearningCurve (which just prints the first 5 rows of the data.frame). I ran all the tests and ran check locally, as well as visually inspecting the ROC page of the tutorial.
Sorry for not having set up travis to work with my fork yet. It is a little complicated; I will work on that tomorrow.
I created a wiki page for the project, which describes the ggvis investigation.
Thanks, that's very detailed. Could you add some brief notes on what you found with respect to how well it works with knitr?
Yes I will do that before I head home today. I was waiting on the PRs so that I can create a new branch based off of them, where I start integrating ggvis alongside ggplot2 rather than in place of it. That way I can add to the tutorial (it is totally broken in my ggvis branch, where I was replacing ggplot2). I can work around that though if I need to wait on the PRs.
I've merged both of them, so you should be good to go!
Ok cool, working on that now.
Would you rather I make separate commits for each ggvis function? Also, is the naming scheme alright? E.g. plotROCRCurves_ggvis. My plan now is to integrate all of the ones I've written, add the interactive functionality that makes sense with the data generation functions as they are, commit this, and then go back and expand the generation functions and add a bit more interaction (nothing crazy though). Then after that I'd move on to plotLearnerPrediction or something else.
Would you rather I make separate commits for each ggvis function?
Don't care that much. But probably good as you suggest.
Also, is the naming scheme alright? E.g. plotROCRCurves_ggvis.
No, we use only CamelCase, not underscores, in function names. Really stick to the StyleGuide please. Adding ggvis at the end is good IMHO, as command completion then shows you the 2 versions next to each other. So maybe plotROCRCurvesGGVIS?
Ok, I thought as much, all the capitals just looked strange to me for some reason.
Yes I know. But we must stay consistent. For abbreviations I have usually chosen to use ALL-CAPITALS. Well, except for Irace maybe :(
So I know ggvis works with knitr: e.g., this, and their own docs. I haven't been able to get it to work correctly on my system. I am currently setting up an Ubuntu virtual machine to verify that the problem is with my system.
So I have figured out that (seemingly) all of the rendering of .Rmd files with ggvis plots is done using rmarkdown, specifically rmarkdown::render. This uses knitr first and then calls pandoc, which is now packaged with RStudio. I can generate HTML with embedded ggvis plots outside of RStudio using rmarkdown::render. The call to pandoc in render has a bunch of arguments (for mathjax, syntax highlighting, etc.). I think that one or more of these arguments is what allows render to work, whereas the vanilla route of using knitr to go from Rmd to md and pandoc to go from md to html does not (using knitr::knit2html doesn't work either). Looking at the raw HTML/markdown, it is clear that the information for the ggvis plot is in there. I will add all of this to the wiki as well.
Great, thanks for investigating. Could you add a complete short example along with brief instructions on how to compile it please?
So I figured out the problem. I can now render the plots using knitr and pandoc separately, or using rmarkdown::render (which does this internally). My understanding is that mkdocs currently does the rendering from markdown to HTML. The only thing we would have to change from the current setup is that any page with a ggvis plot in it must have a number of JavaScript libraries in the header, all of which are packaged with ggvis. I am not sure how the paths to the files are looked up, but am trying to figure that out now. I put a small example in the wiki.
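For reference, the two rendering routes described above might look roughly like this (the file names are hypothetical, and the pandoc invocation is a sketch; the exact extra arguments for the ggvis JS dependencies still need to be worked out):

```r
# Route 1: rmarkdown::render drives knitr and then pandoc internally,
# supplying the pandoc arguments that a vanilla knit2html run lacks.
rmarkdown::render("example.Rmd", output_format = "html_document")

# Route 2: run knitr and pandoc as separate steps; the pandoc call must
# then be given whatever extra arguments render() would otherwise supply
# (and the page header needs the JS libraries packaged with ggvis).
knitr::knit("example.Rmd", output = "example.md")
system("pandoc example.md -s -o example.html")
```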
Thanks for investigating -- it sounds like that would require quite a number of changes to our current build pipeline for the tutorial. I guess we don't need to worry about it for now as we're keeping the ggplot versions.
Yes it would. I don't think it would be that hard though. I think the easiest way would be to remove the dependency on mkdocs and use rmarkdown::render, which goes from Rmd to html directly and automatically includes support for mathjax, etc. I'd need to do some things to turn it into a proper website (e.g. write a nav bar, header, and footer), and maybe some other styling.
What would you prefer I do? I could integrate all the ggvis functions and just not put them in the tutorial for now, or I could do the above and then add them as well as put them in the tutorial.
Let's hold off putting this into the tutorial for now. It sounds like this would be quite a bit of "busywork" that's not related to your actual project.
Yes that is true. I'll add some info about it to the wiki for if/when we decide to do it.
I am basically done with all the basic ggvis functionality now, and am just fixing up docs, tests, etc. Should I go ahead and issue a PR for this?
The next thing I was going to do was to modify some of the generating functions to make a little more use of interactivity/facetting.
Many of the ggplot functions have arguments for linesize, pointsize, etc. that differ from the defaults but seem a bit strange to me. I sort of see those arguments as clutter. What do you think about removing them and only keeping the minimum? After all, we have the data generating functions for when people want more control over their plots.
I was also wondering why ggplot2 is in depends rather than imports.
PR: Yes please, when everything is there.
Next thing: Sounds good.
Arguments: I agree in general, and ideally we would have a way of supplying these optional additional arguments in a consistent manner across all plotting functions (i.e. use ... or bundle them into a single argument that is then "decomposed" into several before passing on to ggplot2). Could you have a look at what makes sense there please?
Depends: @berndbischl will know more about this, but it sounds to me like ggplot2 should probably be in Imports (also see the discussion here).
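For illustration, moving ggplot2 from Depends to Imports would mean a DESCRIPTION file along these lines (a sketch; version constraints are hypothetical, and the functions used would also need corresponding importFrom entries in NAMESPACE):

```
Depends:
    R (>= 3.0.0)
Imports:
    ggplot2
```

The practical difference is that with Imports, ggplot2 is loaded but not attached to the user's search path when mlr is loaded.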
I've run into the problem of passing additional arguments to ggplot2 before. As far as I'm aware there isn't a good way to do it: lots of functions contribute to a particular plot, many of which have the same arguments, making a big ambiguous mess. Maybe there is something I don't know of, though.
Would it make sense to return the ggplot object from the functions to allow the plot to be customised in the usual ggplot fashion?
Yeah, that is my thinking, except in cases where the customization has to do with the aesthetics: mapping a variable to color, linetype, etc. If we don't give the option to do that and it is something that would be useful, then the user would have to use the data generation function and recreate the plot with the desired mapped aesthetic. If the user wants to change the point size of the geom_point() layer though, they can just take the output, add geom_point(size = x), and be on their way, as the layer will effectively be replaced. The only way I can see for that to get complicated is when layers have their own data. I think we only have one function that does that, and I could avoid that easily enough.
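To illustrate the kind of customization being discussed, here is a sketch using a stand-in plot (the dataset and mappings are hypothetical, not an actual mlr plotting function):

```r
library(ggplot2)

# Suppose a plotting function returns the ggplot object instead of printing it.
p = ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# The user can tweak non-aesthetic settings by adding layers/themes; note
# that adding geom_point() stacks a second points layer on top, which
# visually overrides the default one.
p + geom_point(size = 3) + theme_bw()

# Remapping an aesthetic (e.g. color), however, requires access to the
# underlying data, which is why exposing the data generation functions
# (or aesthetic-mapping arguments) matters.
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) + geom_point()
```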
I don't think it would be reasonable to have a plotting function that, if you want to change simple things like e.g. colours, requires you to essentially reimplement it to change that.
It sounds like the best solution to this may be non-obvious, so why don't you go ahead without worrying about it for now, and come back to it later, when you have more experience with implementing different things and how they could be modified?
Ok will do. Probably a good idea to think/read a bit so that we can have a uniform way to customize the plots via the functions we have in mlr.
In the near future (before May 5) I plan on refactoring plotROCRCurves to use plotROC, which uses ggplot2 instead of base graphics. This package offers some extra functionality (compared to what is available now) which I'll document. I also hope to get at least one other smallish feature done by then.
One option would be extending plotLearnerPrediction to cases with 3 features. The two obvious things to do here are to use one of the 3D plotting packages (I think plot3Drgl is nice) and to use facetting for the third feature; I'd definitely like to do the latter. With a discrete feature this is easy, but it might be nice to add the ability to discretize one of the features as well. We could also plot 4 features by using the background color. In general it would be possible to layer on additional features in this way, but it seems to have diminishing returns in terms of interpretability after 2 or 3 features.
Another thing I could possibly do is to add an option to performance that lets you apply a measure to a subset of the feature space. I find this very useful for exploring the fit of learners, especially with data that is structured in some way. I haven't looked at the code for performance yet, so I don't have an idea how much work that would entail. One problem I can see is that if some of the cells of the grouping are small, the variance might be quite large. I am not sure whether that is out of the scope of the project. Is this something others would like to have?
When I get back (around May 16-17) I would like to finish up any residual work from the above first. I'd like to talk to Julia/Lars/Bernd about what I do next. I've had my nose in the EDA related functionality lately, so my inclination is to start working on that first. Alternatively I could start work on producing interactive versions of the existing plotting functionality.
I have found some papers recently that I think are worth prioritizing above the minor things in my proposal (dependent data resampling methods and standard error estimates for random forests and other ensemble methods). In particular Hooker 2012 and Drummond and Holte 2006.