Irene Lang commented: There are two different pieces for this. One is visualization of final results, e.g., a scaled comparison of coefficients; the other is visualizations that continue the analysis, like residual plotting to test model assumptions. Which do we want to approach here?
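For the second kind, a minimal sketch of a residuals-vs-fitted plot on synthetic data (illustrative only, not H2O code; the variable names and data are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic fitted values and residuals standing in for a real model's output
rng = np.random.default_rng(0)
fitted = rng.uniform(0, 10, size=5_000)
residuals = rng.normal(scale=1.0, size=5_000)

# Residuals vs. fitted: visible structure (curvature, fanning) would
# point to a violated model assumption such as non-constant variance
plt.scatter(fitted, residuals, s=2, alpha=0.3)
plt.axhline(0, color='red', linewidth=1)
plt.xlabel('fitted values')
plt.ylabel('residuals')
plt.title('Residuals vs. fitted')
plt.show()
```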
Prithvi Prabhu commented: I vote for density plots / contours or hex binning. Generating scatterplots for 10,000+ points is impractical in the browser.
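For reference, a minimal sketch of the hex-binning idea using matplotlib on synthetic predicted/actual pairs; because points are aggregated into cells, rendering cost depends on the grid size rather than the row count. (Illustrative only; not an H2O or Flow API.)

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic predicted-vs-actual pairs standing in for model scores
rng = np.random.default_rng(0)
actual = rng.normal(size=100_000)
predicted = actual + rng.normal(scale=0.5, size=100_000)

# Hex binning collapses 100k points into roughly 60x60 cells, so the
# plot stays cheap no matter how many rows were binned
plt.hexbin(actual, predicted, gridsize=60, bins='log', cmap='viridis')
plt.colorbar(label='log10(count)')
plt.xlabel('actual')
plt.ylabel('predicted')
plt.title('Predicted vs. actual (hex-binned density)')
plt.show()
```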
JIRA Issue Migration Info
Jira Issue: PUBDEV-2117
Assignee: Prithvi Prabhu
Reporter: Tom Kraljevic
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A
This is from a conversation with Nate Woody on the h2ostream Google Group (nate.a.woody@gmail.com).
Nate> Second, both validation measures you mention are for classification models. I'm still out in the woods with regression.
Tom> I’d appreciate it if you sent me a specific example of the use case you are advocating.
Tom> (e.g., what the plot looks like, how you interpret it, and anything else you can think of that's useful for us to know).
Tom> Also, please keep in mind the big-data perspective: if you were dealing with a billion rows, would your point of view change? If so, how?
Nate> Standard predicted vs. actual: http://www.jmp.com/support/help/Graphs_for_Goodness_of_Fit.shtml

Nate> For larger datasets the individual points aren't useful, so we use a density heatmap (nice example: http://www.chrisstucchio.com/blog/2012/dont_use_scatterplots.html, though we use much higher resolution) or we calculate contours on the density. I also typically calculate multiple statistics (Pearson's and Spearman's correlations, percent of predictions within a log unit) so that we can better understand the error. Our data tops out at 1-2 million rows, so I have no practical opinion on billion-row models.

Nate> I model biological measurement data. The error is non-linear, the data is not uniformly distributed, and we are more sensitive to predictive errors in some ranges than others. In other words, someone may not care about a high vs. very-high mis-prediction, but care greatly about a medium vs. low mis-prediction. I want to know how badly the model is regressing to the mean and how evenly distributed the prediction error is.
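A hedged sketch of the multiple-statistics idea described above (Pearson's and Spearman's correlations plus the fraction of predictions within one log unit), using scipy; the function name, the log-unit threshold, and the synthetic data are illustrative assumptions, not anything from the original thread:

```python
import numpy as np
from scipy import stats

def regression_error_summary(actual, predicted):
    """Complementary error statistics for a regression model's predictions."""
    pearson_r, _ = stats.pearsonr(actual, predicted)    # linear association
    spearman_r, _ = stats.spearmanr(actual, predicted)  # rank association
    # Fraction of predictions within one log10 unit of the observed value;
    # assumes strictly positive values, as is typical for bioassay data
    frac_within_log = np.mean(np.abs(np.log10(predicted) - np.log10(actual)) <= 1.0)
    return {'pearson': pearson_r,
            'spearman': spearman_r,
            'frac_within_1_log_unit': frac_within_log}

# Illustrative use on synthetic, positive-valued measurement data
rng = np.random.default_rng(1)
actual = 10 ** rng.normal(loc=2.0, scale=1.0, size=10_000)
predicted = actual * 10 ** rng.normal(scale=0.4, size=10_000)
print(regression_error_summary(actual, predicted))
```

Reporting rank correlation alongside Pearson's matters here because the error is non-linear and the data non-uniform, so a single linear-fit statistic can hide exactly the range-dependent errors Nate cares about.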