h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Ideas on how to implement visualization of regression results #13007

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

This is from a conversation with Nate Woody (nate.a.woody@gmail.com) on the h2ostream Google Group.

Nate> Second, both validation measures you mention are for classification models. I'm still out in the woods with regression.

Tom> I’d appreciate it if you sent me a specific example of the use case you are advocating.

Tom> (eg. what the plot looks like, how you interpret it, and anything else you can think of that’s useful for us to know).

Tom> Also, please keep the big data perspective in mind: if you were dealing with a billion rows, would your point of view change? If so, how?

Nate> Standard predicted vs. actual plot: http://www.jmp.com/support/help/Graphs_for_Goodness_of_Fit.shtml For larger datasets, the individual points aren't useful, so we use a density heatmap (nice example: http://www.chrisstucchio.com/blog/2012/dont_use_scatterplots.html, though we use much higher resolution) or we calculate contours on the density. I also typically calculate multiple statistics (Pearson's and Spearman's correlations, % within a log unit) so that we can better understand the error.

Nate> Our data tops out at 1-2 million rows, so I have no practical opinion on billion-row models.

Nate> I model biological measurement data. The error is non-linear, the data is not uniformly distributed, and we are more sensitive to prediction errors in some ranges than in others. In other words, someone may not care about a high vs. very high misprediction, but care greatly about a medium vs. low misprediction. I want to know how badly the model is regressing to the mean and how evenly the prediction error is distributed.
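The statistics mentioned above could be computed roughly as follows. This is only a sketch using the Python standard library, not H2O code. The function names are hypothetical, and "% within a log unit" is interpreted here as the share of predictions whose base-10 logarithm is within 1.0 of the actual value's, which is an assumption and requires strictly positive values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

def fraction_within_log_unit(actual, predicted):
    """Share of rows with |log10(predicted) - log10(actual)| <= 1 (assumed definition)."""
    hits = sum(
        1 for a, p in zip(actual, predicted)
        if abs(math.log10(p) - math.log10(a)) <= 1.0
    )
    return hits / len(actual)

# Made-up illustration data, not from the thread.
actual = [1.0, 10.0, 100.0, 1000.0]
predicted = [2.0, 8.0, 5.0, 900.0]
print(spearman(actual, predicted))                  # 0.8
print(fraction_within_log_unit(actual, predicted))  # 0.75
```

Reporting a rank correlation alongside Pearson's is useful when the error is non-linear, because Spearman's only depends on ordering.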

exalate-issue-sync[bot] commented 1 year ago

Irene Lang commented: There are two different pieces to this. One is visualization of final results (e.g., a scaled comparison of coefficients). The other is visualizations that continue the analysis, such as residual plots to test model assumptions. Which do we want to approach here?

exalate-issue-sync[bot] commented 1 year ago

Prithvi Prabhu commented: I vote for density plots / contours or hex binning. Generating scatterplots for 10,000+ points is impractical in the browser.
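A minimal sketch of the binning idea, assuming a rectangular grid rather than hexagons and a value range known in advance (both are simplifications, and the function names are hypothetical). The point is that only per-cell counts need to reach the browser, and counts from separate chunks of data can simply be added together, so the approach does not depend on the number of rows.

```python
from collections import Counter

def bin_index(value, lo, hi, n_bins):
    """Map a value to a bin in [0, n_bins - 1]; out-of-range values are clamped."""
    if value <= lo:
        return 0
    if value >= hi:
        return n_bins - 1
    return int((value - lo) / (hi - lo) * n_bins)

def density_counts(pairs, lo, hi, n_bins):
    """Count (actual, predicted) pairs per grid cell, in a single pass."""
    counts = Counter()
    for actual, predicted in pairs:
        cell = (bin_index(actual, lo, hi, n_bins),
                bin_index(predicted, lo, hi, n_bins))
        counts[cell] += 1
    return counts

# Two made-up chunks, standing in for data held on different nodes.
chunk_a = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.5)]
chunk_b = [(0.12, 0.22), (0.95, 0.95)]

# Counter addition merges the partial grids cell by cell.
total = density_counts(chunk_a, 0.0, 1.0, 4) + density_counts(chunk_b, 0.0, 1.0, 4)
print(sorted(total.items()))
# [((0, 0), 2), ((0, 1), 1), ((3, 2), 1), ((3, 3), 1)]
```

The resulting grid has at most `n_bins * n_bins` cells, which is what a heatmap or contour renderer would consume.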

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-2117
Assignee: Prithvi Prabhu
Reporter: Tom Kraljevic
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A