Arcsin Transformation on Proportional Data and Development Rule-of-Thumb?

I am experimenting with the developed attribute as a descriptor of the probability of no-flow. (Just the probability of no flow existing, not its proportion [percent decadal no flow].) As one might expect, as development goes up, no-flow tends to occur less often. Many permutations of models show this. This is a good thing!

There is some advice in R books on considering a 2*arcsin(sqrt(PERCENT/100)) transformation for proportional data. My experiments suggest that such a transformation is a good idea, at least when developed enters the GAM as a non-smooth (parametric) term; I am fitting each decade separately to try to isolate its parametric impact in my mind. It might only look like a good idea because, after the 1950s, the developed attribute spans nearly the full range, [0, 80-90+).
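As a quick check on what the transformation does to the covariate, here is a minimal Python sketch (the function name `arcsine_sqrt` is hypothetical, chosen only for illustration). It maps a percentage in [0, 100] onto [0, pi], stretching the scale near 0 and 100 relative to the middle.

```python
import math

def arcsine_sqrt(percent):
    """Return 2*asin(sqrt(percent/100)) for a percentage in [0, 100]."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must lie in [0, 100]")
    return 2.0 * math.asin(math.sqrt(percent / 100.0))

# Tabulate the transformed value at a few development levels.
for pct in (0, 10, 50, 90, 100):
    print(f"{pct:3d}% developed -> {arcsine_sqrt(pct):.4f}")
```

In R the same quantity is `2 * asin(sqrt(developed/100))`, which is how it appears in the model formula in this post.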

I know that in Random Forests, SVM, etc. classification such a transformation is not needed, but I am currently thinking about those few truly mutable attributes we have, like developed, for which isolating a single parametric coefficient in a model might help communicate alteration impacts.

I am aware of some modern critique of the arcsine transformation (http://www.mun.ca/biology/dschneider/b7932/B7932Final10Dec2010.pdf), but I want to emphasize that I am transforming a covariate on the right-hand side of the equation, not the response on the left-hand side, which is handled by the logit link function.

The 1950s are a little weird because of the 400-year drought in much of Texas. Using the decades 1960 through 2000, then, here are the coefficients on the transformed developed: -1.629964, -2.288744, -1.107656, -1.368711, -1.406253, respectively.

**So one might have a shoot-from-the-hip (unweighted) estimate that the generalized alteration effect on the logit is -1.5(2 asin(sqrt(developed/100))). The negative sign shows that the probability of no-flow occurring decreases with development, consistent with impervious cover, leaky water lines, lawn return flows, etc.**
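To make the rule of thumb concrete, the sketch below averages the five decadal coefficients quoted in this post and evaluates the implied shift in the logit at a given development level. The helper names (`logit_shift`, `odds_multiplier`) are hypothetical, and the calculation assumes all other model terms are held fixed, so it describes only the developed term and not a predicted probability.

```python
import math

# Decadal coefficients on the transformed developed term, as quoted in this post.
coefs = {
    1960: -1.629964,
    1970: -2.288744,
    1980: -1.107656,
    1990: -1.368711,
    2000: -1.406253,
}

# Unweighted mean, which the post rounds to about -1.5.
mean_coef = sum(coefs.values()) / len(coefs)

def logit_shift(developed_pct, coef=-1.5):
    """Change in the logit of no-flow attributable to the developed term alone."""
    return coef * 2.0 * math.asin(math.sqrt(developed_pct / 100.0))

def odds_multiplier(developed_pct, coef=-1.5):
    """Factor applied to the odds of no-flow relative to zero development."""
    return math.exp(logit_shift(developed_pct, coef))

print(f"unweighted mean coefficient: {mean_coef:.4f}")
for pct in (0, 10, 25, 50):
    print(f"{pct:3d}% developed: logit shift {logit_shift(pct):+.3f}, "
          f"odds x {odds_multiplier(pct):.3f}")
```

Passing a single decade's coefficient as `coef` (for example, `coefs[2000]`) gives the same calculation for that decade instead of the rounded average.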

Here are some results for the 2000 decade:

Family: quasibinomial
Link function: logit

Formula:
z ~ CDA + I(2 * asin(sqrt(developed/100))) + s(ANN_DNI, bs = "cr", 
    k = 8) + s(MAY, DEC, bs = "tp", k = 8) + s(x, y, bs = "so", 
    xt = list(bnd = bnd))
Parametric coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        4.4784     0.7186   6.232 8.94e-10 ***
CDA                               -2.2035     0.2679  -8.224 1.34e-15 ***
I(2 * asin(sqrt(developed/100)))  -1.4063     0.3378  -4.163 3.63e-05 ***

Approximate significance of smooth terms:
              edf Ref.df     F  p-value    
s(ANN_DNI)  5.815  6.408 4.768 7.64e-05 ***
s(MAY,DEC)  2.002  2.004 0.355    0.701    
s(x,y)     22.035 28.000 2.502 4.04e-07 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) =  0.554   Deviance explained =   53%
GCV = 0.66203  Scale est. = 1.0577    n = 602

The binary state prediction success (jittering the y-values, amount=0.02): [figure]

The smooth on annual irradiance: [figure]

The smooth on the coupling between May and December irradiance: [figure]

(PS: You can see that I am sticking with irradiance, as it seems superior to ppt_mean and temp_mean.)

The smooth on the x-y location: [figure]
