rgiordan / zaminfluence

Tools in R for computing and using Z-estimator approximate influence functions.
Apache License 2.0

Inconsistent Behavior on Singular Data #45

Open ittai-rubinstein opened 9 months ago

ittai-rubinstein commented 9 months ago

Hi,

TLDR: I tried running ZAMInfluence on a regression with collinear columns (which is supported, for instance, by the reg command in R). Depending on rounding errors, this triggers one of several different failure modes, which made it hard to identify collinearity as the cause of the problem I was seeing. Supporting collinear regressions (e.g., by using pseudo-inverses) or checking for collinear columns would improve the usability of the code.

I was testing the robustness of a regression that included a couple of one-hot encodings, which I had generated in the Python convention of not omitting the first value. Because every level is kept, the columns of each encoding sum to a vector of ones, so the regressors are linearly dependent. I then ran ZAMInfluence on this regression.
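To illustrate why this encoding makes the regressors collinear, here is a minimal sketch in base R. The data and variable names are invented for illustration and are not taken from my actual regression.

```r
# Hypothetical toy data: two categorical variables covering all six level pairs.
df <- data.frame(
  a = factor(rep(c("x", "y", "z"), length.out = 12)),
  b = factor(rep(c("u", "v"), length.out = 12))
)

# "- 1" drops the intercept, so model.matrix keeps a column for EVERY level,
# mimicking a one-hot encoding that does not omit the first value.
one_hot_a <- model.matrix(~ a - 1, data = df)
one_hot_b <- model.matrix(~ b - 1, data = df)
X <- cbind(one_hot_a, one_hot_b)

# Each encoding's columns sum to a vector of ones, so
# (ax + ay + az) - (bu + bv) = 0 is an exact linear dependence.
rank_X <- qr(X)$rank
cat("columns:", ncol(X), "rank:", rank_X, "\n")

# X'X is therefore singular: it has fewer independent directions than columns.
xtx <- crossprod(X)
```

Dropping one level from all but one of the encodings (or from every encoding if an intercept is included) removes the dependence.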

Sometimes I got the error "need at least two non-NA values to interpolate" at the line `approx_result <- approx(x=x_values, y=y_values, xout=signal)` in the file `influence_lib.R`. Other times I got no error at all, but meaningless results.

As far as I can tell, which failure occurs (if any) is determined in the file `ols_iv_grads_torch_lib.R`, where `tv$se_cov_mat` is computed as the inverse of `tv$zwx`, a PSD matrix that is not of full rank when the data has collinear columns. In the flow that produced my error, inverting the singular matrix did not directly raise an error or produce any NaN values, but the diagonal elements of the inverse could come out negative (depending on floating-point rounding). When they were negative, `tv$betahat_se <- torch_sqrt(torch_diag(tv$se_cov_mat))` produced NaNs that propagated through the algorithm and caused the error mentioned above.
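The propagation from a negative diagonal entry to the interpolation error can be reproduced in base R without torch. This is only a sketch of the mechanism: the numeric values and the way `y_values` is built are made up for illustration and do not reflect how zaminfluence actually computes them.

```r
# Hypothetical diagonal of a numerically "inverted" singular matrix, where
# rounding error has left one entry negative (values invented for illustration).
se_cov_diag <- c(0.04, -3.2e12, 0.09)

# The square root of a negative number is NaN (base R also emits a warning).
betahat_se <- suppressWarnings(sqrt(se_cov_diag))

# Assumption: the interpolated y-values depend on the affected standard error,
# so every one of them becomes NaN.
x_values <- c(1, 2, 3)
y_values <- x_values * betahat_se[2]

# approx() discards NA/NaN pairs first; with none left it cannot interpolate.
msg <- tryCatch(
  {
    approx(x = x_values, y = y_values, xout = 1.5)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# A simple guard that would surface the root cause earlier.
has_bad_variance <- any(!is.finite(se_cov_diag) | se_cov_diag <= 0)
```

A check like `has_bad_variance` placed right after the inversion would point at the singular matrix rather than at the later interpolation step.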

Other times, I think I got a direct error due to division by 0, and at other times `tv$betahat_se` simply had very large positive entries. This made the bug harder to find, because running ZAMInfluence on synthetic singular data did not reproduce the odd behavior I was trying to debug.

Either adding support for collinear columns or raising an error or warning when the matrix is not of full rank would make the package easier to use.
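Both options could look roughly like the following base R sketch. The helper names `check_full_rank` and `pseudo_inverse` are hypothetical and not part of zaminfluence, and the package itself works with torch tensors, so this only shows the idea.

```r
# Hypothetical helper: fail early with a clear message if the regressor
# matrix is rank deficient.
check_full_rank <- function(x) {
  x_rank <- qr(x)$rank
  if (x_rank < ncol(x)) {
    stop(sprintf(
      "Regressor matrix has rank %d but %d columns; some columns are collinear.",
      x_rank, ncol(x)
    ))
  }
  invisible(TRUE)
}

# Hypothetical helper: Moore-Penrose pseudo-inverse via SVD, dropping
# singular values that are negligible relative to the largest one.
pseudo_inverse <- function(m, tol = sqrt(.Machine$double.eps)) {
  s <- svd(m)
  keep <- s$d > tol * max(s$d)
  # V_k %*% diag(1 / d_k) %*% t(U_k); the division scales row i by 1 / d[i].
  s$v[, keep, drop = FALSE] %*% (t(s$u[, keep, drop = FALSE]) / s$d[keep])
}

# Toy design: the intercept equals the sum of the two dummy columns.
X <- cbind(1, c(1, 0, 1, 0), c(0, 1, 0, 1))
xtx <- crossprod(X)

rank_error <- tryCatch(check_full_rank(X), error = function(e) conditionMessage(e))
xtx_pinv <- pseudo_inverse(xtx)
```

Note that with a pseudo-inverse the individual coefficients of collinear columns are not identified (the result is the minimum-norm solution), so a warning may still be worthwhile even if the computation succeeds.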

Thanks!