Issue with Handling Special Characters in e.g. Polynomial Expressions in feglm

Rasmusma1995 commented 6 months ago

Hello and first off, thank you for developing this fantastic package! It's been incredibly useful.

That said, there seems to be a problem when variable names containing non-ASCII characters (e.g. from foreign languages) are combined with polynomial expressions of covariates. It seems that fixest::feglm misinterprets the formula, leading to an error.

Small example:

data <- data.frame(
  y = rpois(1000, 1),
  gender = sample(c(0,1), 1000, replace = TRUE),
  Løn = sample(seq(1e5,1e6,1e3), 1000, replace = TRUE), # Danish for salary
  salary = sample(seq(1e5,1e6,1e3), 1000, replace = TRUE)
)

fixest::feglm(
  y~ gender + Løn,
  data
)

###
### this works fine and yields
###

#> GLM estimation, family = gaussian, Dep. Var.: y
#> Observations: 1,000 
#> Standard-errors: IID 
#>               Estimate  Std. Error   t value  Pr(>|t|)    
#> (Intercept) 0.956695917 0.080613179 11.867736 < 2.2e-16 ***
#> gender      0.023749757 0.063454728  0.374279   0.70828    
#> Løn         0.000000033 0.000000121  0.273088   0.78484    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Log-Likelihood: -1,419.2   Adj. Pseudo R2: -0.001335
#>            BIC:  2,859.2     Squared Cor.:  2.105e-4

###
### Furthermore, this also works
###

fixest::feglm(
  y~ gender + salary^2,
  data
)

###
### However, combining the two, i.e.
###

fixest::feglm(
  y~ gender + Løn^2,
  data
)

###
### yields the following error
###

# Error in fixest::feglm(y ~ gender + Løn^2, data) : 
#   Evaluation of the right-hand-side of the formula raises an error: 
#   In LøI(n^2): could not find function "LøI"

###
### The error seems to arise from your internal function fixest_fml_rewriter, which rewrites the formula expression incorrectly:
###

fixest:::fixest_fml_rewriter(as.formula(y~ gender + Løn^2))

# $fml
# y ~ gender + LøI(n^2)
# <environment: 0x5621a72d5cd8>
#   
#   $isPanel
# [1] FALSE

lrberge commented 6 months ago

Hi, and glad you find the software useful!

Hmmm, it works on my machine:

fixest:::fixest_fml_rewriter(as.formula(y~ gender + Løn^2))
$fml
y ~ gender + I(Løn^2)
<environment: 0x000001c6437a03e8>

$isPanel
[1] FALSE

The current rewriting of "x^2" into "I(x^2)" uses a lot of regular expressions. In particular, I use "[[:alnum:]]" to catch letters and deduce variables' names.

Can you replicate the following result?

gsub("[[:alnum:]]", "_", "Løn^2")
[1] "___^_"

If not, it seems that the interpretation of the characters differs between your machine and mine. Possible solutions:

update the version of R?
change the encoding of your file to UTF8?

In any case, writing "I(Løn^2)" explicitly should work (and this is the native R way to do it).

Rasmusma1995 commented 6 months ago

It seems that gsub does produce the same result:

gsub("[[:alnum:]]", "_", "Løn^2")
[1] "___^_"

However, explicitly writing "I(Løn^2)" produces an even weirder result:

 fixest:::fixest_fml_rewriter(as.formula(y~ gender + I(Løn^2)))
$fml
y ~ gender + I(LøI(n^2))
<environment: 0x56247917cad0>

$isPanel
[1] FALSE

The problem seems to arise from the following steps:

fml_text = fixest:::deparse_long(as.formula(y~ gender + Løn^2))
fml_text
[1] "y ~ gender + Løn^2"

no_lhs_text = gsub("^[^~]+~", "", fml_text)
no_lhs_text
[1] " gender + Løn^2"

no_lhs_text = gsub("(?<!I\\()(\\b(\\.[[:alpha:]]|[[:alpha:]])[[:alnum:]\\._]*\\^[[:digit:]]+)", "I(\\1)", 
                    no_lhs_text, perl = TRUE)
no_lhs_text
[1] " gender + LøI(n^2)"
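A possible explanation, offered as a sketch rather than a confirmed diagnosis: the earlier check ran gsub with the default regex engine, whereas the rewriting step uses perl = TRUE. With PCRE on a UTF-8 string, "\\b" and the POSIX classes may only recognise ASCII letters, so "ø" would count as a non-word character and the match would start at "n". The snippet below compares the pattern from the steps above with a variant prefixed by the PCRE option "(*UCP)", which asks for Unicode-aware character classes. The "(*UCP)" prefix and the variable names are assumptions made for illustration; they are not part of fixest.

```r
# Sketch only: the pattern is copied from the steps above; the "(*UCP)" prefix
# is an assumed workaround, not what fixest does.
rhs <- " gender + L\u00f8n^2"  # "\u00f8" is "ø", escaped so the string is marked as UTF-8
pattern <- "(?<!I\\()(\\b(\\.[[:alpha:]]|[[:alpha:]])[[:alnum:]\\._]*\\^[[:digit:]]+)"

# Same call as in the thread; the result may differ between platforms.
ascii_only <- gsub(pattern, "I(\\1)", rhs, perl = TRUE)

# "(*UCP)" must come first in the pattern; it makes \b and the POSIX
# classes use Unicode properties, so "ø" is treated as a letter.
ucp_pattern <- paste0("(*UCP)", pattern)
unicode_aware <- gsub(ucp_pattern, "I(\\1)", rhs, perl = TRUE)

# An explicit I() call should be left untouched by the Unicode-aware variant.
explicit <- gsub(ucp_pattern, "I(\\1)", " gender + I(L\u00f8n^2)", perl = TRUE)

print(ascii_only)
print(unicode_aware)
print(explicit)
```

If this explanation holds, the second and third results should keep the variable name intact, while the first reproduces the broken rewrite on affected machines.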

Session info:

R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS:   /opt/R/4.2.1/lib64/R/lib/libRblas.so
LAPACK: /opt/R/4.2.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
character(0)

other attached packages:
[1] fixest_0.11.2

lrberge / fixest

Issue with Handling Special Characters in e.g. Polynomial Expressions in feglm #462