Identify the sample used in a regression

lrberge / fixest

Fixed-effects estimations

https://lrberge.github.io/fixest/


Identify the sample used in a regression #142

Closed vronizor closed 3 years ago

vronizor commented 3 years ago

Hi @lrberge, thanks a lot for the great package!

Disclaimer: I am new to R and this might be a more general question on regressions in R, sorry if it doesn't belong here.

Coming from Stata, I'm used to e(sample), which lets the user flag, post-estimation, the observations that were actually used to run the regression. This can be useful to then compute the average of the dependent variable for the control group included in the estimation, for example.

I haven't found a way to do that with fixest. I've tried several proposed solutions but always ended up with a NULL result for an estimation I knew used only part of the full sample.

Is this at all possible? Might it be that the objects returned by lm as given in the links above are not the same as the ones returned by feols?

vronizor commented 3 years ago

Here is an MWE:

library(data.table)
library(fixest)

DT = as.data.table(airquality)

est = feols(Ozone ~ Solar.R + Wind + Temp | Month + Day, 
            DT, cluster = ~Day)
#> NOTE: 42 observations removed because of NA values (LHS: 37, RHS: 7).

# Method 1
DT[, used := TRUE]
DT[na.action(est), used := FALSE]
#> Null data.table (0 rows and 0 cols)

# Method 2
esample = rownames(as.matrix(resid(est)))
DT[esample]
#> Null data.table (0 rows and 0 cols)

Created on 2021-05-13 by the reprex package (v2.0.0)

I noticed the message "NOTE: 42 observations removed because of NA values (LHS: 37, RHS: 7)", so there might be a way to retrieve that info :)
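For comparison, here is a minimal sketch of the same idea with base R's lm, where the na.action approach from Method 1 does return the dropped rows. It leaves out the fixed effects and the clustering, so it is not the same model as the feols call above. It only illustrates that lm objects store the omitted rows, which is presumably why recipes written for lm do not carry over to feols objects.

```r
# Base R only: lm() keeps track of the rows it dropped because of NAs.
fit_lm <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

# Positions of the omitted rows (an integer vector of class "omit").
dropped <- as.integer(na.action(fit_lm))

# Logical flag over the full data, similar in spirit to Stata's e(sample).
used <- rep(TRUE, nrow(airquality))
used[dropped] <- FALSE

sum(used)                            # 111 rows used, 42 dropped
mean(airquality$Ozone[used])         # mean of the dependent variable in the estimation sample
```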

24thronin commented 3 years ago

Hi vronizor, I am also new to R and I had to do this today. I found that the "obsRemoved" entry in the est object returned by feols can be used to retrieve the sample as follows (this retrieves all columns):

sample <- DT[-est[["obsRemoved"]],]

or, for just the vector of row numbers used, it should be:

sample <- setdiff(1:est[["nobs_origin"]], est[["obsRemoved"]])
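One edge case worth guarding against with the negative-index approach: in R, indexing with an empty negative vector selects zero rows rather than all rows, so if no observation was removed (and the removed-index entry is then presumably NULL or empty), the result would be an empty table. A small sketch of a safer helper follows; rows_used is a hypothetical name, not part of fixest, and the slot names in the usage note are the ones quoted in this thread.

```r
# Hypothetical helper (not part of fixest): row numbers kept in the estimation,
# given the original number of rows and the indices of the removed rows.
rows_used <- function(n_origin, removed) {
  keep <- rep(TRUE, n_origin)
  # Only drop rows when something was actually removed; abs() tolerates
  # indices stored with a negative sign.
  if (length(removed) > 0) keep[abs(removed)] <- FALSE
  which(keep)
}

rows_used(5, c(2, 4))   # 1 3 5
rows_used(5, NULL)      # 1 2 3 4 5 (nothing removed, so every row is kept)
```

With the estimation from the MWE, this would be called as DT[rows_used(est[["nobs_origin"]], est[["obsRemoved"]])], assuming those entries exist under these names in the installed fixest version. Using a name other than sample for the result also avoids masking base R's sample() function.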

vronizor commented 3 years ago

Thanks @24thronin, works perfectly! I need to get used to digging into these post-estimation objects, they are very handy!

lrberge commented 3 years ago

Hi @vronizor, and thanks John for bringing a solution.

I would just add a note of caution: obsRemoved only covers observations removed because of NA values or, in non-linear models (Poisson for instance), because a fixed effect has only 0 outcomes. This means that it does not contain observations removed due to: a) the subset argument, b) the split argument, and c) NA/only-0 in multiple estimations (because a very specific delayed treatment is applied).

But it should work in most cases. To let you know, in 0.9.0 I'll add the obs() function to get the vector of observations used in the estimation, and it should account for everything.

I'm happy you found a solution and sorry for the delay!
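A minimal sketch of how obs() could be used once available, assuming fixest 0.9.0 or later is installed and that obs() returns the row numbers of the observations used, as described above. The subset condition is only an illustrative choice, picked because it is one of the cases obsRemoved does not cover.

```r
library(fixest)

# Same model as in the MWE, restricted with `subset` (illustrative condition).
est <- feols(Ozone ~ Solar.R + Wind + Temp | Month + Day,
             airquality, subset = ~ Month > 5, cluster = ~Day)

# Row numbers (in the original data) of the observations used in the estimation.
idx <- obs(est)

# Estimation sample, analogous to keeping rows where Stata's e(sample) is 1.
est_sample <- airquality[idx, ]

# Sanity check: the extracted sample should match the reported sample size.
stopifnot(nrow(est_sample) == nobs(est))

mean(est_sample$Ozone)   # mean of the dependent variable in the estimation sample
```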

adamaltmejd commented 3 years ago

Was just looking for this -- looking forward to obs() :), will it work even if lean = TRUE?

lrberge commented 3 years ago

@adamaltmejd: of course not! :-D The information is possibly of order n so it is removed.

But in the long run, any command will work even if lean = TRUE, it's only processing time that will be longer.