lrberge / fixest

Fixed-effects estimations
https://lrberge.github.io/fixest/

Conflicting benchmark results #402

Closed: waynelapierre closed this issue 8 months ago

waynelapierre commented 1 year ago

The Julia package FixedEffectModels claims on its website that it is faster than fixest: https://github.com/FixedEffects/FixedEffectModels.jl. Meanwhile, the fixest website provides conflicting benchmark results: https://github.com/lrberge/fixest. So, which one is actually faster? I also found this post: https://discourse.julialang.org/t/my-experience-as-a-julia-and-r-user/83613/9.

grantmcdermott commented 1 year ago

The correct answer, which I think is fairly represented by the benchmarks on this site, is "it depends". Benchmarks are notoriously sensitive to the specification setup, especially when two libraries are as closely matched as fixest and FixedEffectModels. This is clearly evident from the richer suite of benchmarks that Laurent has provided on the fixest page: in some cases fixest is faster, in others FixedEffectModels is faster. Both are very well written libraries and tend to be a lot faster than the other options on the table.

Now, I will say that these benchmarks do not capture some of the unique strengths of fixest, e.g. the ability to efficiently estimate multiple models simultaneously, fast shortcuts for interacted FEs, on-the-fly SE switching, and so on, not to mention the additional flexibility of handling various non-linear model families. (FixedEffectModels only handles linear models, IIRC.) If you take these into consideration, then fixest generally occupies a unique pole position, as it were.
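To give a rough flavour of the syntax I'm referring to, here's a small sketch on toy data (all variable names below are placeholders, nothing to do with the benchmark):

```r
## Sketch of some fixest conveniences on simulated toy data (placeholder names only).
library(fixest)

set.seed(42)
n   <- 1e4
dat <- data.frame(
  y1  = rnorm(n), y2 = rnorm(n),
  x1  = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
  fe1 = sample(letters, n, replace = TRUE),
  fe2 = sample(1:50,    n, replace = TRUE),
  id  = sample(1:500,   n, replace = TRUE)
)

## 1) Multiple models in one call: two LHS variables and stepwise RHS terms.
ests <- feols(c(y1, y2) ~ x1 + sw(x2, x3) | fe1 + fe2, data = dat)
etable(ests)

## 2) Interacted ("combined") fixed effects via the ^ operator.
est <- feols(y1 ~ x1 | fe1^fe2, data = dat)

## 3) Standard errors switched on the fly, without re-estimating anything.
summary(est, vcov = "hetero")
summary(est, vcov = ~id)   # clustered by id
```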

As it happens, I have a set of standing benchmarks that I keep for myself on various data tasks across multiple languages (R, Python, Julia, Stata). I use the well-known New York taxi data; it's not perfect, but I prefer using real-life datasets to simulated ones. Here are the results for a single (linear) FE regression on 3 months' worth of taxi data (~45m rows). The task is to regress tip_amount on trip_distance + passenger_count, controlling for three separate fixed effects: day_of_week + vendor_id + payment_type.

fixest::feols (R):          15.2 sec
FixedEffectModels (Julia):  19.6 sec
reghdfe (Stata):           357.0 sec
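For the record, the R side of that task boils down to a single feols call roughly along these lines (a sketch, not my actual benchmark script; it assumes the three months of taxi data are already loaded into a data frame called `taxi` with the columns named above):

```r
## Sketch of the R call behind the feols timing above. Assumes the taxi data are
## already in memory as a data frame `taxi` with the columns named in the text.
library(fixest)

setFixest_nthreads(6)   # match the 6-thread setting used in the benchmarks

est <- feols(
  tip_amount ~ trip_distance + passenger_count |   # covariates | fixed effects
    day_of_week + vendor_id + payment_type,
  data = taxi
)

summary(est)
```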

Do I think this particular benchmark is representative of every possible use case? Of course not. But I do think you'd be hard-pressed to find a context where fixest isn't either the fastest option or a close second. And you might say the same for FixedEffectModels.

(I should add that other libraries like reghdfe and lfe also have much to be said in their favour, particularly since they paved the way for these kinds of HDFE libraries.)

PS. The above benchmarks all impose 6 parallel threads and use the latest available versions of each package. If I instead impose single threading, then feols increases to 23.4s and FixedEffectModels to 24.3s.

lrberge commented 8 months ago

Thanks Grant for the answer.

@waynelapierre: there is no benchmark that handles all possible cases. We're dealing with complex algorithms with an enormous number of branches, so by definition it is impossible to summarize their performance in one run.

Simple differences, like whether the data are stored as integers instead of doubles, or whether the fixed-effects come as characters instead of numerics, can have an impact on performance. Furthermore, since the demeaning algorithms are different, their convergence properties are different. This does not matter in simple cases, but in complex situations (with difficult convergence) one algorithm can be better than the other, and in another situation it's the reverse. Some algorithms have overheads (like fixest's), others don't, which shows up in benchmarks with small n. Some algorithms prepare data for later stages (fixest, for instance, anticipates that something will be done with the standard errors, so the scores are computed in advance, and this is costly), others don't, which also creates differences.
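As a purely illustrative sketch of the storage-type point (simulated data, not a benchmark; the size of the gap, if any, will depend on the machine and the dataset): the identical regression run with the fixed effect stored as character versus as integer.

```r
## Same regression, same information, different storage type for the fixed effect.
library(fixest)

set.seed(1)
n  <- 1e6
df <- data.frame(
  y      = rnorm(n),
  x      = rnorm(n),
  id_chr = sample(sprintf("firm_%04d", 1:5000), n, replace = TRUE)
)
df$id_int <- as.integer(factor(df$id_chr))   # same groups, numeric storage

system.time(feols(y ~ x | id_chr, data = df))
system.time(feols(y ~ x | id_int, data = df))
```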

All this to say that unless the differences in performance are blatant (over 3 times on large data, and persistent as the data scale), benchmarks to determine who's the fastest are a moot point.