Open mathijsvdv opened 3 years ago
Hi @mathijsvdv, this is great! Wow!
Thanks also for your patience as I've been much delayed in responding to this PR and to the issue that spawned it. I should be able to take a look at this within a few days and update the package.
Thanks again for your help!
Glad you like it @timothyb0912 ! I really appreciate all the work you've done to make a flexible logit estimation suite so I'm happy to help!
In the past, I've used Pylogit (specifically the `MNL` model) on a large dataset of 200 mln rows. I have noticed two bottlenecks:

1. `weights_per_obs` is not always kept, causing a 200 mln x 200 mln dense numpy array to be created; see also issue #79.
2. `dh_dv` for a conditional logit represents an identity matrix but is coded as a `csr_matrix`. This causes the calculation `dh_dv.dot(design)` to be relatively slow even though its result is trivially `design`.

To remedy the first bottleneck, I used the same solution proposed in issue #79.
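To see why the first bottleneck matters, here is a hypothetical sketch of the general principle (not the exact code from issue #79): materializing per-observation weights as a dense diagonal matrix costs O(n^2) memory, while row-wise broadcasting produces the same product with only O(n) extra memory.

```python
import numpy as np

n, k = 5, 3  # tiny stand-in for the 200 mln-row case
rng = np.random.default_rng(0)
weights = rng.random(n)      # one weight per observation
design = rng.random((n, k))  # design matrix

# Naive: a dense n x n diagonal matrix -- infeasible when n is huge.
dense_result = np.diag(weights).dot(design)

# Memory-friendly: broadcast the weights over the rows instead.
broadcast_result = weights[:, None] * design

assert np.allclose(dense_result, broadcast_result)
```

At n = 200 mln the dense diagonal would require on the order of n^2 entries, which is why keeping the weights as a 1-D array (or a sparse diagonal) is essential.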
For the second bottleneck, I made an efficient `identity_matrix` class (derived from scipy's `spmatrix`). When such an identity matrix `I` is multiplied with `A` using `I.dot(A)`, we get `A` again.

I've run a benchmark by making a script that estimates an
`MNL` on the usual Swiss-Metro dataset. I ran `line-profiler` on some of the critical functions, namely `calc_gradient` and `calc_fisher_info_matrix`. In summary, this change reduced the computation time of `calc_gradient` by 26% (from 0.080697 s to 0.059372 s) and that of `calc_fisher_info_matrix` by 99% (!) (from 0.906896 s to 0.0062323 s).

Profiling results are attached: profile_before.txt profile_after.txt
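The short-circuit behaviour behind the speedup can be sketched with a minimal stand-in class. This is a hypothetical illustration only; the PR's actual `identity_matrix` derives from scipy's `spmatrix` and supports the full sparse interface.

```python
import numpy as np

class IdentityMatrix:
    """Hypothetical minimal identity matrix: I.dot(A) returns A itself."""

    def __init__(self, n):
        self.shape = (n, n)

    def dot(self, other):
        # Multiplying by the identity is a no-op, so skip the sparse
        # product entirely and hand back the operand unchanged.
        if other.shape[0] != self.shape[1]:
            raise ValueError("dimension mismatch")
        return other

design = np.arange(6.0).reshape(3, 2)
dh_dv = IdentityMatrix(3)
assert dh_dv.dot(design) is design  # no copy, no multiplication
```

Compared with `scipy.sparse.identity(n, format="csr").dot(design)`, which performs a real sparse-dense product, returning the operand directly costs O(1), which is consistent with the large reduction seen in `calc_fisher_info_matrix`.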