Performance enhancements of conditional logit

mathijsvdv commented 3 years ago

In the past, I've used Pylogit (specifically the MNL) on a large dataset of 200mln rows. I have noticed two bottlenecks:

The sparse matrix structure for weights_per_obs is not always kept, causing a 200 mln x 200 mln dense numpy array to be created, see also issue #79.
The derivatives dh_dv for a conditional logit represent an identity matrix but are coded as a csr_matrix. This causes the calculation dh_dv.dot(design) to be relatively slow even though its result is trivially design.

To remedy the first bottleneck, I used the same solution proposed in issue #79.

For the second bottleneck, I made an efficient identity_matrix class (derived from scipy's spmatrix). When such an identity matrix I is multiplied with A using I.dot(A) we get A again.

I've run a benchmark by making a script that estimates an MNL on the usual Swiss-Metro dataset. I ran the line-profiler on some of the critical functions, namely calc_gradient and calc_fisher_info_matrix. In summary, this change reduced the computation time of calc_gradient by 26% (from 0.080697 to 0.059372), and that of calc_fisher_info_matrix by 99% (!) (from 0.906896s to 0.0062323s).

Profiling results are attached. profile_before.txt profile_after.txt

timothyb0912 commented 3 years ago

Hi @mathijsvdv this is great! Wow:

the speedups are fantastic
the profiling is much appreciated
the super well documented class and tests are much appreciated.

Thanks also for your patience as I've been much delayed in responding to this PR and to the issue that spawned it. I should be able to take a look at this within a few days and update the package.

Thanks again for your help!

mathijsvdv commented 3 years ago

Glad you like it @timothyb0912 ! I really appreciate all the work you've done to make a flexible logit estimation suite so I'm happy to help!

timothyb0912 / pylogit

Performance enhancements of conditional logit #81