timothyb0912 / pylogit

A python package for estimating conditional logit models.
https://pypi.org/project/pylogit/
BSD 3-Clause "New" or "Revised" License

Performance enhancements of conditional logit #81

Open mathijsvdv opened 3 years ago

mathijsvdv commented 3 years ago

In the past, I've used Pylogit (specifically the MNL) on a large dataset of 200 million rows. I noticed two bottlenecks:

  1. The sparse matrix structure for weights_per_obs is not always kept, which causes a dense 200 million x 200 million NumPy array to be created (see also issue #79).
  2. The derivatives dh_dv for a conditional logit represent an identity matrix but are coded as a csr_matrix. This causes the calculation dh_dv.dot(design) to be relatively slow even though its result is trivially design.

To remedy the first bottleneck, I used the same solution proposed in issue #79.
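To give a sense of why the first bottleneck matters, here is a minimal sketch. It is not the fix from issue #79; the small `rows_to_obs` matrix and the `weights` vector are hypothetical stand-ins used only for illustration. It computes how much memory a dense 200 million x 200 million float64 array would need, and shows one way to scale the rows of a sparse matrix without leaving sparse storage.

```python
import numpy as np
from scipy import sparse

# Memory needed if a 200 million x 200 million float64 array were made dense.
n_rows = 200_000_000
dense_bytes = n_rows * n_rows * np.dtype(np.float64).itemsize
print(f"dense array would need about {dense_bytes / 1e15:.0f} petabytes")

# Hypothetical toy mapping: 5 long-format rows belonging to 2 observations.
rows_to_obs = sparse.csr_matrix(
    np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
)
weights = np.array([2.0, 2.0, 0.5, 0.5, 0.5])  # hypothetical per-row weights

# Densifying first (what we want to avoid on large data):
weighted_dense = rows_to_obs.toarray() * weights[:, None]

# Staying sparse: left-multiply by a sparse diagonal matrix of weights.
weighted_sparse = sparse.diags(weights).dot(rows_to_obs)
```

Both routes give the same numbers on this toy input, but only the second keeps storage proportional to the number of nonzero entries.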

For the second bottleneck, I made an efficient identity_matrix class (derived from scipy's spmatrix). When such an identity matrix I is multiplied with a matrix A using I.dot(A), it simply returns A.
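The idea can be sketched as follows. This is not the class from the PR: `IdentityOperator` is a hypothetical, simplified stand-in that does not derive from scipy's spmatrix and only implements `dot`. It is compared against an identity stored as a csr_matrix, which is how the text above describes dh_dv being coded.

```python
import numpy as np
from scipy import sparse


class IdentityOperator:
    """Hypothetical stand-in for an n x n identity matrix that stores no data."""

    def __init__(self, n):
        self.shape = (n, n)

    def dot(self, other):
        # I.dot(A) equals A, so return the operand itself after a shape check.
        if other.shape[0] != self.shape[1]:
            raise ValueError("dimension mismatch")
        return other

    __matmul__ = dot


rng = np.random.default_rng(0)
design = rng.normal(size=(1000, 5))  # toy design matrix

# Identity stored as a csr_matrix: the product is actually computed.
dh_dv_sparse = sparse.identity(design.shape[0], format="csr")
result_sparse = dh_dv_sparse.dot(design)

# Identity as a lightweight operator: no arithmetic and no copy.
dh_dv_fast = IdentityOperator(design.shape[0])
result_fast = dh_dv_fast.dot(design)
```

One design consequence worth noting: the operator returns the same object rather than a copy, so code that later modifies the result in place would also modify `design`.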

I benchmarked the change with a script that estimates an MNL on the usual Swiss-Metro dataset, running the line-profiler on two of the critical functions, calc_gradient and calc_fisher_info_matrix. In summary, this change reduced the computation time of calc_gradient by 26% (from 0.080697s to 0.059372s), and that of calc_fisher_info_matrix by 99% (!) (from 0.906896s to 0.0062323s).
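For reference, the percentages follow directly from the timings quoted above. The short check below uses only those four numbers; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(before, after):
    """Fraction of the original time that was saved."""
    return (before - after) / before


# Timings in seconds, as quoted in the comment above.
grad_before, grad_after = 0.080697, 0.059372
fisher_before, fisher_after = 0.906896, 0.0062323

grad_saving = relative_reduction(grad_before, grad_after)
fisher_saving = relative_reduction(fisher_before, fisher_after)

print(f"calc_gradient: {grad_saving:.1%} faster")
print(f"calc_fisher_info_matrix: {fisher_saving:.1%} faster")
```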

Profiling results are attached: profile_before.txt, profile_after.txt

timothyb0912 commented 3 years ago

Hi @mathijsvdv this is great! Wow:

Thanks also for your patience as I've been much delayed in responding to this PR and to the issue that spawned it. I should be able to take a look at this within a few days and update the package.

Thanks again for your help!

mathijsvdv commented 3 years ago

Glad you like it @timothyb0912 ! I really appreciate all the work you've done to make a flexible logit estimation suite so I'm happy to help!