ntamas / plfit

Fitting power-law distributions to empirical data, according to the method of Clauset, Shalizi and Newman
GNU General Public License v2.0
47 stars 17 forks source link

Large performance discrepancy b/w compiled binary & Python wrapper? #24

Closed jyscao closed 4 years ago

jyscao commented 4 years ago

Hi there, firstly thanks for creating this super performant implementation of the Clauset power-law fitting method.

So following the README instructions, I built the executables as well as the Python module. Using the test dataset data/continuous_data.txt, the compiled executable produced the fitting in about 0.5 sec on my system (timed using the time command); while using the SWIG generated Python wrapper on the same input through an IPython REPL gave the following:

In [1]: import plfit

In [2]: import numpy as np

In [3]: dat = list(np.loadtxt('../../data/continuous_data.txt'))

In [4]: opts = plfit.plfit_continuous_options_t()

In [5]: %time fit = plfit.plfit_continuous(dat, opts)
CPU times: user 4.46 s, sys: 0 ns, total: 4.46 s
Wall time: 4.47 s

The fitted parameters are exactly the same (alpha=2.53282, xmin=1.43628), so that can't be a source of the performance differences, which at least to me is unexpectedly large.

Not sure if I did something wrong during building or in terms of usage. Any ideas?

ntamas commented 4 years ago

Thanks for the comment; this is interesting and I don't quite know how to explain it yet. I was suspecting that there is some overhead associated to converting a Python list to an array-of-floats in the C layer, but this cannot account for such a large difference.

I added some printf() statements in plfit_continuous() in the C layer to see what the "real" difference is in terms of execution time. I get 0.21s on my system from the CLI and 2.08s from Python so there is still a ~10x slowdown.

ntamas commented 4 years ago

Okay, it turns out you also need this line:

opts.xmin_method = plfit.PLFIT_STRATIFIED_SAMPLING

The CLI uses stratified sampling to "guesstimate" xmin before switching to a linear scan on a narrowed interval, while the default options that you create in Python use an exhaustive linear scan.

This is still a bug, though, because plfit_continuous_options_t() is supposed to set up the defaults properly.

ntamas commented 4 years ago

Fixed in 654ab50; the defaults are now set up correctly. I'll publish a patch release soon. Thanks a lot!