Closed medmediani closed 11 years ago
I presume that your dataset contains discrete integer values, otherwise it should not take too long (the continuous case is much simpler). If your data is discrete, you could use the `-c` switch, which forces plfit to assume that your data is continuous (even though it is discrete). plfit will then run the continuous algorithm, which will give you a more-or-less accurate estimate of the `alpha` exponent and the `xmin` parameter. (See the Clauset-Shalizi-Newman paper for possible caveats; in general, you should not use the continuous approximation if your data set is discrete, unless you are impatient and you don't mind that the results will not be accurate.)
Otherwise, if your dataset is not a secret one, please upload it somewhere so I could download it and take a closer look at it -- maybe I could identify a bottleneck in my code which could be optimized a bit.
Many thanks for your quick and detailed answer. My data is in fact continuous, so I did indeed use the `-c` switch. It is not secret, of course: it represents associations between words (a kind of fr-en lexicon). I uploaded it here: http://www.4shared.com/zip/yMbwJnm4/rawchi-squaren2fscores.html
Thanks. I think that the problem is that in order to determine the best `xmin` value, the algorithm tries all the `xmin` values, estimates an `alpha` for all of them and then selects from the results based on a Kolmogorov-Smirnov test. If you have many distinct values in your data set (which is usually the case for continuous values), the algorithm will try all distinct values for `xmin` before choosing one.
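To make the cost concrete, here is a rough Python sketch of that scan for the continuous case (a simplification along the lines of the Clauset-Shalizi-Newman recipe, not plfit's actual implementation); with about 2M mostly distinct values, the loop below runs roughly 2M times, and each pass touches a large tail of the data:

```python
import numpy as np

def fit_power_law_continuous(data):
    """Try every distinct value as xmin, fit alpha by ML, and keep the candidate
    whose fitted CDF is closest to the empirical one (smallest K-S statistic D).
    The cost grows with the number of distinct values in the data set."""
    data = np.sort(np.asarray(data, dtype=float))
    best = None
    for xmin in np.unique(data)[:-1]:              # every distinct value is a candidate
        tail = data[data >= xmin]
        n = len(tail)
        alpha = 1.0 + n / np.sum(np.log(tail / xmin))
        cdf_fit = 1.0 - (tail / xmin) ** (1.0 - alpha)   # fitted power-law CDF
        d = np.max(np.abs(cdf_fit - np.arange(n) / n))   # Kolmogorov-Smirnov distance
        if best is None or d < best[2]:
            best = (xmin, alpha, d)
    return best                                    # (xmin, alpha, D)
```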
I'm not sure yet what the best strategy would be for cutting down the number of values that the algorithm tries (without losing precision), but one thing you could try is to run `plfit` with the `-m` argument for a handful of candidate values; `-m` lets you ask `plfit` to use a single `xmin` only. Then compare the results you got and select the one with the smallest `D` statistic. I'm hoping to make this automatic in the next release if I figure out a good and reliable way of doing it, but that would need some more testing.
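For example, a coarse version of that idea in Python, evaluating only a small grid of quantiles as candidate `xmin` values and keeping the one with the smallest `D` (the single-`xmin` fit below is a pure-Python stand-in for a `plfit -m <xmin>` run, not what plfit itself does):

```python
import numpy as np

def fit_at_xmin(data, xmin):
    """Stand-in for a single run with a fixed xmin (what the -m switch asks for):
    closed-form ML alpha plus the Kolmogorov-Smirnov distance D for that xmin."""
    tail = np.sort(data[data >= xmin])
    n = len(tail)
    alpha = 1.0 + n / np.sum(np.log(tail / xmin))
    cdf_fit = 1.0 - (tail / xmin) ** (1.0 - alpha)
    d = np.max(np.abs(cdf_fit - np.arange(n) / n))
    return alpha, d

def coarse_xmin_search(data, n_candidates=50):
    """Try only a coarse grid of quantiles as xmin candidates and keep the smallest D.
    Much cheaper than scanning every distinct value, at some cost in precision."""
    data = np.asarray(data, dtype=float)
    grid = np.quantile(data, np.linspace(0.0, 0.99, n_candidates))
    results = [(xmin,) + fit_at_xmin(data, xmin) for xmin in np.unique(grid)]
    return min(results, key=lambda r: r[2])        # (xmin, alpha, D) with smallest D
```

A coarse pass like this can then be refined with a second, finer scan around the most promising candidates.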
Thanks a lot, Tamas, it works much better this way. So I did a stratified sampling, selected the boundaries with the smallest `D`, then ran an exhaustive search within the selected boundaries and selected the smallest `xmin` with the smallest `D` (the smallest `D` I could get was 0.03227). I hope this will be reliable for my future analyses. Thanks again.
I would recommend implementing a Powell search for large datasets. In my experience, it gives reasonably good solutions within a reasonable time compared to L-BFGS, the simplex method, and many other heuristic optimization methods (we actually use it to tune tens of weights of a log-linear model).
Thanks for the suggestion. I used the L-BFGS method (although only in the discrete case; it is not used in the continuous one) because I found a nice, self-contained BSD-licensed implementation of it, so I did not have to brew my own. If you are aware of a similar implementation of Powell's method, let me know.
Unfortunately, I don't know of any such library for Powell's method in C (I personally use the scipy optimization library). However, a basic implementation of the algorithm in C can be found in Numerical Recipes in C, Chapter 10: http://www.fizyka.umk.pl/~jacek/docs/nrc/bookcpdf.html
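For what it's worth, scipy exposes Powell's method through `scipy.optimize.minimize`; a minimal, self-contained example on a stand-in objective (scipy's Rosenbrock test function, not plfit's actual objective), just to show the derivative-free usage with a couple of dozen parameters:

```python
import numpy as np
from scipy.optimize import minimize, rosen

# Powell's method needs no gradients, which is convenient when the objective
# (a K-S distance, or the error metric of a log-linear model) is not smooth.
x0 = np.full(20, 0.5)   # twenty parameters, arbitrary starting point
res = minimize(rosen, x0, method="Powell", options={"xtol": 1e-6, "maxiter": 10000})
print(res.x, res.fun, res.nfev)
```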
For the record: the current version uses golden section search to find the optimal value of `xmin` in the continuous case and falls back to the former exhaustive search method if the golden section search fails. This seems to work on the dataset mentioned by the poster, providing results almost immediately. I consider this issue fixed for the time being.
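For readers who land here later, the general idea is roughly the following (a generic Python sketch of golden section search, not the actual C code in plfit, which also needs the exhaustive fallback because `D` as a function of `xmin` is not guaranteed to be unimodal):

```python
import math

def golden_section_minimize(f, a, b, tol=1e-6):
    """Minimize an (assumed unimodal) function f on [a, b] by golden section search.
    Each iteration shrinks the bracket by the inverse golden ratio (~0.618), so only
    O(log((b - a) / tol)) evaluations of f are needed instead of an exhaustive scan."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                     # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                           # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2.0
```

In plfit's case the function being minimized is the `D` statistic as a function of `xmin`.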
I am using this software to try to fit more than 2M data points, and it is taking far too much time. I would expect that this is due to the algorithm itself rather than to the software. Could you please suggest a way to speed it up? Many thanks