ntamas / plfit

Fitting power-law distributions to empirical data, according to the method of Clauset, Shalizi and Newman
GNU General Public License v2.0

Issue with huge dataset #6

Closed · medmediani closed this issue 11 years ago

medmediani commented 12 years ago

I am using this software to fit more than 2M data points, and it is taking far too much time. I would expect that this is due to the algorithm itself rather than the software. Could you please suggest a way to speed it up? Many thanks

ntamas commented 12 years ago

I presume that your dataset contains discrete integer values; otherwise it should not take too long (the continuous case is much simpler). If your data is discrete, you could use the -c switch, which forces plfit to assume that your data is continuous (even though it is discrete). plfit will then run the continuous algorithm, which will give you a more-or-less accurate estimate of the alpha exponent and the xmin parameter. (See the Clauset-Shalizi-Newman paper for possible caveats; in general, you should not use the continuous approximation if your data set is discrete, unless you are impatient and do not mind that the results will not be accurate.)
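
For reference, the continuous case is fast because the maximum-likelihood estimate of alpha has a closed form once xmin is fixed. A minimal sketch of that estimator -- the standard Clauset-Shalizi-Newman formula, not plfit's actual code:

```c
#include <math.h>
#include <stddef.h>

/* ML estimate of alpha for a continuous power law with fixed xmin:
 * alpha = 1 + n / sum(log(x_i / xmin)), over the n points x_i >= xmin. */
double alpha_mle_continuous(const double *xs, size_t n, double xmin)
{
    double log_sum = 0.0;
    for (size_t i = 0; i < n; i++)
        log_sum += log(xs[i] / xmin);
    return 1.0 + (double)n / log_sum;
}
```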

Otherwise, if your dataset is not a secret one, please upload it somewhere so I can download it and take a closer look at it -- maybe I can identify a bottleneck in my code that could be optimized a bit.

medmediani commented 12 years ago

Many thanks for your quick and detailed answer. My data is in fact continuous, so I did indeed use the -c switch. It is not secret, of course: it represents associations between words (a kind of French-English lexicon). I uploaded it here: http://www.4shared.com/zip/yMbwJnm4/rawchi-squaren2fscores.html

ntamas commented 12 years ago

Thanks. I think the problem is that in order to determine the best xmin value, the algorithm tries every candidate xmin, estimates an alpha for each of them, and then selects among the results based on a Kolmogorov-Smirnov test. If you have many distinct values in your data set (which is usually the case for continuous data), the algorithm will try every distinct value as xmin before choosing one.
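
Roughly, the search does something like the sketch below (simplified; the real plfit code differs in the details). Note that the KS pass is linear in the tail size, so scanning every distinct value makes the whole search roughly quadratic in the number of distinct values -- which is why 2M continuous points take so long:

```c
#include <math.h>
#include <float.h>
#include <stddef.h>

/* Exhaustive xmin search for the continuous case (simplified sketch).
 * xs must be sorted in increasing order. */
void scan_all_xmins(const double *xs, size_t n,
                    double *best_xmin, double *best_alpha, double *best_d)
{
    *best_d = DBL_MAX;
    for (size_t start = 0; start + 1 < n; start++) {
        if (start > 0 && xs[start] == xs[start - 1])
            continue;                     /* try each distinct value once */

        double xmin = xs[start];
        size_t m = n - start;             /* tail size: points >= xmin */

        /* closed-form ML estimate of alpha for this xmin */
        double log_sum = 0.0;
        for (size_t i = start; i < n; i++)
            log_sum += log(xs[i] / xmin);
        if (log_sum <= 0.0)
            continue;                     /* degenerate tail */
        double alpha = 1.0 + (double)m / log_sum;

        /* Kolmogorov-Smirnov distance between the empirical CDF of the
         * tail and the fitted CDF 1 - (x/xmin)^(1 - alpha) */
        double d = 0.0;
        for (size_t i = start; i < n; i++) {
            double emp = (double)(i - start + 1) / (double)m;
            double fit = 1.0 - pow(xs[i] / xmin, 1.0 - alpha);
            double diff = fabs(emp - fit);
            if (diff > d)
                d = diff;
        }

        if (d < *best_d) {
            *best_d = d;
            *best_alpha = alpha;
            *best_xmin = xmin;
        }
    }
}
```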

I'm not sure yet what the best strategy would be for cutting down the number of values the algorithm tries (without losing precision), but one thing you could try is this:

  1. Find all the distinct values in your dataset
  2. Select a small fraction of them uniformly (say, sort them in increasing order and take every 100th)
  3. For each selected value, run plfit with the -m argument, which lets you ask plfit to use a single xmin only. Then compare the results you got and select the one with the smallest D statistic (a rough sketch of this loop follows below).
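
Putting steps 1-3 together in C (same math as the exhaustive scan above; each call to the fit_at_xmin helper corresponds to one `plfit -m` run, and the helper itself is a simplified stand-in, not plfit's actual code):

```c
#include <math.h>
#include <float.h>
#include <stddef.h>

/* One fixed-xmin fit: ML alpha and KS statistic D for the tail
 * xs[start..n-1] of the sorted sample -- the computation a single
 * `plfit -m <xmin>` run performs (simplified). */
static double fit_at_xmin(const double *xs, size_t n, size_t start,
                          double *alpha_out)
{
    size_t m = n - start;
    double xmin = xs[start], log_sum = 0.0, d = 0.0;

    for (size_t i = start; i < n; i++)
        log_sum += log(xs[i] / xmin);
    if (log_sum <= 0.0) {                 /* degenerate tail */
        *alpha_out = HUGE_VAL;
        return DBL_MAX;
    }
    double alpha = 1.0 + (double)m / log_sum;

    for (size_t i = start; i < n; i++) {
        double emp = (double)(i - start + 1) / (double)m;
        double fit = 1.0 - pow(xs[i] / xmin, 1.0 - alpha);
        if (fabs(emp - fit) > d)
            d = fabs(emp - fit);
    }
    *alpha_out = alpha;
    return d;
}

/* Steps 1-3: try only every `stride`-th distinct value as xmin and
 * keep the candidate with the smallest D. */
void scan_subsampled(const double *xs, size_t n, size_t stride,
                     double *best_xmin, double *best_alpha, double *best_d)
{
    size_t seen = 0;
    *best_d = DBL_MAX;
    for (size_t start = 0; start + 1 < n; start++) {
        if (start > 0 && xs[start] == xs[start - 1])
            continue;                     /* step 1: distinct values only */
        if (seen++ % stride != 0)
            continue;                     /* step 2: every stride-th one */

        double alpha;                     /* step 3: fixed-xmin fit */
        double d = fit_at_xmin(xs, n, start, &alpha);
        if (d < *best_d) {
            *best_d = d;
            *best_alpha = alpha;
            *best_xmin = xs[start];
        }
    }
}
```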

I'm hoping to make this automatic in the next release if I figure out a good and reliable way of doing this, but it would need some more testing.

medmediani commented 12 years ago

Thanks a lot Tamas, it works much better this way. I did a stratified sampling and selected the boundaries with the smallest D, then ran an exhaustive search over the selected boundaries and took the smallest xmin with the smallest D (the smallest I could get was 0.03227). I hope this will be reliable for my future analyses. Thanks again,

medmediani commented 12 years ago

I would recommend implementing a Powell search for large datasets. In my experience, it gives reasonably good solutions within a reasonable time compared to L-BFGS, simplex, and many other heuristic optimization methods (we actually use it to tune tens of weights of a log-linear model).

ntamas commented 12 years ago

Thanks for the suggestion. I have used the L-BFGS method (although only in the discrete case; it is not used in the continuous case) because I found a nice, self-contained, BSD-licensed implementation of it, so I did not have to brew my own. If you are aware of such an implementation of the Powell search, let me know.

medmediani commented 12 years ago

Unfortunately, I don't know of any such library for Powell search in C (I personally use the scipy optimization library). However, a basic implementation of the algorithm in C can be found in Numerical Recipes in C, Chapter 10: http://www.fizyka.umk.pl/~jacek/docs/nrc/bookcpdf.html

ntamas commented 11 years ago

For the record: the current version uses golden section search to find the optimal value of xmin in the continuous case, and falls back to the former exhaustive search method if the golden section search fails. This works on the dataset mentioned by the poster, providing results almost immediately. I consider this issue fixed for the time being.
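
For reference, the generic golden section scheme looks roughly like the sketch below (simplified, not the exact code in plfit). Here f would evaluate the KS statistic D of a fixed-xmin fit; golden section needs only one new evaluation of f per iteration, but it assumes D is unimodal in xmin -- which is exactly why the exhaustive fallback is kept:

```c
#include <math.h>

/* Generic golden section search: finds the minimizer of f on [a, b],
 * assuming f is unimodal there.  If that assumption fails, the bracket
 * can converge to the wrong place -- hence the fallback to the
 * exhaustive scan. */
double golden_section_min(double (*f)(double x, void *ctx), void *ctx,
                          double a, double b, double tol)
{
    const double invphi = (sqrt(5.0) - 1.0) / 2.0;   /* 1/phi ~ 0.618 */
    double c = b - invphi * (b - a);
    double d = a + invphi * (b - a);
    double fc = f(c, ctx), fd = f(d, ctx);

    while (b - a > tol) {
        if (fc < fd) {              /* minimum lies in [a, d] */
            b = d; d = c; fd = fc;
            c = b - invphi * (b - a);
            fc = f(c, ctx);
        } else {                    /* minimum lies in [c, b] */
            a = c; c = d; fc = fd;
            d = a + invphi * (b - a);
            fd = f(d, ctx);
        }
    }
    return (a + b) / 2.0;
}
```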