jeffalstott / powerlaw

Fitting user-specified PDF, e.g. power spectral density #62

Open smartass101 opened 6 years ago

smartass101 commented 6 years ago

I really like your software; it makes it easier to judge the hype of powerlaws in datasets. However, right now it focuses on fitting full datasets, creating their PDF and CDF on the fly. I'd like to use it in situations where I already have a PDF (defined at several points), or more generally a distribution function of some sort, and fit its shape in some range. An example is the power spectral density of fluctuations in turbulent plasmas, where there is an ongoing discussion about whether they are powerlaws or exponentials.

I'd be willing to contribute modifications to powerlaw that would make this optional use case possible, but I would greatly appreciate it if you could point out how best to approach this issue.

jeffalstott commented 6 years ago

Thanks, Ondrej!

For your needs, this is the relevant paper: https://projecteuclid.org/euclid.aoas/1396966280

I haven't implemented it, but there may be an implementation somewhere here: http://tuvalu.santafe.edu/~aaronc/powerlaws/

If a good implementation were created for powerlaw, I'd happily bring it on board.

On Tue, Oct 2, 2018 at 10:20 AM Ondrej Grover notifications@github.com wrote:

I really like your software; it makes it easier to judge the hype of powerlaws in datasets. However, right now it focuses on fitting full datasets, creating their PDF and CDF on the fly. I'd like to use it in situations where I already have a PDF (defined at several points), or more generally a distribution function of some sort, and fit its shape in some range. An example is the power spectral density of turbulent plasmas, where there is an ongoing discussion https://dx.doi.org/10.1103/PhysRevLett.107.185003 about whether they are powerlaws or exponentials.

I'd be willing to contribute modifications to powerlaw that would make this optional use case possible, but I would greatly appreciate it if you could point out how best to approach this issue.

smartass101 commented 6 years ago

Thank you for the reply. My naive hope was that it would suffice to let the user specify the CDF and bins directly, i.e. set self.fitting_cdf_bins and self.fitting_cdf without the actual data, instead of computing them from the data as is done now. I would then probably have to change later operations to work on the CDF instead of the data itself. Perhaps a reasonable approach would be to wrap the data in some object that exposes methods such as cdf. This would separate the source of the information on the data distribution from the actual calculations with the distribution. But perhaps I have missed some part where access to the actual data is necessary. What do you think about this approach?
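To make the wrapper idea concrete, here is a minimal sketch in plain Python. The class name `TabulatedDistribution` and its two constructors are hypothetical and are not part of powerlaw; the sketch only assumes that a density tabulated at sorted points can be integrated with the trapezoidal rule.

```python
from bisect import bisect_right
from itertools import accumulate


class TabulatedDistribution:
    """Hypothetical wrapper (not powerlaw API): sorted points plus CDF values."""

    def __init__(self, points, cdf_values):
        self.points = list(points)
        self.cdf_values = list(cdf_values)

    @classmethod
    def from_samples(cls, samples):
        # Empirical CDF of raw data: fraction of samples <= x.
        xs = sorted(samples)
        n = len(xs)
        return cls(xs, [(i + 1) / n for i in range(n)])

    @classmethod
    def from_density(cls, points, density):
        # Trapezoidal integration of a density tabulated at sorted points,
        # normalised so the CDF runs from 0 to 1 over the tabulated range.
        areas = [0.0] + [
            0.5 * (density[i] + density[i + 1]) * (points[i + 1] - points[i])
            for i in range(len(points) - 1)
        ]
        cumulative = list(accumulate(areas))
        total = cumulative[-1]
        return cls(points, [c / total for c in cumulative])

    def cdf(self, x):
        # Step-function lookup: CDF value at the last tabulated point <= x.
        i = bisect_right(self.points, x)
        return 0.0 if i == 0 else self.cdf_values[i - 1]
```

Code that only ever calls `cdf(x)` would not need to know which constructor produced the object. Whether the fitting code can be restricted to such calls is exactly the open question above.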

smartass101 commented 6 years ago

I also found out that their implementation of the operations on binned data is available at http://tuvalu.santafe.edu/~aaronc/powerlaws/bins/

jeffalstott commented 6 years ago

The methods currently in powerlaw do not do fitting based on the binned data; they work directly on the data points themselves. Binning is done only for visualizing PDFs (in a sense there is no binning for CDFs, which is actually a major reason to use them for visualization, as they do less damage to the data in presentation).
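As an illustration of what fitting on the data points themselves means, here is a minimal sketch of the standard maximum-likelihood estimator for a continuous power law. It is not powerlaw's actual code, and the helper names `fit_alpha` and `sample_power_law` are made up for this example. Note that no histogram appears anywhere in the estimate.

```python
import math
import random


def fit_alpha(data, xmin):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x / xmin)) over x >= xmin."""
    tail = [x for x in data if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)


def sample_power_law(alpha, xmin, n, rng):
    # Inverse-transform sampling: x = xmin * (1 - u) ** (-1 / (alpha - 1)).
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


rng = random.Random(0)
data = sample_power_law(alpha=2.5, xmin=1.0, n=20000, rng=rng)
alpha_hat = fit_alpha(data, xmin=1.0)  # uses every point above xmin directly
```

Because the estimator sums over individual observations, a tabulated PDF or PSD cannot simply be dropped in as `data`.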

smartass101 commented 6 years ago

I've been reading that article and I began to realize that it may not be directly applicable to the PSD case. The reason is that most algorithms (FFT or wavelet) do not give the PSD as a histogram, but rather actual point-wise estimates, i.e. PSD(f_k) for all f_k. The f_k can be spaced either linearly (usually the case with FFT-based algorithms) or logarithmically (often the case in continuous wavelet analysis).

A dirty (probably not completely wrong, but neither right) workaround would be to generate surrogate datasets based on the pdf given by the PSD. I've seen it done e.g. here.
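A minimal sketch of that surrogate workaround, assuming the PSD values are treated as an unnormalised density over frequency. The function `surrogate_samples` is hypothetical. Each point is weighted by PSD(f_k) times the width of the interval it represents, so that linearly and logarithmically spaced grids are handled consistently.

```python
import random


def surrogate_samples(freqs, psd, n, rng):
    """Draw n frequencies with probability proportional to PSD(f_k) * width_k."""
    widths = []
    last = len(freqs) - 1
    for k in range(len(freqs)):
        # Average of the gaps to both neighbours; endpoints reuse their single gap.
        left = freqs[k] - freqs[k - 1] if k > 0 else freqs[1] - freqs[0]
        right = freqs[k + 1] - freqs[k] if k < last else freqs[-1] - freqs[-2]
        widths.append(0.5 * (left + right))
    weights = [p * w for p, w in zip(psd, widths)]
    return rng.choices(freqs, weights=weights, k=n)


# Example: a PSD falling off as f**-2 on a linearly spaced grid.
freqs = [float(k) for k in range(1, 101)]
psd = [f ** -2.0 for f in freqs]
surrogate = surrogate_samples(freqs, psd, n=5000, rng=random.Random(1))
```

The surrogate values fall only on the grid points f_k and the choice of `n` is arbitrary, which is part of why this is a workaround rather than a principled fit.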

Perhaps I should get in touch with Clauset and ask him for guidance in this.

smartass101 commented 6 years ago

Clauset seems to be on sabbatical. I had another idea: perhaps I could simply use the Kolmogorov-Smirnov test to determine the "distance" between the PSD and a given distribution. Chi^2 might be an alternative. But that would mean determining the fitted parameters and f_k_min at the same time; I'm not sure whether that would be a problem.
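A rough sketch of that idea, under the assumption that both the PSD and the model are normalised over the same grid points f_k >= f_k_min and compared through their cumulative sums. The names `ks_distance` and `scan` are made up for this example, and a plain grid search stands in for a proper fit.

```python
def cumulative_share(values):
    # Running sum divided by the total, i.e. a discrete CDF over the grid.
    total = sum(values)
    out, running = [], 0.0
    for v in values:
        running += v
        out.append(running / total)
    return out


def ks_distance(freqs, psd, f_min, alpha):
    """Max gap between cumulative PSD and cumulative f**-alpha for f >= f_min."""
    pairs = [(f, p) for f, p in zip(freqs, psd) if f >= f_min]
    observed = cumulative_share([p for _, p in pairs])
    model = cumulative_share([f ** -alpha for f, _ in pairs])
    return max(abs(o - m) for o, m in zip(observed, model))


def scan(freqs, psd, f_min_candidates, alphas):
    """Grid search: return (distance, f_min, alpha) with the smallest distance."""
    return min(
        (ks_distance(freqs, psd, f_min, a), f_min, a)
        for f_min in f_min_candidates
        for a in alphas
    )


# Synthetic PSD: flat below f = 10, then falling off as f**-2.
freqs = [float(k) for k in range(1, 201)]
psd = [max(f, 10.0) ** -2.0 for f in freqs]
best = scan(freqs, psd, [1.0, 5.0, 10.0, 20.0], [1.0, 1.5, 2.0, 2.5, 3.0])
```

Since the PSD values are point estimates rather than independent samples, the usual KS significance levels presumably do not carry over. The distance here is only a ranking score for candidate (f_k_min, alpha) pairs.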

smartass101 commented 6 years ago

Mentioning directly @aaronclauset in case you have time (and interest) to comment.