mzechmeister / GLS

Generalised Lomb-Scargle periodogram
MIT License
35 stars 5 forks source link

How to interpret p and f output to determine periodicity/frequency? #4

Closed fastrocket closed 5 years ago

fastrocket commented 5 years ago

Hi, thanks for this library. I converted the C code to C# to test in my program (I'll submit a PR for that if you think it useful).

However, I'm new to GLS and periodograms, and am not sure how to interpret the result. I'm using it to detect a periodic signal that is randomly sampled.

Here is my setup. The t are 1000 UNIX timestamps (milliseconds since 1970) and the y are (1 / latency). Therefore, if the latency is low (good), then it will be a higher quality signal therefore a larger (good) value; whereas larger latency means a lower value. For e_y, I used the unmodified latency value itself which seemed to give a good result, but this could be omitted for similar results.

For the output, I selected:

fbeg = 0.0 nk = 1000 fsteps = 0.001

Graph of the resulting p array:

image

And the results from the GLS function are:

The two peaks occur at index 64 and 936. Here is the f array:

*f[64] = 0.064 f[936] = 0.936

The p values can change from run to run, but are roughly in the range of 0.04 to 0.07:

p[64] = 0.0384… p[936] = 0.0384…

So from this test, I'd expect the frequency to be centered between 21-35 milliseconds, but mostly on the lower end of that since my test emitter is sending them out roughly at this rate.

I'll be testing on live data where I don't control the emitter's rate.

So how do I interpret these results to arrive at the frequency? Thanks in advance for any tips.

mzechmeister commented 5 years ago

The second peak is an alias peak. Alias peaks arise when computing the periodogram beyond the Nyquist frequency which is well defined for regular sampling. As an example the following figure shows regular sampled data with out noise and two periods which exactly fits the data: The relation between alias frequencies is

where

Since there is practically no difference between positive and negative frequencies (the periodogram is symmetric), negative aliases are mirrored into the positive frequencies range.

I happy to host the GLS in other programming languages. If you make an PR, the code should be of course functional. And please provide also an example.

fastrocket commented 5 years ago

Sorry, I'm still not sure how to get the true frequency from your equation above.

I figured the second peak was some mirror of the first peak since it visually looks like a mirror (I experimented with various parameters to "look" at the output graphs). I also tested the algorithm by computing Sine of the irregularly sampled timestamps and verified that it seems to show only one peak. Based on that, I'm assuming that the alias frequency is a curiosity, but not a requirement to determine the true frequency.

(The C# conversion was fairly trivial. I'll put something together and make a PR. For the example, do you just want a simple test console program?)

mzechmeister commented 5 years ago

The equation provides the locations of all aliases. One of them will be the "true" frequency. But the equation cannot tell which one is the true frequency. This is the key issue of aliasing and the price you have pay when you go beyond the Nyquist limit. You have to think whether frequencies beyond the Nyquist limit are allowed at all. If yes, your data are undersampled. The only way to really break the aliasing is to improve, i.e. increase (denser), the data sampling. (Irregular sampling can also help, as you have already realised. But in this case the Nyquist limit is also not well defined).

Back to your example: Assuming f0= 0.064 and f-1 = 0.936, the sampling frequency is Ω=1. So the sampling is Δt=1/Ω=1 and the Nyquist limit is 0.5Ω, so fNy=0.5 in your case. This should be also the default for the end frequency in the GLS codes, e.g.: https://github.com/mzechmeister/GLS/blob/04530e7c5bb893305beb0ee164db07510a274389/python/gls.py#L219