werefkin commented 3 years ago

Hi Nikolay!

I have found that your Allan-Variance algorithm returns quite different output than another python implementation. So with a close look into the code, I have found that if the line:

X = np.cumsum(x, axis=0) where x is the input array,

is replaced with X = x, i.e. not cumulative sum, then I receive equivalent results.

I think I do not completely understand the reasoning for the cumsum in the context of e.g. sensor readings and to analyse the stability of e.g. power in the power socket. Could you please clarify -- I will be very grateful?

Best, Ivan

werefkin commented 3 years ago

I would like to supplement my question by providing a minimal working example, let's say the data (i.e. measurement of the concentration of anything by a device that exhibits noises and drifts) looks like that: Calculated variances: And if the cumsum is removed

nmayorov commented 3 years ago

Hi, Ivan!

I believe the two implementations use difference conventions.

Here I found a concise definition of Allan variance I used in my implementation:

So I assume that my function accepts time series of "Omega" or integrals of "Omega" over dt time intervals (determined by input_type argument). To compute required cluster averages I computed what is called "theta" in the extract and X in the code.

From the looks of it, the other implementation assumes that "theta" is given as the input argument. In more general terms - it accepts cumulative integral of the quantity you are looking to analyze, e.g. clock phase for clock frequency stability analysis.

If you give me details about your data I can probably advice you on how you should compute the Allan variance.

werefkin commented 3 years ago

Привет Nikolay,

thank you for the detailed reply! Although it would be cool to see the next page as well (already found in Analysis and Modeling of Inertial Sensors Using Allan Variance - Naser El-Sheimy, Haiying Hou, and Xiaoji Niu) since I totally agree with the listed formulas (3-6), however, I cannot really understand the meaning of (7) (maybe due to the different field of science), i.e. the part saying "in terms of output angle and velocity". Is it something specific for inertial sensor and gyroscopes?

Looking at the next page I would say that my data-type is already Theta (see the details below)

On the weekend I have written my own implementation, which is in agreement with eqs. 3-6, although it is not an overlapping version. So I split my data set into subsets of the different length (max(length) = N/2) and calculate the variance between these subsets (i.e. between their means). The data type is simply an average output power of a laser (i.e. light intensity), so the power fluctuates around some mean for some reasons and there are daily drifts as well as short random drifts to analyse. I believe that the cumulative sum has not a lot of meaning here, but if you have an idea just let me know.

I have written my code based on the publication of Werle who first applied Allan Variance to spectroscopy (https://www.researchgate.net/publication/227022298_The_Limits_of_Signal_Averaging_in_Atmospheric_Trace-Gas_Monitoring_by_Tunable_Diode-Laser_Absorption_Spectroscopy_TDLAS).

Here are some parts from this publication: Intro to AVAR 2 3

and if I apply my code or your code for 'x' not 'X' I obtain exactly the same results as Werle for his simulated data. I really believe that Allan Variance is a kind of universal tool, and from this point of view I am not sure about the cumsum that seems to be field-specific, I can also be wrong and you may change my mind :-) Looking forward to your comments.

PS another implementation (allantools https://github.com/aewallin/allantools) has an argument to switch between x and X (phase of frequency data type), which I also recently found.

Best, Ivan

nmayorov commented 3 years ago

"in terms of output angle and velocity" means that integrated angular rate and acceleration give angle and velocity respectively.
In the implementation X ("theta") is used to speed up computation of required cluster averages. It implements the same "sum of squared differences of consecutive cluster averages" logic. Combining formulas (6), (8) and (9) you will come up with the formula (10). I suggest that you pay closer attention to that, it's a matter of simple algebra. It is not domain specific, it is just the way to compute Allan variance for "angular rate" or "frequency" input type.
Formula 8 in the paper you mentioned implements the same logic as in my function. It might be some inconsistency in the explanation and the data you mentioned.

Simple example: if the input sequence x is white noise with standard deviation of sigma, then Allan variance has the form sigma**2 / tau. And this is a well known fact about Allan variance of white noise acknowledged in the paper you mentioned:

Basically if the data looks like white noise (visually uniform band without long-term effects) its Allan variance should be ~1/tau. If the data on the plot you've posted is to be interpreted as rate/frequency data, then its Allan variance certainly should not have decaying form like ~1/tau.

So the blue line here makes sense - there are some long-term instabilities in the data (if interpreted as rate/frequency):

As you said, the allantools package accepts 2 type inputs. So the "frequency" input is one of the valid options and it is used here, I can add "phase" option as well. You just need to figure out how to interpret your data - frequency/rate or phase/angle.

Overall I still don't understand the source of your confusion.

Can you upload your data and/or "his simulated data"?

werefkin commented 3 years ago

Hi! OK, I see the reasoning for cumsum and totally agree now, so it makes it more elegant to calculate Y (clusters averages) out of the dataset (than in loops), the point of confusion is resolved. Actually, the confusion arised due to the fact that the algorithms use different default arguments, see a simple example below just to illustrate the point. And please excuse me as when I asked the question I was not sure 1. which data type is "correct" and 2. I did not compare the output for my check-code versus yours again (at this point I was sure how it should look like)

If we define the data as follows (as in the abovementioned paper): N=600 a=12 data=np.zeros(N) for i in range(0,len(data)): data[i]=a-5e-5*i+np.random.normal(0, 0.01)

So white noise and some drift (to have a minimum of Allan plot). the data set would look like that

Then I calculate a standard i.e. non-overlapping AVAR (as defined in the above-mentioned paper): ` n=len(data) dt=1

clusters=np.unique(np.arange(0,int(np.round(n/2)))+1).astype(np.int64)

sig2=[]

for clus in clusters:

print(clus)

M=int(np.floor(n/clus))
As=[]
As_one=[]
diff=[]
for idx in range(1,M):
    # print(idx)
    as0=sum(data[(idx-1)*clus:(idx)*clus])/clus
    as1=sum(data[(idx)*clus:(idx+1)*clus])/clus
    As.append(as0) # just for me
    As_one.append(as1) #just for me

    diff.append((as1-as0))

sig2.append(np.mean(np.array(diff)**2)/2)`

and using your algorithm taus, avar = allan_variance(data,min_cluster_size=1, min_cluster_count=1)

in this case, we would have in principle the same output

but using standard AllanTools variance we will end up with a completely different result: which is clear, as the input data-type is different (should either be cumsumed before or switched to freq).

PS Some comments on this can be probably added in the function description to make it more clear, although I believe now that the standard data type for the input array that you use is more correct, it will be right to stay with it (or simple if statement for datatype can be added), but it is just suggestions.

Best regards and thank you for the discussions, Ivan

nmayorov commented 3 years ago

Bases on our discussion I've added input_type='integral'.

Also changed the default to input_type='mean', because this is what most often measured in practice. In fact filtered and sampled input signal is usually used, but it is not trivial to explain how filtering affects the underlying signal and what we measure in this case. So I've kept the name 'mean', because in this case everything is theoretically sound.

Thanks for the discussion.

nmayorov / allan-variance

Question about function arguments -'X' vs 'x' #3

print(clus)