simonsobs / sotodlib

Simons Observatory: Time-Ordered Data processing library.
MIT License

Fill in missed samples #127

Closed: msilvafe closed this 2 years ago

msilvafe commented 3 years ago

Skip to the "Requested solution in sotodlib" section at the bottom if you don't care to see how we arrived at creating this issue.

Identifying the issue

Recently both @jeffiuliano and I saw these missing samples affecting some of our PSDs and skewing some of our analysis. I noticed this by finding that, above some nperseg value in my scipy.signal.welch estimate of the PSD (above 2^10, where the data length was 2^12), the baseline noise level would blow up: image (62)

So I then checked whether each of my 20 sec segments showed the same problem. Reading left to right and top to bottom, each plot is a different 20 sec timestream. The curves within each plot are different nperseg values chosen for scipy.signal.welch, and the color scale matches the legend in the above plot. Notice that only a few of the plots (namely the 1st, 7th, 8th, and 13th) show the above-mentioned behavior. image (61)

We then plotted np.diff(timestamps) for each of these 20 s segments and found that we only saw this issue in the segments where samples were missing (max dt_samp >= 10 ms, where typical is 5 ms; the max value is listed in each plot title): image (60)

@jeffiuliano also had some similarly strange results in his PSD analysis, and found that the issues went away when he reanalyzed the same data later. He had added to his default data-loading function a tool that masks and then interpolates to replace data points where the timestamp differences are more than something like 100 times the normal timestamp difference. As far as I know he didn't try to revert and reanalyze, but that seemed to be the key difference that fixed his issues.

Solution in pysmurf

This issue should mostly be fixed with the new pysmurf release, which addressed many dropped-frames issues (in particular see pysmurf PR #647). However, there are multiple ways that we could drop samples or frames, so we would like to keep open an option for correcting these sorts of data issues in sotodlib.

Requested solution in sotodlib

Ideally we would have a function which checks for points that satisfy np.diff(timestamps) > 5*np.std(np.diff(timestamps)) (or the number of standard deviations could be an argument). The missing points could then be interpolated and returned as a new or overwritten axis in your axis manager.

@jeffiuliano has implemented a function mask_time_error in the LATR_validation repo that masks and fills bad points with NaNs, and another function, interpolate_nans, that will interpolate over those NaNs. This is probably a good starting point for something to implement.
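A minimal sketch of such a check-and-fill function (illustrative names, not the actual LATR_validation implementation; the threshold here uses the median spacing plus n_std standard deviations, a slight variation on the test described above):

```python
import numpy as np

def flag_gaps(timestamps, n_std=5):
    """Flag sample boundaries whose spacing is anomalously large.

    Returns a boolean array of length N-1 marking diffs larger than
    the median spacing plus n_std standard deviations.
    """
    dt = np.diff(timestamps)
    return dt > np.median(dt) + n_std * np.std(dt)

def fill_gaps(timestamps, data):
    """Rebuild a regular time grid at the median spacing and linearly
    interpolate the data onto it, so missing samples get filled in."""
    dt = np.median(np.diff(timestamps))
    n = int(round((timestamps[-1] - timestamps[0]) / dt)) + 1
    t_reg = timestamps[0] + dt * np.arange(n)
    return t_reg, np.interp(t_reg, timestamps, data)
```

For example, on a 200 Hz timestream with one sample deleted, flag_gaps marks the one oversized diff and fill_gaps restores the original sample count.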

jeffiuliano commented 3 years ago

I'll work on this -- can you send the start/stop times for the (at least one of the weird) data above so I can test what I come up with?

jeffiuliano commented 3 years ago

I wonder if the "right" way to do this is to make use of the existing gapfill tools I see in tod_ops? Or, since this will mostly be filling in individual or a few missing points, is the linear interpolation that I was using better/simpler than the polynomial gap filling?

jeffiuliano commented 3 years ago

Another way to approach this is to treat it as something that is done to the data by default before the psd function takes the PSD, and just use the resample function that is being worked on (or some extension of it) to resample the data to a completely regular data rate. Though if that's done wrong, it sort of does a weird averaging that can reduce some of the highest-frequency information in the data.

msilvafe commented 3 years ago

@jeffiuliano these are from SAT1, so use this archive:

    arc = G3tSmurf("/mnt/so1/data/ucsd-sat1/timestreams",
                   meta_path="/mnt/so1/data/ucsd-sat1/smurf",
                   db_path='/home/jlashner/indexes/sat1_db_ext.db')

The ctimes for each step I saved to a pickle that you can read from my simons1 user directory:

    noise_ctimes = pkl.load(open('/home/msilvafe/analysis_notebooks/P6R1/1618816579_P6R1_Noise_vs_Phase_info.pkl', 'rb'))

This is a nested dictionary with keys: ['epa', 0.0, 0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6000000000000001, 0.7000000000000001, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0]

The keys that are floats follow phase_deltas = np.append(np.arange(0,1,0.1), np.arange(1,20,1)), and each of those keys contains a dictionary with keys: ['pa', 'stime', 'etime']

The stime and etime values are what you can use for loading each segment from the archive. The notebook I was using is also on simons1 at /home/msilvafe/analysis_notebooks/P6R1/20210419_Noise_vs_Phase.ipynb if you want to steal any other cells.

Apologies that it's a bit messy...

jeffiuliano commented 3 years ago

Thanks!

msilvafe commented 3 years ago

> Another way to approach this is to treat it as something that is by default done to data before the psd function takes the psd, and just use the resample function that is being worked on (or some extension of it) to resample the data to a completely regular data rate, though if that's done wrong it sort of does a weird averaging of things that can reduce some of the highest frequency information in the data.

I'm not sure we want to have it be a default feature. I feel like you can induce some weird errors if you end up having to interpolate too many points and do it wrong.

> I wonder if the "right" way to do this is to make use of the existing gapfill tools I see in tod_ops? Or, since this will mostly be filling in individual or a few missing points, is the linear interpolation that I was using better/simpler than the polynomial gap filling?

I like the idea of using existing tools. I don't really understand the merits of one vs. the other, though, so I can't really comment. Maybe we could just extend the existing tools with an optional argument that changes how the gap filling is done?

jeffiuliano commented 3 years ago

@kmharrington, should your recent changes (using load_file as the back end) have affected this? I can reproduce Max's plots when I use the old (i.e., the module on simons1) version of sotodlib, but when I use my up-to-date-with-master git version, I get something like this (for the 0.1 phase delta plot above): image

Almost like something is regularizing the timestamps now?

mhasself commented 3 years ago

Lurker here... couple of notes...

It may be dangerous to use a test based on stdev of the timestamp differences. Currently there is a fair bit of noise on the timestamps because they are "software grade". That gives you a nice fuzzy stdev to work with. But when we move to hardware timestamps, or if we change the loader code to compute timestamps from the "Frame counter", then the noise will become tiny, the deviations may not be gaussian, and 5*stdev (or whatever) might not be as unlikely as you expect (in ordinary, "good" data).

I encourage you to work with the "Frame counter", rather than the timestamp. (It's possible to use Frame Counter to correct the timestamp on load -- this has been suggested before but doesn't seem to be in place in the data you're looking at (P.S. -- it seems you've stumbled on this new feature now :) )).

Not all timestamp jumps of ~10 ms mean a frame has been dropped. You might have just piled one up (e.g. diff(timestamp) = [0.005, 0.005, 0.009, .001, .005 ...]). But diffs in Frame counter of > 1.5 * median almost certainly mean that some frames got dropped.
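As a standalone sketch (illustrative code, not a sotodlib API), that frame-counter test could look like:

```python
import numpy as np

def count_dropped_frames(frame_counter):
    """Estimate the number of dropped frames from the frame counter.

    Any counter diff greater than 1.5x the median step is treated as
    a gap; the number of missing frames in each gap is the excess
    step in units of the median step.
    """
    steps = np.diff(frame_counter)
    med = np.median(steps)
    gaps = steps[steps > 1.5 * med]
    return int(np.round((gaps / med - 1).sum()))
```

Unlike the timestamp-stdev test, this is insensitive to timestamp jitter, since the counter itself is integer-valued and regular.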

Also ... in case it's helpful here's some code from the ACT data merging library (which is in a private repo). It is used to solve this exact problem, identifying missing samples when all you have are a vector of timestamps. The top function is the most relevant one.

https://gist.github.com/mhasself/e2abc0749ac4552085d37734ef97d007

I encourage gap-filling using the gap-filling tools in sotodlib. You can tweak the settings so the "polynomial" is just a straight line. There are C++ extensions coming soon (just finishing testing) to make that ultra-fast.

As Max points out, long gaps are a huge problem. The PSD code should reject data that has long gaps (say, covering more than 1% of the segment time), entirely, rather than doing its best to patch things in all cases.

jeffiuliano commented 3 years ago

That's very helpful -- so are you saying that the timestamp correction from Frame Counter has been implemented? (And so that's what we're seeing)?

jeffiuliano commented 3 years ago

For clarity, this difference in how timestamps look when I was loading data was due to a bug, which is fixed in commit c0492b8f0d8d255e21b998cf96fb22cc834bf428

jeffiuliano commented 3 years ago

So, thinking more about this: almost all the tools we're using for PSDs assume that the underlying data are regularly sampled, so it's not just missing points that can mess things up, but any non-regularity in the timestamp sampling. Assuming the timestamps we see actually reflect how the underlying signal is sampled, we could try to correct for the variation. You wouldn't necessarily want to do this for everything (especially if there are large gaps), but doing something like this:

def regularize(times, data):
    # Needs: import numpy as np; from scipy.interpolate import interp1d
    # Resample onto a regular grid at the median sample spacing.
    dt = np.median(np.diff(times))
    t = np.arange(min(times), max(times), dt)
    interp = interp1d(times, data, fill_value='extrapolate')
    return t, interp(t)

before taking the PSD makes a big difference: left is the raw data, right is the regularized data, which no longer has white-noise levels dependent on nperseg. image

This deals not only with missing data, but also with weird sampling. Or is doing this sort of thing, even just to make the PSD work better, a bad idea?

mhasself commented 3 years ago

The underlying data are regularly sampled (or very close to it); it's the timestamps that are wrong. I think the improvement you see is due to the large gaps being filled. The small corrections from random variation in the timestamps are really not likely to make a big difference.

kmharrington commented 3 years ago

Sounds like this was all fixed by fixing the bugs in the new get_timestamps function? Are we good to close this issue?

jeffiuliano commented 3 years ago

No, I think we still need something like this; I just haven't gotten the right thing written yet. Even with the new timestamp loader there can still be missing samples in the data, which mess up PSDs.

kmharrington commented 3 years ago

As a reference (since I had the data loaded for something else). Here's a diff of the current timestamp functions on two straight hours of data. image

jeffiuliano commented 3 years ago

So I haven't forgotten this. I actually think there might be something off about the data that @msilvafe was looking at, beyond a missing data point. For example, looking at the phase delta = 0.0 data: one good way to look for missing data (instead of just some wobble in timestamp separation) is to look at the difference between the timestamps and a line whose slope is the median timestamp separation [timestamps - np.arange(len(timestamps))*np.median(np.diff(timestamps))].

image

You can clearly see the missing timestamp, and then that it's fixed (blue -> orange). But even after fixing this, the PSD is not doing what I would expect. Before/After: image

But I've been looking at the data and haven't been able to figure out what else is wrong with it. For now I'm going to move ahead with just a missing-sample fixer, as it addresses at least part of the problem.
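For reference, the residual-against-a-line diagnostic described above can be sketched as (illustrative names):

```python
import numpy as np

def timestamp_residual(timestamps):
    """Residual of the timestamps against a perfectly regular grid.

    A missing sample shows up as a step discontinuity of about one
    sample period; timestamp jitter shows up as small wiggles.
    """
    dt = np.median(np.diff(timestamps))
    model = timestamps[0] + np.arange(len(timestamps)) * dt
    return timestamps - model
```

On a 200 Hz timestream with one sample deleted, the residual is ~0 before the gap and jumps to ~5 ms after it.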

msilvafe commented 3 years ago

Hmm, interesting @jeffiuliano. I thought that in the above comment (link) you showed you had fixed it in the same dataset that here still shows issues in the PSD white-noise levels. What is different between the gap filling you did there vs. here?

jeffiuliano commented 3 years ago

Yeah, that's a good point. The "fix" that made the PSD look good was when I completely resampled at a regular rate using that regularize function I pasted above. In the more recent post, I just added a single point where there was an apparent gap (and interpolated a value for that one point).

It's an interesting observation that the regularize code seems to fix the issue, while filling in this apparently missing sample doesn't. I wonder what else is being fixed when I do the more involved resampling? Looking at the data directly, it actually looks pretty good. Top/bottom are before/after replacing the missing point. It doesn't look like there are large irregularities in the sampling that the regularize function would be fixing.

image

msilvafe commented 3 years ago

I don't really understand exactly what I'm looking at here. Can you explain what the x and y axes and the color scale are?

jeffiuliano commented 3 years ago

The axes are just data vs. time; the color scale is my estimated timestamp error (where you can see the jump in the blue vs. orange line a couple of plots up).

jeffiuliano commented 3 years ago

It looks like a horizontal wave because of the drift in phase between the sampling rate and the sine you were driving

jeffiuliano commented 3 years ago

Which is actually super useful, as it makes weird samples stick out

msilvafe commented 3 years ago

Gotcha. I do still see a few samples that don't match the color scale of the rest:

Untitled drawing.pdf

Are the two color scales the same? Maybe it's those 3 or so points that are still off, plus this color gradient that goes from yellow-green to purple in the bottom plot, I guess because of the error in the linear fit?

jeffiuliano commented 3 years ago

@msilvafe, sorry for not getting back to you faster. No, the two colorbars are not the same. Yeah, I noticed those ones with funny-looking timestamps, but if you zoom in on them, the data seem perfectly lined up. So it seems like some jitter in the timestamping rather than anything in the actual data.

image image

The slowly drifting error I'm assuming comes from the difference between the median timestamp diff and the actual data rate (where the median is being biased somehow)

msilvafe commented 3 years ago

Ok, so from that I think I agree with you that those couple of points are just jitter and are corrected just fine. But should we be concerned about this bias between the median timestamp diff and the actual data rate? Is there some better metric we can use that excludes outliers or tails in the timestamp-diff distribution?

jeffiuliano commented 3 years ago

Yeah, I think so. I'm looking at binning sufficiently finely now and using the mode (though I would have thought the median would be close to that if it's a sufficiently peaked distribution)

jeffiuliano commented 3 years ago

Interesting histogram: image

jeffiuliano commented 3 years ago

So if I just assume the proper spacing is 0.0050 (instead of the median of 0.0050001), the drift goes away:

image

msilvafe commented 3 years ago

Ok... interesting. So can you get that 0.005 number from the mode, or did you just choose it by eye? Is it reasonable to take the mode and then round to some decimal place?

jeffiuliano commented 3 years ago

I just picked it by eye, since the sample rate we choose is going to be a round number (e.g. 200 Hz), which is something we probably knew already

jeffiuliano commented 3 years ago

Taking the median and then rounding might be sufficient/simple if we want to do something other than hard-code a known sample rate

msilvafe commented 3 years ago

Maybe the best bet is actually just grabbing the sample rate from the info in the status frames instead of calculating it from the distribution of timestamp diffs

kmharrington commented 3 years ago

Is this still an issue we need to have open here? Or are we in a good enough place to just wait until we actually get the SMuRF timing systems?

mhasself commented 2 years ago

Hi, let's close this (or re-formulate with a specific request). Interesting discussion, but.

msilvafe commented 2 years ago

Yes, I think we can close it. If anything, the specific request would be to have g3tsmurf load a field of timestamps derived from the frame counter, like:

    fsamp, offset = np.polyfit(am.primary["FrameCounter"], am.timestamps, 1)
    timestamps_corrected = offset + fsamp * am.primary["FrameCounter"]
    am.wrap("timestamps_corrected", timestamps_corrected, [(0, "samps")])
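A standalone sketch of that linearization on synthetic data (no AxisManager; the frame counter, jitter level, and ~200 Hz rate are made up for illustration):

```python
import numpy as np

# Synthetic frame counter and jittery software-grade timestamps
# at a nominal 200 Hz (0.005 s spacing).
rng = np.random.default_rng(0)
frame_counter = np.arange(10_000)
timestamps = (1_618_816_579.0 + frame_counter * 0.005
              + rng.normal(0, 2e-4, frame_counter.size))

# Fit timestamps as a linear function of the frame counter, then
# evaluate the fit to get de-jittered ("linearized") timestamps.
fsamp, offset = np.polyfit(frame_counter, timestamps, 1)
timestamps_linearized = offset + fsamp * frame_counter
```

The fitted slope recovers the true sample spacing, and the diffs of the linearized timestamps are essentially constant, which is exactly what PSD estimators assume.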

jeffiuliano commented 2 years ago

I'm fine with closing this. I haven't seen any obvious issues related to this in our data since this discussion, and if the actual timing system will change how we do timestamps anyway, then I think we're fine.

mhasself commented 2 years ago

Ok. I think Max's suggestion is a good one (though I'd call it "timestamps_linearized"). Please create that issue if you Smurf fans think it would be useful.