Closed: msilvafe closed this issue 2 years ago
I'll work on this -- can you send the start/stop times for (at least one of) the weird datasets above so I can test what I come up with?
I wonder if the "right" way to do this is to make use of the existing gapfill tools I see in tod_ops? Or, since this will mostly be filling in individual or a few missing points, is the linear interpolation that I was using better/simpler than the polynomial gap filling?
Another way to approach this is to treat it as something that is by default done to data before the psd function takes the psd, and just use the resample function that is being worked on (or some extension of it) to resample the data to a completely regular data rate, though if that's done wrong it sort of does a weird averaging of things that can reduce some of the highest frequency information in the data.
@jeffiuliano these are from SAT1 so use this archive:
```python
arc = G3tSmurf("/mnt/so1/data/ucsd-sat1/timestreams",
               meta_path="/mnt/so1/data/ucsd-sat1/smurf",
               db_path='/home/jlashner/indexes/sat1_db_ext.db')
```
The ctimes for each step I saved to a pickle that you can read from my simons1 user directory:
```python
noise_ctimes = pkl.load(open('/home/msilvafe/analysis_notebooks/P6R1/1618816579_P6R1_Noise_vs_Phase_info.pkl', 'rb'))
```
This is a nested dictionary with keys:
['epa', 0.0, 0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6000000000000001, 0.7000000000000001, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0]
The keys that are floats follow `phase_deltas = np.append(np.arange(0,1,0.1), np.arange(1,20,1))`
and each of those keys contains a dictionary with keys: ['pa', 'stime', 'etime']
The stime and etime values are what you can use for loading each segment from the archive. The notebook I was using is also here: /home/msilvafe/analysis_notebooks/P6R1/20210419_Noise_vs_Phase.ipynb
on simons1 if you want to steal any other cells.
Apologies that it's a bit messy...
Thanks!
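For anyone grabbing this later, the nested structure described above can be sketched like this (the pickle only exists on simons1, so a mock dictionary with the same layout, and made-up stime/etime values, stands in here):

```python
import numpy as np
# import pickle as pkl  # needed for the real file on simons1

# The float keys follow this sequence
phase_deltas = np.append(np.arange(0, 1, 0.1), np.arange(1, 20, 1))

# On simons1 you would load the real dictionary:
# noise_ctimes = pkl.load(open('/home/msilvafe/analysis_notebooks/P6R1/'
#                              '1618816579_P6R1_Noise_vs_Phase_info.pkl', 'rb'))
# Mock with the same nested layout (values here are placeholders):
noise_ctimes = {pd: {'pa': None, 'stime': 1618816579, 'etime': 1618816599}
                for pd in phase_deltas}

# Each segment's load window for the archive:
for pd in phase_deltas:
    start, stop = noise_ctimes[pd]['stime'], noise_ctimes[pd]['etime']
```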
> Another way to approach this is to treat it as something that is by default done to data before the psd function takes the psd, and just use the resample function that is being worked on (or some extension of it) to resample the data to a completely regular data rate, though if that's done wrong it sort of does a weird averaging of things that can reduce some of the highest frequency information in the data.
I'm not sure we want to have it be a default feature. I feel like you can induce some weird errors if you end up having to interpolate too many points and do it wrong.
> I wonder if the "right" way to do this is to make use of the existing gapfill tools I see in tod_ops? Or, since this will mostly be filling in individual or a few missing points, is the linear interpolation that I was using better/simpler than the polynomial gap filling?
I like the idea of using existing tools. I don't really understand the merits of one vs the other though so can't really comment. Maybe we could just extend the existing tools to have an optional argument that changes how the gap filling is done?
@kmharrington, should your recent changes (using load_file as the back end) have affected this? Because I can reproduce Max's plots when I use the old (i.e. module on simons1) version of sotodlib, but when I use my up-to-date-with-master git version, I get something like this (for the 0.1 Phase Delta plot above).
Almost like something is regularizing the timestamps now?
Lurker here... couple of notes...
It may be dangerous to use a test based on stdev of the timestamp differences. Currently there is a fair bit of noise on the timestamps because they are "software grade". That gives you a nice fuzzy stdev to work with. But when we move to hardware timestamps, or if we change the loader code to compute timestamps from the "Frame counter", then the noise will become tiny, the deviations may not be gaussian, and 5*stdev (or whatever) might not be as unlikely as you expect (in ordinary, "good" data).
I encourage you to work with the "Frame counter", rather than the timestamp. (It's possible to use Frame Counter to correct the timestamp on load -- this has been suggested before but doesn't seem to be in place in the data you're looking at (P.S. -- it seems you've stumbled on this new feature now :) )).
Not all timestamp jumps of ~10 ms mean a frame has been dropped. You might have just piled one up (e.g. diff(timestamp) = [0.005, 0.005, 0.009, .001, .005 ...]). But diffs in Frame counter of > 1.5 * median almost certainly mean that some frames got dropped.
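A minimal version of that frame-counter test might look like this (`dropped_frame_locations` is a name made up for illustration, not an existing sotodlib function):

```python
import numpy as np

def dropped_frame_locations(frame_counter, thresh=1.5):
    # Counter diffs well above the median step are unambiguous drops,
    # unlike jittery timestamp diffs.
    d = np.diff(frame_counter)
    return np.where(d > thresh * np.median(d))[0]
```

For example, `dropped_frame_locations(np.array([0, 1, 2, 3, 6, 7, 8]))` flags the gap between the 4th and 5th entries.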
Also ... in case it's helpful here's some code from the ACT data merging library (which is in a private repo). It is used to solve this exact problem, identifying missing samples when all you have are a vector of timestamps. The top function is the most relevant one.
https://gist.github.com/mhasself/e2abc0749ac4552085d37734ef97d007
I encourage gap-filling using the gap-filling tools in sotodlib. You can tweak the settings so the "polynomial" is just a straight line. There are C++ extensions coming soon (just finishing testing) to make that ultra-fast.
As Max points out, long gaps are a huge problem. The PSD code should reject data that has long gaps (say, covering more than 1% of the segment time), entirely, rather than doing its best to patch things in all cases.
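One hedged way to implement that rejection (using 1.5x the nominal spacing as the gap threshold, which is my assumption rather than a sotodlib convention):

```python
import numpy as np

def long_gap_fraction(timestamps):
    # Fraction of the segment's duration covered by gaps,
    # i.e. time in excess of the nominal sample spacing.
    dt = np.diff(timestamps)
    nominal = np.median(dt)
    excess = dt[dt > 1.5 * nominal] - nominal
    return excess.sum() / (timestamps[-1] - timestamps[0])

# Reject the segment outright if, say, long_gap_fraction(ts) > 0.01
```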
That's very helpful -- so are you saying that the timestamp correction from Frame Counter has been implemented? (And so that's what we're seeing)?
For clarity, this difference in how timestamps look when I was loading data was due to a bug, which is fixed in commit c0492b8f0d8d255e21b998cf96fb22cc834bf428
So, thinking more about this: almost all the tools we're using for PSDs assume that the underlying data are regularly sampled, so it's not just missing points that can mess things up, but any non-regularity in the sampling. Assuming the timestamps we see actually reflect how the underlying signal is sampled, we could try to correct for the variation. You wouldn't necessarily want to do this for everything (especially if there are large gaps), but doing something like this:
```python
import numpy as np
from scipy.interpolate import interp1d

def regularize(times, data):
    # Resample onto an evenly spaced grid with the median sample spacing.
    DT = np.median(np.diff(times))
    t = np.arange(min(times), max(times), DT)
    interp = interp1d(times, data, fill_value='extrapolate')
    return t, interp(t)
```
before taking the psd makes a big difference: left is the raw data, right is the regularized data, which no longer has white noise levels dependent on nperseg.
This deals not only with missing data, but also weird sampling. Or is doing this sort of thing even just to make the psd work better a bad idea?
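As a self-contained sanity check of the regularize idea (synthetic data only; `np.interp` stands in for scipy's `interp1d`):

```python
import numpy as np

# Synthetic 200 Hz sine with one dropped sample (a 10 ms gap)
dt = 0.005
t = np.delete(np.arange(0.0, 20.0, dt), 1000)
x = np.sin(2 * np.pi * 8.0 * t)

# Same idea as regularize(): resample onto an even grid
t_reg = np.arange(t.min(), t.max(), np.median(np.diff(t)))
x_reg = np.interp(t_reg, t, x)
```

After this, `np.diff(t_reg)` is uniform, so welch-style PSD estimators see the regular sampling they assume.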
The underlying data are regularly sampled (or very close to it); it's the timestamps that are wrong. I think the improvement you see is due to the large gaps being filled. The small corrections from random variation in the timestamps is really not likely to make a big difference.
Sounds like this was all fixed by fixing the bugs in the new `get_timestamps` function? Are we good to close this issue?
No, I think we still need something like this; I just haven't gotten the right thing written yet. Even with the new timestamp loader there can still be missing samples in the data, which mess up PSDs.
As a reference (since I had the data loaded for something else). Here's a diff of the current timestamp functions on two straight hours of data.
So I haven't forgotten this. I actually think there might be something off about the data that @msilvafe was looking at, beyond a missing data point. For example, looking at the phase delta = 0.0 data: One good way to look for missing data (instead of just some wobble in timestamp separation) is to look at the difference between the timestamps and a line with slope of the average timestamp separation [ timestamps - arange(len(timestamps))*median(diff(timestamps)) ].
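That diagnostic can be written as a small helper (a sketch; `timestamp_drift` is a name introduced here for illustration):

```python
import numpy as np

def timestamp_drift(timestamps):
    # Residual against a perfectly regular grid: a dropped sample shows
    # up as a persistent step of one sample period, not just a wobble.
    dt = np.median(np.diff(timestamps))
    return timestamps - timestamps[0] - np.arange(len(timestamps)) * dt
```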
You can clearly see the missing timestamp, and then that it's fixed (blue -> orange). But even fixing this, the PSD is not doing what I would expect. Before/After:
But I've been looking at the data and haven't been able to figure out what else is wrong with it. For now I'm going to move ahead with just a missing-sample fixer, as it addresses at least part of the problem.
Hmm, interesting @jeffiuliano. I thought that in the above comment (link) you showed you had fixed it in the same dataset that here still shows issues in the white-noise levels of the PSD. What is different between the gap filling you did there vs. here?
Yeah, that's a good point. The "fix" that made the PSD look good was when I completely resampled it at a regular rate using that "regularize" function I pasted above. In the above post, I just added a single point where there was an apparent gap (and interpolated a value for that one point).
It's an interesting observation that the "regularize" code seems to fix the issue, while filling in this apparently missing sample doesn't. I wonder what else is being fixed when I do the more involved resampling? Looking at the data directly, it actually looks pretty good. Top/Bottom being before/after replacing the missing point. It doesn't look like there are large irregularities in the sampling that the regularize function would be fixing.
I don't really understand exactly what I'm looking at here. Can you explain what the x and y axes and colorscale are?
The axes are just data vs. time; the color scale is my estimated timestamp error (where you can see the jump in the blue vs. orange line a couple plots up).
It looks like a horizontal wave because of the drift in phase between the sampling rate and the sine you were driving.
Which is actually super useful as it makes weird samples stick out
Gotcha. I do still see a few samples that don't match the color scale of the rest:
Are the two colorscales the same? Maybe it's those 3 or so points that are still off, plus this color gradient that goes from yellow-green to purple in the bottom plot, I guess because of the error in the linear fit?
@msilvafe, sorry for not getting back to you faster. No, the two colorbars are not the same. Yeah, I noticed those ones with funny-looking timestamps, but if you zoom in on them, the data seem perfectly lined up. So it seems like some jitter in the timestamping rather than anything in the actual data.
The slowly drifting error, I'm assuming, comes from the difference between the median timestamp diff and the actual data rate (where the median is being biased somehow).
Ok, so from that I think I agree with you that those couple of points are just jitter and are corrected just fine. But should we be concerned about this bias between the median timestamp diff and the actual data rate? Is there some better metric we can use that excludes outliers or tails in the timestamp-diff distribution?
Yeah, I think so. I'm looking at binning sufficiently finely now and using the mode (though I would have thought the median would be close to that if it's a sufficiently peaked distribution).
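A hypothetical histogram-based mode estimate of the sample spacing might look like this (the bin count is an arbitrary choice here):

```python
import numpy as np

def mode_dt(timestamps, bins=2000):
    # Estimate the sample spacing as the mode of the diff distribution,
    # which is less sensitive to outliers/tails than the median.
    d = np.diff(timestamps)
    hist, edges = np.histogram(d, bins=bins)
    i = np.argmax(hist)
    return 0.5 * (edges[i] + edges[i + 1])
```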
Interesting histogram:
So if I just assume the proper spacing is 0.0050 (instead of the median of 0.0050001), the drift goes away:
Ok... interesting. So can you get that 0.005 number from the mode, or did you just choose it by eye? Is it reasonable to take the mode and then round to some decimal place?
I just picked it by eye, since the sample rate we choose is going to be a round number (e.g. 200 Hz), which is something we probably knew already.
Median and then rounding might be sufficient/simple if we want to do something other than hard-code a known sample rate.
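That could be as simple as the following (a hypothetical helper; `decimals=4` assumes the true period is a round number near 0.005 s):

```python
import numpy as np

def rounded_dt(timestamps, decimals=4):
    # Median diff, rounded on the assumption that the true sample
    # period is a round number (e.g. 0.005 s for 200 Hz).
    return float(np.round(np.median(np.diff(timestamps)), decimals))
```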
Maybe the best bet is actually just grabbing the sample rate from the info in the status frames, instead of calculating it from the distribution of timestamp diffs.
Is this still an issue we need to have open here? Or are we in a good enough place to just wait until we actually get the SMuRF timing systems?
Hi, let's close this (or re-formulate with a specific request). Interesting discussion, but.
Yes, I think we can close it. If anything, the specific request would be to have g3tsmurf load a field of timestamps derived from the frame counter, like:
```python
fsamp, offset = np.polyfit(am.primary["FrameCounter"], am.timestamps, 1)
timestamps_corrected = offset + fsamp * am.primary["FrameCounter"]
am.wrap("timestamps_corrected", timestamps_corrected, [(0, "samps")])
```
I'm fine with closing this. I haven't seen any obvious issues related to this in our data since this discussion, and if the actual timing system will change how we do timestamps anyway, then I think we're fine.
Ok. I think Max's suggestion is a good one (though I'd call it "timestamps_linearized"). Please create that issue if you Smurf fans think it would be useful.
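For reference, the frame-counter linearization sketched above behaves like this on synthetic data (the numbers are made up: 200 Hz nominal rate, 0.2 ms software-grade jitter, times relative to segment start to keep the fit well-conditioned):

```python
import numpy as np

rng = np.random.default_rng(0)
frames = np.arange(1000)
true_t = 0.005 * frames                                  # ideal sample times
timestamps = true_t + rng.normal(0, 2e-4, frames.size)   # jittery "software" stamps

# Fit timestamps against the frame counter, then rebuild them from the fit
fsamp, offset = np.polyfit(frames, timestamps, 1)
timestamps_linearized = offset + fsamp * frames
```

The linearized timestamps average away the per-sample jitter, which is the point of deriving them from the frame counter.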
Skip to the "Requested solution in sotodlib" section at the bottom if you don't care to see how we arrived at creating this issue.
Identifying the Issue
Recently both @jeffiuliano and I saw these missing samples affecting some of our PSDs and skewing some of our analysis. I noticed this by finding that above some `nperseg` value in my `scipy.signal.welch` estimation of the PSD, the baseline level would blow up (above 2^10, where the data length was 2^12):

So I then checked whether each of my 20 sec segments showed the same problem. Each of the plots, reading left to right and top to bottom, is a different 20 sec timestream. The curves within each plot are different `nperseg` values chosen for the `scipy.signal.welch` function, and the colorscale matches the legend in the above plot. Notice that only a few of the plots (namely the 1st, 7th, 8th, and 13th) show the above-mentioned behavior.

We then plotted `np.diff(timestamps)` for each of these 20 s segments and found that only in the segments where samples were missing (max dt_samp >= 10 ms, typical is 5 ms; max value listed in each plot title) did we see this issue:

@jeffiuliano also had some similarly strange results in some of his data analysis where he was looking at PSDs, and found that the issues went away when he reanalyzed the same data later. He had added to his default data loading function a tool that masks and then interpolates to replace data points where the timestamp differences are more than something like 100 times the normal timestamp difference. He didn't try to revert and reanalyze as far as I know, but that seemed to be the key difference that fixed his issues.
Solution in `pysmurf`

This issue should mostly be fixed with the new pysmurf release, which addressed many dropped-frame issues (in particular see pysmurf PR #647). There are, however, multiple ways that we could drop samples or frames, so we would like to keep an option open for correcting these sorts of data issues in `sotodlib`.
Requested solution in `sotodlib`

Ideally we would have a function which checks for points that satisfy `np.diff(timestamps) > 5*np.std(np.diff(timestamps))` (or the number of standard deviations could be an argument). The missing points could then be interpolated and returned as a new or overwritten axis in your axis manager.

@jeffiuliano has implemented a function `mask_time_error` in the LATR_validation repo that masks and fills bad points with NaNs, and another function, `interpolate_nans`, that will interpolate those NaNs. This is probably a good starting point for something to implement.
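A sketch of such a function (names are hypothetical, not the actual LATR_validation API; note the gap test here compares each diff against the nominal spacing plus N standard deviations, since the raw `5*np.std(...)` threshold alone would flag every sample once timestamp jitter becomes very small):

```python
import numpy as np

def fill_missing_samples(timestamps, data, nstd=5.0):
    # Flag diffs that exceed the nominal spacing by nstd standard deviations,
    # insert the missing timestamps on the nominal grid, and linearly
    # interpolate data values at the inserted points.
    dt = np.diff(timestamps)
    nominal = np.median(dt)
    gaps = np.where(dt > nominal + nstd * np.std(dt))[0]
    new_t = timestamps.astype(float).copy()
    for i in gaps[::-1]:  # insert back-to-front so earlier indices stay valid
        n_missing = int(round(dt[i] / nominal)) - 1
        fill_t = timestamps[i] + nominal * np.arange(1, n_missing + 1)
        new_t = np.insert(new_t, i + 1, fill_t)
    new_d = np.interp(new_t, timestamps, data)
    return new_t, new_d
```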