neurodsp-tools / neurodsp

Digital signal processing for neural time series.
https://neurodsp-tools.github.io/
Apache License 2.0
285 stars 62 forks source link

[BUG] sim_oscillation bug in power spectrum #223

Open ryanhammonds opened 4 years ago

ryanhammonds commented 4 years ago

I was simulating sine waves and computing spectra:

from neurodsp.spectral import compute_spectrum
from neurodsp.plts import plot_time_series, plot_power_spectra
from neurodsp.utils import create_times
from neurodsp.sim import sim_oscillation

n_seconds = 10
fs = 1000

times = create_times(n_seconds, fs)
sig_8hz = sim_oscillation(n_seconds, fs, 8)
sig_9hz = sim_oscillation(n_seconds, fs, 9)

freqs_8hz, spec_8hz = compute_spectrum(sig_8hz, fs)
freqs_9hz, spec_9hz = compute_spectrum(sig_9hz, fs)

plot_power_spectra(freqs_8hz, spec_8hz, title='8hz')
plot_power_spectra(freqs_9hz, spec_9hz, title='9hz')

And noticed strange spectra for odd freq waves: spec

@rdgao Tracked this down to this line cycle = sim_cycle(n_seconds_cycle, fs, cycle, **cycle_params) in sim_oscillation and an interaction with sampling rate. When fs is increased to a large value, the odd spectra are corrected.

rdgao commented 4 years ago

yup, and just adding a comment:

I suspect this is because when an oscillation with a frequency that doesn't evenly divide the sampling rate (e.g., 9Hz and fs=1000Hz), our current approach creates non-perfect cycles of oscillations. They are either a little more than a cycle or a little less, and when you tile them to make the full signal, even if the signal length supports an integer number of cycles, as is the case here, you don't end up with the correct signal (the time series plots @ryanhammonds made were pretty diagnostic).

Another way to say this: even though you can get 9 complete cycles of sine wave in 1 second sampled at 1000Hz, 1000 doesn't divide into 9, so depending on if you simulate all 9 cycles of 1 second (which will end at the correct place), or 1 cycle then tile it 9 times, the final signal will look different.

I suggested trying fs = 7200 which fixed the problem, because it's an integer multiple of both 8 and 9, so each individual cycle can be completed in an integer number of samples. I don't think this will work with arbitrary large sampling rates, in general.

rdgao commented 4 years ago

I think this is related to #189

TomDonoghue commented 4 years ago

I can answer better sometime later - but I think this isn't a bug per se, just, as Richard points out, an estimation thing - not all combos of frequency / fs / time length work out evenly. When they don't, you things like the shape of PSD seen above for 9 hz. This isn't particular to concatenation - happens for other approaches for simulating continuous sine waves too. I have a notebook poking at this somewhere. Exactly when it happens can vary a little between sim approach (depending how thing line up, because there can be small differences based on simulation per cycle, or whole signal together), but overall this is just a thing that happens with creating arbitrary discrete signals - and I don't think there's much to do about it...?

rdgao commented 4 years ago

but this doesn't happen when you just call np.sin() to make the signal for the duration of the time period, as long as it's an integer number of cycles, right?

if a cycle is clipped at the edge, it will give you something weird through leakage, but it shouldn't happen otherwise

ryanhammonds commented 4 years ago

@rdgao Your explanation clicked after simulating with np.sin. Here is a quick example that shows what's happening:

import numpy as np
from neurodsp.sim import sim_oscillation
from neurodsp.utils.decorators import normalize
from neurodsp.utils import create_times
from neurodsp.plts import plot_time_series, plot_power_spectra
from neurodsp.spectral import compute_spectrum

fs = 1000
freq = 9
n_seconds = 10
times = create_times(n_seconds, fs)

@normalize
def sim_oscillation_np_sin(n_seconds, fs, freq):
    sig = np.sin(2 * np.pi * freq * np.arange(n_seconds * fs) / fs)
    return sig

sig_new = sim_oscillation_np_sin(n_seconds, fs, freq)
sig_old = sim_oscillation(n_seconds, fs, freq)

plot_time_series(times, [sig_new, sig_old], labels=['np.sin sig', 'np.tile sig'], xlim=(0, 1))
freqs, powers = compute_spectrum(sig_new, fs)
plot_power_spectra(freqs, powers)

fix_ts

The spectrum is fixed when using np.sin:

fix_spec

I'll make a PR soon with a fix.

TomDonoghue commented 4 years ago

There is a small frequency difference between what gets simulated when using tile vs. non-tile (which looks like the phase shift / drift in the time series above). In this case, the non-tile is "more correct", but it's not necessarily a categorical error, but rather an estimation imprecision with tiling.

Both can end up with the seemingly odd spectra depending on the length of time simulated. For example, if you sim the non-tiled signal above with something like 10.1 seconds, I think you'll see basically the same slightly funky spectrum.

It's not clear if / how we want to update anything w.r.t. this, because tiling is how we do both different cycles, and burstiness - and so it's not clear what to change (without a big overhaul). I think adding an option to do non-tiled sinusoids might make sense. But this probably doesn't want to replace everything else we've got that does use tiling. We can end up with those skewed spectra no matter what, but given the magnitudes of power outside the simulated frequency, it probably doesn't really matter (?).

It has been on my mind / list of ToDos to at least annotate somewhere some of the limitations of the tiling approach (the slight frequency drift). But as long as one doesn't try to set up precise phase relationships, I don't think this is necessarily a huge issue.

rdgao commented 4 years ago

yeah, if there are windows in your STFT that got non-integer number of cycles (as would be the case if you simulated a sinusoid continuously for a non-integer number of cycles in total, say 9.1s at fs=1000), it will give you a bit of a weird spectra as well, depending on your compute_spectrum window and step size.

the weird spectra is independent from the tiling problem, BUT the tiling approach basically guarantees that this will happen, because each cycle of the sine is incomplete by definition.

for the continuous sine simulation, we definitely want to use the continuous approach, even if that means having an if statement that checks for 'sine' and calls np.sin() instead. Again, this is not just wrong in the sense that it finished on an incomplete cycle, but wrong in the sense that every cycle is incomplete.

a more difficult question is whether the tiling approach is correct in general, and whether we can do any better for the non-sinusoidal cases. Ideally, we'd want to define a (arbitrary) phase series, and transform it back into the signal you want through np.sine. In the case of actual sine waves, the phase would have linear gain (so just np.wrap(t), or whatever the function is). In the other cases, because you are shooting for a nonlinear phase gain in the first place (for shapier waves), that little bit of error probably doesn't matter much, but you will get a little bit of extra stuff that you didn't pay for at the concatenation points. Solving this problem for our simulation would sorta imply that we have the correct generative model (for the data), which would be a non-trivial scientific advance.

rdgao commented 4 years ago

one more note - the weird spectra is really only a problem if you want to start looking at shape-slope interactions, because as Tom pointed out, the magnitude is very different so the signal isn't really "corrupt" in any meaningful way. But judging by eye, that first PSD has a slope of -6, which is in the conceivable range of ECoG values. Basically, it'd be weird if you simulated a pure sine, fooofed it, and got a non-zero slope, and that kind of error will similarly propagate into any shapey-slopey investigations, but I don't have a good idea of how to fix it bc we don't analytically know which shapes have which slopes in the first place (aside from the pure sine).

TomDonoghue commented 4 years ago

I agree with everything other than "every cycle is incomplete". Unless, I'm mis-understanding something, I don't think this is the case. As far as I know, the single cycle is fine overall, just the frequency of the single cycle can be slightly off, which creates the discrepancy of number of cycles within a given time window. (For some frequencies, conditioned on sampling rate, the ideal number of samples for the single cycle is non-integer, so it gets rounded, creating a well formed, but slightly different frequency cycle, that then gets tiled, but that doesn't quite line up in the expected number of cycles. I think this is why tiled-8Hz is fine (see spectrum above), but tiled-9hz has the quirk).

I agree with adding an option for continuous sine sims. As for tiling in general, my impression is that it has these estimation limitations, and we should keep these in mind (including, for example, in some of the mentioned cases looking at slope), but that they are broadly relatively minor limitations, to be expected from finite / discrete simulations. If we had a clear alternative, we could / should go for it, but as of right now, perhaps this is the best we can do (pending much more involved work)?

I think I talked a little about this with @elybrand at some point - I would also be curious what he thinks on this.

rdgao commented 4 years ago

what I mean is this: at some oscillation frequency-to-sampling frequency ratios, you need a non-integer number of samples to complete a single cycle. in Ryan's original example, fs=1000 and freq=9, which means 1 cycle requires 1000/9 = 111.111~ samples. This would not be an issue if you simulate all 9 cycles in one go, as long as sampling rate is an integer, bc you end up with an even 1000/9 * 9 = 1000 samples, and always will when you simulate 1 second of data for an integer frequency. But 111.111.... samples is impossible because it's not a discrete number, hence the incomplete (or overcomplete) number of points. In other words, converting the continuous sine into discrete sine needs special time values that are not possible in non-integer ratios of frequency and sampling rate.

Just tracking this further:

sim_oscillations: n_seconds_cycle = int(np.ceil(fs / freq)) / fs gives you 0.112s. (I think removing the ceil might actually solve it, because that number (n_seconds_cycle) doesn't need to be nice in any way, so there's no reason to round it prior to dividing by fs again.

then in sim_cycle: times = create_cycle_time(n_seconds, fs) does this: 2 * np.pi * 1 / n_seconds * (np.arange(fs * n_seconds) / fs), and everything after the 2pi is 0 to 0.111s at 0.001 spacing, and the spacing of the sine function between that value at 0.111 to the value at the next 0 is not the same as the spacing between every other two points, hence the slight discontinuity. Judging by the time series plot, we get a little bit extra

elybrand commented 4 years ago

Here is an overly verbose response. Basically, @rdgao's first post sums it up.

sim_oscillations: n_seconds_cycle = int(np.ceil(fs / freq)) / fs gives you 0.112s. (I think removing the ceil might actually solve it, because that number (n_seconds_cycle) doesn't need to be nice in any way, so there's no reason to round it prior to dividing by fs again.

Removing the ceil has the undesired effect that the output signal will not be the correct length. For example, when n_seconds = 10, fs = 1000, there will be 90 cycles each of length 0.11 seconds which only gives you 9.99s rather than 10s. You need the ceil to get a signal whose length is at least 10 seconds, and then you can truncate as is done in the code.

@rdgao is on the right track though. In sim_oscillation, every cycle in theory should take 1/freq seconds. Of course, you may not have sampled at a rate proportional to 1/freq. By construction, the tiling approach assumes that there exists some integer n such that the fundamental domain of your signal is the interval [0 seconds, n/fs seconds]; that is, sig[kn : (k+1)n] == sig[jn : (j+1)n] for all j, k. This is what lets you independently patch tiles together without taking into account what's happening at the boundary of tiles. As we've observed above though, the tiling approach only matches the pure sine case when 1/freq = n/fs, i.e. fs = n * freq.

When the sampling rate is not an integer multiple of the frequency, then n_seconds_cycle = int(np.ceil(fs / freq)) / fs > 1/freq. The tiling approach then constructs a uniform grid of time samples in the interval [0, n_seconds_cycle] with spacing 1/fs when the pure sinusoidal samples are coming from uniform time samples in [0, 1/freq] with spacing 1/fs. Hence the phase shift.

Is there a way to remedy the tiling approach to handle the case when fs != n*freq for general signals? I think the answer has to be no. Strictly speaking, in the case where our samples come from the pure sine wave, which is periodic over the continuum, the discrete samples are only periodic by definition when there exists some integer n such that sig[kn : (k+1)n] == sig[jn : (j+1)n] for all j, k.

TL;DR: Discrete samples from a periodic function do not necessarily generate a discrete periodic function per se. In cases such as this, the tiling approach induces a phase shift.