ka9q / ka9q-radio

Multichannel SDR based on fast convolution and IP multicasting
GNU General Public License v3.0
161 stars 36 forks source link

estimate_n0 reads unititialized memory #3

Closed argilo closed 2 years ago

argilo commented 2 years ago

Valgrind reports that estimate_n0 reads uninitialized memory:

==12481== Conditional jump or move depends on uninitialised value(s)
==12481==    at 0x1134D5: estimate_n0 (radio.c:137)
==12481==    by 0x5C906DA: start_thread (pthread_create.c:463)
==12481==    by 0x636761E: clone (clone.S:95)

It seems this is because fdomain is not initialized. I don't yet know why that occurs.

ka9q commented 2 years ago

I think I can see the problem. master->fdomain[] is the frequency domain signal written by the forward FFT thread. The 'master' pointer is a convenience copy of Frontend.in, which itself is a pointer to a dynamically allocated block in create_filter_input(). Apparently it thinks Frontend.in may be uninitialized before it is read? Frontend is a global structure, so Frontend.in is at least initialized to NULL at startup (as opposed to being completely uninitialized) though it would still be an error to dereference a null pointer.

create_filter_input() is called from setup_frontend() in main.c just before the estimate_n0 thread is started so Frontend.in is already initialized. Not only that, but estimate_n0 doesn't even get look at master->fdomain until it has been woken up by the master thread after it has processed a block of data. Perhaps this logic is too difficult to figure out automatically.

Still, I am willing to try to figure out how to suppress this warning if only to avoid annoying more people with a spurious warning. I get spurious warnings all the time when I compile other peoples' code so I'd like to avoid contributing to the problem.

I can explain the overall structure of the filters in my package if you like. Each "master" section performs the (big) forward FFT shared by 0 or more "slave" sections that do the (usually much smaller) inverse FFTs on their chosen pieces of the frequency domain signal. The slaves synchronize themselves to the master with FFT block counters. Whenever the master has a new block ready, it increments its block counter and signals all its slaves, which sleep until the master's counter gets ahead of their own counter. Since this is a real time 1-to-N system, there's no back flow control; slaves must keep up or lose data. The master does write into a ND-way output ring buffer in case a slave gets unusually delayed by CPU scheduling.

The estimate_n0 thread estimates the noise power spectral density by piggybacking on this mechanism, looking directly at the master's frequency domain data.

estimate_n0 is rather experimental, and I think I'm going to revamp it anyway as it assumes the noise is flat over the entire spectrum (minus the edges with anti-alias filtering). This is a bad assumption on HF where noise is very definitely non-Gaussian. It also needs to account for different overlap factors that would bias the estimate.

--Phil

On 4/17/22 11:42, Clayton Smith wrote:

Valgrind reports that |estimate_n0| reads uninitialized memory:

|==12481== Conditional jump or move depends on uninitialised value(s) ==12481== at 0x1134D5: estimate_n0 (radio.c:137) ==12481== by 0x5C906DA: start_thread (pthread_create.c:463) ==12481== by 0x636761E: clone (clone.S:95) |

It seems this is because |fdomain| is not initialized. I don't yet know why that occurs.

— Reply to this email directly, view it on GitHub https://github.com/ka9q/ka9q-radio/issues/3, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADJDI4ZSKW73AHU76AGX4FLVFRLQ3ANCNFSM5TUHIQ4A. You are receiving this because you are subscribed to this thread.Message ID: @.***>

argilo commented 2 years ago

Apparently it thinks Frontend.in may be uninitialized before it is read? [...] Perhaps this logic is too difficult to figure out automatically.

Valgrind is not a static analysis tool. It executes the program and keeps track of memory allocations along the way. Spurious warnings do occasionally happen, but it's usually right.

I'll let you know if I track down what's happening here. Thanks for the detailed explanations!

ka9q commented 2 years ago

Interesting, the name 'valgrind' sounds familiar but I've never actually used it.

If you can tell me the exact error it throws I'll try to find and fix it. It might be a race condition that simply hasn't shown up for me yet.

One thing about signal processing is that an uninitialized variable might cause nothing worse than a momentary blip in output at startup that you'll never notice otherwise. But one could just as easily segfault, or in the case of floating point operations, create a NaN that permanently poisons some long-term state. You may notice that I check for this in things like exponential smoothing filters. This is one of those things I wish you could enable for testing and disable for production, as you already can with assert(). --Phil

On 4/18/22 06:24, Clayton Smith wrote:

Apparently it thinks Frontend.in may be uninitialized before it is
read? [...] Perhaps this logic is too difficult to figure out
automatically.

Valgrind is not a static analysis tool. It executes the program and keeps track of memory allocations along the way. Spurious warnings do occasionally happen, but it's usually right.

I'll let you know if I track down what's happening here. Thanks for the detailed explanations!

— Reply to this email directly, view it on GitHub https://github.com/ka9q/ka9q-radio/issues/3#issuecomment-1101406328, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADJDI4ZTWUZECRQ3XEL3ZQLVFVO7NANCNFSM5TUHIQ4A. You are receiving this because you commented.Message ID: @.***>

argilo commented 2 years ago

The exact error messages are as follows:

==28310== Thread 4 estn0:
==28310== Conditional jump or move depends on uninitialised value(s)
==28310==    at 0x1134F1: estimate_n0 (radio.c:147)
==28310==    by 0x5C906DA: start_thread (pthread_create.c:463)
==28310==    by 0x636761E: clone (clone.S:95)
==28310== 
==28310== Conditional jump or move depends on uninitialised value(s)
==28310==    at 0x11341B: estimate_n0 (radio.c:137)
==28310==    by 0x5C906DA: start_thread (pthread_create.c:463)
==28310==    by 0x636761E: clone (clone.S:95)
==28310== 

Unfortunately, valgrind doesn't say which input to the conditional jump was uninitialized, but by adding printf statements for each of the variables it's possible to check which one is uninitialized. In that way, I can see that fdomain[132501] is uninitialized when it read at line 146 in the first loop iteration.

argilo commented 2 years ago

Here's how I did that:

diff --git a/radio.c b/radio.c
index 5221c0b..64598d8 100644
--- a/radio.c
+++ b/radio.c
@@ -133,6 +133,10 @@ void *estimate_n0(void *arg){
     if(init){
       int bin = first_bin;
       for(int i=0; i < bincnt; i++){
+        printf("1 i=%d\n", i);
+        printf("1 avg_pwrs[i]=%g\n", avg_pwrs[i]);
+        printf("1 bin=%d\n", bin);
+        printf("1 fdomain[bin]=%g %g\n", crealf(fdomain[bin]), cimagf(fdomain[bin]));
        avg_pwrs[i] += (cnrmf(fdomain[bin]) - avg_pwrs[i]) * .02; // tune or auto adjust this?
        min_bin_power = min(min_bin_power,avg_pwrs[i]);
        if(++bin == master->bins)
@@ -143,6 +147,10 @@ void *estimate_n0(void *arg){
       // and there can be a time skew between the frequency change and changing data
       int bin = first_bin;
       for(int i=0; i < bincnt; i++){
+        printf("2 i=%d\n", i);
+        printf("2 avg_pwrs[i]=%g\n", avg_pwrs[i]);
+        printf("2 bin=%d\n", bin);
+        printf("2 fdomain[bin]=%g %g\n", crealf(fdomain[bin]), cimagf(fdomain[bin]));
        avg_pwrs[i] = cnrmf(fdomain[bin]);
        min_bin_power = min(min_bin_power,avg_pwrs[i]);
        if(++bin == master->bins)

I built radio in debug mode, then ran valgrind ./radio /etc/radio/radio@2m.conf.

There are also some warnings about uninitialized bytes being passed to setsockopt but those can be ignored; valgrind apparently doesn't know about the padding in the struct.

argilo commented 2 years ago

It looks like the problem is that run_fft writes to f->fdomain[0] and wakes up estimate_n0, which reads from master->fdomain[1].

argilo commented 2 years ago

I see that run_fft increments block_num just before waking estimate_n0, so maybe there's an off-by-one here.

ka9q commented 2 years ago

Ah ha! I recently changed how that worked to get around a startup deadlock in wfm.c when a slave filter is read immediately after the the master is written. I thought I had fixed an off-by-one error but apparently I created another in the process. --Phil

On 4/18/22 12:27, Clayton Smith wrote:

It looks like the problem is that |run_fft| writes to |f->fdomain[0]| and wakes up |estimate_n0|, which reads from |master->fdomain[1]|.

— Reply to this email directly, view it on GitHub https://github.com/ka9q/ka9q-radio/issues/3#issuecomment-1101690695, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADJDI456H3CLVRDY2XYPKE3VFWZTVANCNFSM5TUHIQ4A. You are receiving this because you commented.Message ID: @.***>

ka9q commented 2 years ago

Try the patch I just pushed over. When I fixed the off-by-one between the master and slave filter sections in filter.c, this introduced an off-by-one in the way estimate_n0's reads the frequency domain data directly from the master section.

The only harm would have been a startup glitch in the n0 estimate, plus updating the estimate several buffers back instead of on the current estimate, neither of which was a big deal. But I still like to get rid of warnings.

On 4/18/22 12:31, Clayton Smith wrote:

I see that |run_fft| increments |block_num| just before waking |estimate_n0|, so maybe there's an off-by-one here.

— Reply to this email directly, view it on GitHub https://github.com/ka9q/ka9q-radio/issues/3#issuecomment-1101693679, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADJDI42OCXFEKJD5XIPZCCTVFW2CNANCNFSM5TUHIQ4A. You are receiving this because you commented.Message ID: @.***>

argilo commented 2 years ago

Looks good; radio now runs cleanly with no valgrind errors.