jzi040941 / PercepNet

Unofficial implementation of PercepNet: A Perceptually-Motivated Approach for Low-Complexity, Real-Time Enhancement of Fullband Speech
BSD 3-Clause "New" or "Revised" License

Discussion about PercepNet #7

Closed jzi040941 closed 3 years ago

jzi040941 commented 3 years ago

I got an email from Yuyung Liau asking me to open a discussion as a GitHub issue. Since I have received a few emails about implementing PercepNet, I think it's better to share with more people rather than one by one, so I opened this issue. Any kind of discussion about PercepNet is welcome!

OscarLiau commented 3 years ago

Hi Noah,

Thanks for opening this page for discussion! I've been working on a similar project (PercepNet for AEC) recently, and I ran into some trouble with the pitch coherence calculation, so I was wondering if I did anything wrong in this part.

Firstly, I applied the comb filter to the clean speech x[n] to get p-hat[n], then calculated the pitch coherence: [equation image: pitch coherence]

Secondly, since the paper says p is not available, I used p-hat to calculate the pitch coherence of x, y and p-hat. Here a problem appears: the comb-filter ratio r computed from these coherences does not stay within 0~1, and sometimes alpha is even NaN because the value inside the square root is negative: [equation image: alpha]

So I have several questions about this part and would appreciate some ideas:

  1. Which signal should the comb filter be applied to in order to yield p-hat[n]?
  2. What is a reasonable range of values for pitch coherence? The range I got is -1~1, and that yields an abnormal comb-filter ratio r.
  3. How should the pitch coherence of x, y and p-hat be calculated? I presume p in the pitch coherence equation is p-hat, but I don't know if that's right.

Thank you!

jzi040941 commented 3 years ago

Hi, Oscar

Here are my answers, in my opinion:

  1. I apply the comb filter to X to yield p-hat, and also apply the comb filter to Y to yield another p-hat (not sure if this is correct).
  2. A reasonable range for pitch coherence is 0~1, I think.
  3. I use the attenuation term only for the pitch coherence of p-hat; the coherence of x and y is calculated by equation (5). I also assume p is p-hat when calculating pitch coherence.

If you have a NaN problem, my suggestion is to clip your data. Sometimes the pitch coherence goes negative; my solution was to set it to 0 when it is below 0.

What I wrote is based on my understanding and could be wrong, so if anyone has intuition about this, feel free to leave a comment. Thanks!

OscarLiau commented 3 years ago

Thanks Noah! But I still got a NaN problem with alpha after following your suggestion, so I took the absolute value of the pitch coherence to solve it.

However, the abnormal comb-filter strength remained: some values I got were not within 0~1. Did you have this problem when you clipped the pitch coherence? I don't know if I should clip the comb-filter strength as well.

sTarAnna commented 3 years ago

Hi Noah,

Thanks for sharing your code!

I'm now running your latest data generation code. Looking at the result for two sample files, I have a question about the data generation step. The calculated gains of the first five frames are always 0; I think this is caused by the 'look-ahead' in the original paper. I think the first calculated gain and filter strength are the labels for frame -5 when you input frame 0, so training the model may need the gain and filter strength outputs to be shifted 5 frames ahead. Could you please tell me if I have made a mistake? Thanks!

jzi040941 commented 3 years ago

Hi, Oscar! Let me show you my filter strength function and explain it.

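#include <math.h>  // sqrt

// Exp, Eyp, Ephatp: per-band correlation of x, y and p-hat with the pitch signal
void filter_strength_calc(float *Exp, float *Eyp, float *Ephatp, float *r){
  float a, b, c, alpha;
  for(int i=0; i<NB_BANDS; ++i){
    a = Ephatp[i]*Ephatp[i] - Exp[i]*Exp[i];
    if (a<0) a=0;  // clip negative term to 0
    b = Ephatp[i]*Eyp[i]*(1-Exp[i]*Exp[i]);
    c = Exp[i]*Exp[i]-Eyp[i]*Eyp[i];
    if (c<0) c=0;  // clip negative term to 0
    alpha = (sqrt(b*b + a*c)-b)/(a+1e-8);  // epsilon avoids division by zero
    r[i] = alpha/(1+alpha);
  }
}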

I also applied clipping to each term, not only to the pitch coherence. For me the a and c terms are sometimes lower than 0, so I force them to 0. Ideally Exp should be larger than Eyp (Exp > Eyp): since y is x + noise, the correlation between Y and P (Eyp) must be lower than Exp. In practice, though, in some frames or bands Exp is smaller than Eyp (Exp < Eyp). I treat such data as outliers and set the term to 0. Finally, I add a small epsilon to the denominator of the alpha term to prevent NaN errors.

The result of this function always ranges within 0~1 (but I am still wondering whether it is correct).

jzi040941 commented 3 years ago

Hi, @sTarAnna!

Thanks for checking my code. The reason the first five outputs are zero is comb_buf. comb_buf is needed for the comb filtering implementation, and I use a buffer of 5 frames in comb_buf, so it takes 5 frames to get the first output. I also buffered Y, not only the outputs (r, g); you can check that the first five frames of Ey and EphatY are also zero in the result.
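
To illustrate why buffering produces leading zero frames, here is a minimal sketch of a 5-frame delay line. It is not the repository's comb_buf implementation; `frame_delay`, `frame_delay_init`, `frame_delay_push` and the tiny `FRAME_LEN` are hypothetical and chosen only for illustration.

```c
#include <string.h>

#define DELAY_FRAMES 5   /* buffering depth mentioned above */
#define FRAME_LEN    4   /* tiny hypothetical frame size, for illustration */

typedef struct {
    float buf[DELAY_FRAMES][FRAME_LEN]; /* frame history, starts as zeros */
    int pos;                            /* slot holding the oldest frame */
} frame_delay;

void frame_delay_init(frame_delay *d) {
    memset(d, 0, sizeof(*d));
}

/* Stores `in` and writes to `out` the frame pushed DELAY_FRAMES calls ago.
 * The first DELAY_FRAMES calls therefore return all-zero frames.
 * `in` and `out` must not point to the same array. */
void frame_delay_push(frame_delay *d, const float *in, float *out) {
    memcpy(out, d->buf[d->pos], sizeof(d->buf[0]));
    memcpy(d->buf[d->pos], in, sizeof(d->buf[0]));
    d->pos = (d->pos + 1) % DELAY_FRAMES;
}
```

If the features and the labels (r, g) pass through the same delay, they stay aligned frame by frame, so only the leading zero frames need special handling.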

sTarAnna commented 3 years ago

Thanks for your reply! I have found my mistake.

OscarLiau commented 3 years ago

Hi Noah, thanks for sharing!

In recent days I've been working on post-filtering, trying to integrate the whole PercepNet pipeline before training the NN. I have finished the post-filtering part for now and obtained the final result x_hat. However, the result does not perform well with either my adaptation (taking the absolute value of all coherences and alpha) or your suggestion (data clipping). I was wondering if something else is wrong, maybe in the post-filtering part. I'm not sure if you have already tried post-filtering, so I have listed the problems I have:

  1. I used the strength calculated from the coherence and the gain defined in the paper to run PercepNet: [equation image: gain and strength]

Taking one set of training data, I think that if these two parameters are used to process the noisy signal y with PercepNet, the output x_hat should be close to the clean x, right?

  2. The warped gain in the envelope post-filtering section: [equation image: g_warped]

Since this calculation includes a sine function, the warped gain can sometimes be negative. I don't know if this is reasonable. As before, I took the absolute value of the warped gain to avoid a NaN problem when calculating the global gain G: [equation image: G]

Here I simply took E0 and E1 from eq. (2) and eq. (13) respectively.

  3. The definition of energy is ambiguous in this paper. In the post-filtering part, it says the energy is: [equation image: band energy]

In signal processing, we usually calculate energy as the sum of squared signal values. However, an earlier section of the paper defines the same symbol as the 2-norm of the signal:

[equation image: energy def]

While implementing the envelope post-filtering section, I was not sure whether I should take the 2-norm or the squared value as the energy. I chose the former throughout PercepNet.

If you have any suggestions or comments, please share them with me. Thank you!

jzi040941 commented 3 years ago

Hi, Oscar. I haven't tried post-filtering yet, but yesterday I got good performance applying only the gain and filter strength with my implementation (data clipping). I recommend checking the band gain multiplication in the STFT domain, which was the problem for me. How about testing your x_hat without post-filtering?

  1. Yes, you're right. I used the strength and gain as you mentioned and obtained X_hat by applying these two parameters to Y. My X_hat is very close to x, which means the noise was removed.

  2. Since I haven't tried post-filtering yet, I don't have much intuition about it, so I have a question rather than an answer (I hope someone else can answer you). Why can the warped gain be negative? The range of gb is 0 ~ 1, right? Then the argument of the sine function must be within 0 ~ pi/2, and since sin(0)=0 and sin(pi/2)=1, I think it cannot become negative. One more thing: could you explain what E0 and E1 are?

  3. I think you should use the squared value. In the RNNoise GitHub repository, which is the earlier work by the author of PercepNet, the band energy is calculated with squared values. I applied the same approach as RNNoise, as shown below:

    void compute_band_energy(float *bandE, const kiss_fft_cpx *X) {
      // ...
      for (j=0;j<band_size;j++) {
        float tmp;
        float frac = (float)j/band_size;
        // squared magnitude of the bin: re^2 + im^2
        tmp = SQUARE(X[(erb_band->nfftborder[i]) + j].r);
        tmp += SQUARE(X[(erb_band->nfftborder[i]) + j].i);
        // split the energy between the two neighboring bands
        sum[i] += (1-frac)*tmp;
        sum[i+1] += frac*tmp;
      }
      // ...
    }

Thanks!

OscarLiau commented 3 years ago

Hi Noah,

Thanks for testing the PercepNet performance! Regarding 1 and 2, I think I understand why my gb is sometimes larger than 1: I am using PercepNet for a different application.

In my AEC application, the noisy y is the AEC output, i.e. the echo-cancelled signal. As a result, it is reasonable that the energy of y is sometimes lower than that of the near-end clean speech x.

E0 and E1 are the energies of the enhanced signal using gb and g_warped respectively. If the signal energy is a sum of squares, then (E0/E1) should be (gb/g_warped)^2.
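
A minimal sketch of that energy relation, assuming Ey[b] holds the squared-magnitude energy of band b of the noisy signal and ignoring any overlap between neighboring bands. `enhanced_energy` is a hypothetical helper, not a function from the repository.

```c
/* Total energy after applying per-band amplitude gains. Because the gain
 * scales the amplitude, it enters the energy squared. */
float enhanced_energy(const float *gain, const float *Ey, int nb_bands) {
    float e = 0.f;
    for (int b = 0; b < nb_bands; ++b)
        e += gain[b] * gain[b] * Ey[b];
    return e;
}
```

Calling it once with the gb array gives E0 and once with the g_warped array gives E1. For a single band the ratio E0/E1 reduces to (gb/g_warped)^2, while with the 2-norm convention it would be gb/g_warped, so the two energy definitions lead to different ratios.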

Therefore, I will keep modifying the DNS-Challenge PercepNet to suit the AEC application, especially the energy-related calculations.

Thanks!

yin-zhang commented 3 years ago

Is there a pre-trained model for PercepNet?

zhangyutf commented 3 years ago

Hi Noah, thank you for sharing! I'm stuck on the formula for computing the gain attenuation term: [image] In your code, that is: [image] The squares of Exp and Ephatp can be any non-negative numbers, so the value under the square root may be negative. How can this be solved?

jzi040941 commented 3 years ago

There's no pre-trained model yet. Feel free to contribute one if you already have it!

jzi040941 commented 3 years ago

I recommend setting Exp and Ephatp to 0 if they are lower than 0 before adjusting the gain strength, as below:

for (int i=0; i<NB_BANDS; ++i){
   // clip negative coherence values to 0
   if(Ephatp[i]<0) Ephatp[i] = 0;
   if(Exp[i]<0) Exp[i] = 0;
}

jzi040941 commented 3 years ago

Moved to https://github.com/jzi040941/PercepNet/discussions

YangangCao commented 3 years ago

Hello, I am working on PercepNet for AEC again! Could we have a talk? Maybe we can help each other.

jzi040941 commented 3 years ago

Sorry for the late reply. Good to see you again, and of course we can help each other! Feel free to use this page: https://github.com/jzi040941/PercepNet/discussions

cloudvc commented 2 years ago

@jzi040941 Thanks for your great work. Do you have any plans for RES+NS based on PercepNet?