jcsilva / deep-clustering


Plans? #1

Open gauss256 opened 7 years ago

gauss256 commented 7 years ago

Hi Eduardo,

What are your plans for this code? It is described as unfinished and has not been updated lately.

I am interested in working on this, and even more interested in code for this subsequent paper that extends the technique to get much better results.

https://arxiv.org/abs/1607.02173

amiasato-zz commented 7 years ago

Hello,

I just merged branch 'intra_net' into master, since the former had most of the successful code. We had plans to implement the mentioned paper, which is a continuation of the original work from the guys at MERL, but unfortunately we had other priorities. The branch 'end2end' was supposed to initiate this work.

I haven't re-tested the code, and since it has been a while since we last fiddled with this project, something may be broken or out of place. Feel free to file an issue if that is the case.

IF, and that's a big if because I really don't encourage you to do so, you wish to delve into spaghetti code, check out our speech enhancement branches: 'irm' for ideal ratio masks and 'softmask' for some really experimental spectral masks based on the deep clustering framework.

DoubleTao93318 commented 7 years ago

Hello Eduardo. On May 24th I heard that this method had been fully implemented and that the results were pretty good. Could you update the repository when it's convenient?

zhr1201 commented 7 years ago

Hi Eduardo, I am implementing DC in TensorFlow and just cannot get the nets to converge on a 30 h data set (they converge well on small data sets). My implementation is very similar to yours: no weighting, a -40 dB threshold when computing the loss, a 2-layer BLSTM + FC, etc. So I checked out your code on GitHub, but it is described as unfinished. I'd like to know what exactly is unfinished, so that I can learn from your code.
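For reference, the loss I'm computing is the usual affinity objective ||VV^T - YY^T||_F^2 from the paper, expanded so the huge affinity matrices are never built. A minimal sketch of that part (variable names are mine, not from jcsilva's code):

```python
import tensorflow as tf

def dc_loss(embeddings, labels):
    """Deep clustering loss ||VV^T - YY^T||_F^2, computed as
    ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2 so the
    (time*freq)^2 affinity matrices never have to be materialized.

    embeddings: [batch, time*freq, emb_dim] unit-norm embeddings (V)
    labels:     [batch, time*freq, n_speakers] one-hot labels (Y),
                zeroed for TF bins more than 40 dB below the maximum
    """
    vtv = tf.matmul(embeddings, embeddings, transpose_a=True)
    vty = tf.matmul(embeddings, labels, transpose_a=True)
    yty = tf.matmul(labels, labels, transpose_a=True)
    return (tf.reduce_sum(tf.square(vtv))
            - 2.0 * tf.reduce_sum(tf.square(vty))
            + tf.reduce_sum(tf.square(yty)))
```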

gauss256 commented 7 years ago

@zhr1201 I've forked the code and updated it to use Keras 2 and TensorFlow 1.1. I haven't cleaned it up and made it publicly available yet though. Let me know if that would be useful to you.

I too have had good results on small-ish data sets (~1 hour or so of training data).

Is your data set available for me to try?

zhr1201 commented 7 years ago

@gauss256 I just randomly mix utterances from different speakers from the TIMIT corpus. It seems you've been focused on DC for some time; I'm curious how your results look on a test set.
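The mixing itself is nothing special; roughly this (simplified sketch, paths and the SNR range are placeholders for my actual script):

```python
import numpy as np
import soundfile as sf

def mix_pair(path_a, path_b, snr_db=0.0):
    """Mix two single-speaker utterances at a given SNR in dB."""
    a, rate = sf.read(path_a)
    b, _ = sf.read(path_b)
    n = min(len(a), len(b))              # truncate to the shorter utterance
    a, b = a[:n], b[:n]
    # scale b so that a is snr_db louder than b
    gain = np.sqrt(np.sum(a ** 2) / (np.sum(b ** 2) * 10 ** (snr_db / 10)))
    mix = a + gain * b
    return mix / np.max(np.abs(mix)), rate   # peak-normalize to avoid clipping

# e.g. two utterances from different TIMIT speakers (hypothetical paths):
# mix, rate = mix_pair("timit/dr1/fcjf0/sa1.wav", "timit/dr1/mdpk0/sa2.wav",
#                      snr_db=np.random.uniform(-5, 5))
```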

I believe my implementation is just like jcsilva's repository: the same network structure and a vanilla L2 loss after filtering out the silent TF bins. I have tried different optimizers with different hyperparameters, and it just won't converge. I worry there is some trick the authors forgot to mention in the paper.

I haven't run jcsilva's code yet, since I haven't installed Keras on my lab's server. His code seems to make sense, and I don't know what problem he encountered. If you have any ideas about why jcsilva labels the repo as unfinished, please let me know.

Thanks bro.

amiasato-zz commented 7 years ago

This work is labeled unfinished because we never got around to reproducing the SDR results in the original paper, so we didn't really reproduce the paper's full pipeline. However, there should be enough code to generate binary masks for speech separation, and training should converge without much hassle. As mentioned above, the soft masks from the end-to-end follow-up work were not implemented.
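If it helps, inference with the trained net boils down to clustering the per-bin embeddings and masking the mixture spectrogram, roughly like this (a sketch of the idea, not the exact code in the repo):

```python
import numpy as np
from sklearn.cluster import KMeans

def binary_masks(embeddings, mix_stft, n_speakers=2):
    """Cluster per-bin embeddings and mask the mixture STFT.

    embeddings: [time*freq, emb_dim] network output for one mixture
    mix_stft:   [time, freq] complex STFT of the same mixture
    Returns one masked STFT per speaker (ISTFT them to get audio).
    """
    bins = KMeans(n_clusters=n_speakers).fit_predict(embeddings)
    assignment = bins.reshape(mix_stft.shape)  # cluster index per TF bin
    return [mix_stft * (assignment == k) for k in range(n_speakers)]
```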

As for the resulting quality and practical applications, we never really went too far. We were investigating speech enhancement techniques at the time, and MERL's speech separation approach turned out to be overkill (and didn't really produce great results for us). There seems to be a limit to what can be achieved with single-channel sources.

We were applying all this work to Brazilian Portuguese speech processing. If you are curious about what we used, we started with the benchmark dataset provided here, along with the CHiME3 noise dataset for data augmentation. I don't know if the latter is still available for public use, as there are more recent CHiME challenges.

MERL's latest work focuses on music separation; awesome work, by the way: http://www.merl.com/publications/docs/TR2017-010.pdf

Looks way more promising in terms of practicality than single-channel speaker separation IMO. Also, it looks easier to implement.

zhr1201 commented 7 years ago

@akira-miasato This has been a great help, thanks! It also seems clear that DC is not currently suitable for practical use, and that more research is needed to make it a practically powerful tool.

isklyar commented 7 years ago

@gauss256 Hi, are you going to open-source your TensorFlow extension of this repository? I am going to use the DC algorithm in source separation experiments for my master's thesis, and your code would be a great help to me. Thanks in advance.

gauss256 commented 7 years ago

Yes, since there is some interest, I will do that. I'll see if I can get to it this weekend.


gauss256 commented 7 years ago

@isklyar I have updated my fork here: https://github.com/SingSoftNext/deep-clustering

It's not pretty, but it works. I'm going to turn my attention now to implementing the algorithm in this follow-on paper: https://arxiv.org/abs/1611.08930

It's probably not hard to adapt the DC code to DANet. If someone gets there before I do, please let us know!

isklyar commented 7 years ago

@gauss256 Great, thank you!! Most probably I will also work on extending vanilla DC with some type of end-to-end training, either with DANet or an enhancement network. I will let you know if I achieve something in this direction.

zhr1201 commented 7 years ago

@isklyar @gauss256 My TensorFlow implementation has also been updated just now; I'm going to look for a job, and I didn't update my GitHub much before TUT. https://github.com/zhr1201/deep-clustering/tree/master

You are welcome to check it out. If you want fair performance in reverberant conditions, using a reverberant training set can give satisfactory results.

By the way, I'm now also working on DANet and have some results on small datasets with two-speaker mixtures. For DC, the loss function is invariant to whether the mixture has two or three speakers. In DANet, however, it's hard to handle one-, two-, and three-speaker mixtures in the same TensorFlow graph, especially the control flow for every sample in a batch of data.
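For context, the training-time attractors in DANet are just label-weighted means of the embeddings; per my reading of the paper, something like this (sketch, assuming a fixed number of sources per batch):

```python
import tensorflow as tf

def attractors(embeddings, labels):
    """DANet training-time attractors: the average embedding of the
    TF bins dominated by each source.

    embeddings: [batch, time*freq, emb_dim]   (V)
    labels:     [batch, time*freq, n_sources] (Y, one-hot)
    Returns:    [batch, n_sources, emb_dim]
    """
    num = tf.matmul(labels, embeddings, transpose_a=True)  # Y^T V
    den = tf.reduce_sum(labels, axis=1)                    # bins per source
    return num / (tf.expand_dims(den, -1) + 1e-8)
```

The control-flow problem is exactly that n_sources varies between mixtures, so this shape is not fixed across a batch.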

gauss256 commented 7 years ago

@zhr1201 I hope you will be able to post your DANet code. It would speed things up to have that as a starting point.

zhr1201 commented 7 years ago

@gauss256 We are not able to open source our implementation of DANet right away because we are now working together with a company. Sorry about that.

mpariente commented 6 years ago

@gauss256 Did you go forward with the implementation of DANet? I'm getting interested in it, and it would be great to work on it together; some things are blocking me.

gauss256 commented 6 years ago

Yes, we have a rough implementation of DANet. Would be happy to collaborate. My email address is in my Github profile. Send me a message there.

LiJiongliang commented 6 years ago

@zhr1201 As you mentioned, DC is not currently suitable for practical use, and you have moved on to DANet. What about DANet's real-time performance on an embedded processor?

zhr1201 commented 6 years ago

@LiJiongliang It is not a real-time model, because you need to feed in chunks of frames at once (approximately 0.8 s of data).

mpariente commented 6 years ago

@zhr1201 In DANet, if you use the anchored attractor points, you can implement it in "real time": the main delay is that of the Fourier transform, here 32 ms, way under the 0.8 s you suggest.

Just to explain where the 0.8 s comes from and why we can do better: the first training phase uses 100 frames of STFT as input. Each frame is 32 ms long, but frames are computed every 8 ms, so 100 × 8 ms = 0.8 s. But the main point of the LSTM is that you can keep the cell state, so you can feed the 100 frames one by one and obtain exactly the same result. The minimum delay would then be "just" 32 ms.
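In Keras terms, that is just a stateful (unidirectional) LSTM fed one frame per call; note the trick does not apply to a BLSTM, which needs the future frames of the chunk. A toy sketch with modern tf.keras (layer sizes are placeholders):

```python
import numpy as np
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

N_FREQ, N_HIDDEN = 129, 300           # placeholder sizes

# stateful=True keeps the cell state between predict() calls, so feeding
# 100 frames one by one gives the same output as feeding them at once
inp = Input(batch_shape=(1, 1, N_FREQ))
h = LSTM(N_HIDDEN, return_sequences=True, stateful=True)(inp)
out = Dense(N_FREQ, activation="sigmoid")(h)
model = Model(inp, out)

for frame in np.random.randn(100, N_FREQ).astype("float32"):
    mask = model.predict(frame.reshape(1, 1, N_FREQ), verbose=0)
model.reset_states()                  # call between utterances
```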