Open gauss256 opened 7 years ago
Hi Eduardo,
What are your plans for this code? It is described as unfinished and has not been updated lately.
I am interested in working on this, and even more interested in code for the subsequent paper that extends the technique to get much better results:
https://arxiv.org/abs/1607.02173
Hello,
I just merged branch 'intra_net' into master, since it had most of the working code. We had plans to implement the paper you mention, which is a continuation of the original work from the guys at MERL, but unfortunately we had other priorities. The branch 'end2end' was supposed to initiate this work.
I didn't re-test the code, and since it has been a while since we last touched this project, something may be broken or out of place. Feel free to file an issue if that is the case.
If (and that's a big if, because I really don't encourage you to do so) you wish to delve into spaghetti code, check our speech enhancement branches: 'irm' for ideal ratio masks and 'softmask' for some really experimental spectral masks based on the deep clustering framework.
Hello Eduardo, I heard that this method has been fully implemented and that the results were pretty good. Can you update it if it's convenient?
Hi Eduardo, I am implementing DC in TensorFlow and just cannot get the nets to converge on a 30-hour data set (they converge well on small data sets). My implementation is very similar to yours: no weighting, a -40 dB threshold for the loss calculation, 2-layer BLSTM + FC, and so on. So I checked out your code on GitHub, but it is described as unfinished. I want to know what is unfinished so that I can learn from your code.
@zhr1201 I've forked the code and updated it to use Keras 2 and TensorFlow 1.1. I haven't cleaned it up and made it publicly available yet though. Let me know if that would be useful to you.
I too have had good results on small-ish data sets (~1 hour or so of training data).
Is your data set available for me to try?
@gauss256 I just randomly mix up utterances from different speakers using the TIMIT corpus. It seems you've been focused on DC for some time, and I'm curious how your results are on a test set.
I guess my implementation is just like jcsilva's repository: same network structure, vanilla L2 loss after filtering out the silent TF bins. Different optimizers with different hyperparameters have been tried, and it just won't converge. I worry that there is some trick the authors forgot to mention in the paper.
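For reference, here is roughly what my loss computation looks like: a minimal TensorFlow sketch with illustrative shapes (the function and variable names are mine, not from any of the repositories mentioned here). It expands the Frobenius norm so the full (T*F)x(T*F) affinity matrix is never materialized:
```python
import tensorflow as tf

def dc_loss(v, y, w):
    """Deep clustering affinity loss ||V V^T - Y Y^T||_F^2.

    v: [batch, T*F, D] unit-normalized embeddings
    y: [batch, T*F, C] one-hot source assignments
    w: [batch, T*F, 1] 0/1 mask zeroing TF bins more than
       40 dB below the mixture maximum (the "silent" bins)
    """
    v = v * w
    y = y * w
    # ||V V^T - Y Y^T||^2 = ||V^T V||^2 - 2||V^T Y||^2 + ||Y^T Y||^2
    vtv = tf.matmul(v, v, transpose_a=True)  # [batch, D, D]
    vty = tf.matmul(v, y, transpose_a=True)  # [batch, D, C]
    yty = tf.matmul(y, y, transpose_a=True)  # [batch, C, C]
    return (tf.reduce_sum(tf.square(vtv))
            - 2.0 * tf.reduce_sum(tf.square(vty))
            + tf.reduce_sum(tf.square(yty)))
```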
I haven't run jcsilva's code yet, since I haven't installed Keras on my lab's server. His code seems to make sense, and I don't know what problem he encountered. If you get any idea why jcsilva labels the repo as unfinished, please let me know.
Thanks bro.
This work is labeled unfinished because we never got around to preparing the code for the SDR evaluation from the original paper, so we didn't fully reproduce the paper's pipeline. However, there should be enough code to generate binary masks for speech separation, and training should converge without much hassle. As mentioned above, the softmasks from the end-to-end follow-up work were not implemented.
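The inference step for the binary masks is just k-means on the embeddings. A minimal sketch of the idea (assuming NumPy/scikit-learn; shapes and names are illustrative, not the exact code in this repository):
```python
import numpy as np
from sklearn.cluster import KMeans

def binary_masks(embeddings, n_sources=2):
    """Cluster per-bin embeddings and turn the assignments into masks.
    embeddings: [T, F, D] network output for one mixture."""
    T, F, D = embeddings.shape
    labels = KMeans(n_clusters=n_sources).fit_predict(
        embeddings.reshape(-1, D))
    # One binary mask per source; multiply each with the mixture STFT.
    return np.stack([(labels == k).reshape(T, F).astype(np.float32)
                     for k in range(n_sources)])
```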
As for the resulting quality and practical applications, we never really went too far. We were investigating speech enhancement techniques at the time, and MERL's speech separation approach turned out to be overkill (and didn't really produce great results for us). It seems there is a limit to what can be achieved with single-channel sources.
We were applying all this work to Brazilian Portuguese speech processing. If you are curious about what we used, we started our work with the benchmark dataset provided here, along with the CHiME3 noise dataset for data augmentation. I don't know if the latter is still available for public use, as there are more recent CHiME challenges.
The latest work from MERL focuses on music separation; awesome work, by the way: http://www.merl.com/publications/docs/TR2017-010.pdf
Looks way more promising in terms of practicality than single-channel speaker separation IMO. Also, it looks easier to implement.
@akira-miasato It's been of great help, thx! And it's also obvious that dc is not currently suitable for practical use and more research can be done to make it a practically powerful tool.
@gauss256 Hi, are you going to open source your Tensorflow extension of this repository? I am going to use DC algorithm in source separation experiments for my master thesis and your code would be of great help for me. Thanks in advance.
Yes, since there is some interest, I will do that. I'll see if I can get to it this weekend.
@isklyar I have updated my fork here: https://github.com/SingSoftNext/deep-clustering
It's not pretty, but it works. I'm going to turn my attention now to implementing the algorithm in this follow-on paper: https://arxiv.org/abs/1611.08930
It's probably not hard to adapt the DC code to DANet. If someone gets there before I do, please let us know!
@gauss256 Great, thank you! Most probably I will also work on extending vanilla DC with some type of end-to-end training, either with DANet or an enhancement network. I will let you know if I achieve something in this direction.
@isklyar @gauss256 My TensorFlow implementation has also just been updated; I'm going to look for a job and hadn't updated my GitHub much before. https://github.com/zhr1201/deep-clustering/tree/master
You are welcome to check it out. If you want fair performance in reverberant conditions, training on a reverberant data set works well.
By the way, I'm now also working on DANet, and have gotten some results on small two-speaker datasets. For DC, the loss function is invariant to the number of sources, so two- and three-speaker mixtures are handled uniformly. In DANet, however, it's hard to deal with one-, two-, and three-speaker mixtures in one TensorFlow graph, especially the control flow for each sample in a batch.
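To illustrate the DC invariance: the loss only sees the affinity matrix Y Y^T, and all-zero columns don't change it, so mixtures with different speaker counts can share a batch by padding the label matrix. A small NumPy sketch (all names and shapes are illustrative):
```python
import numpy as np

N, C_MAX = 50, 3  # N TF bins (kept small for this demo), max sources

# Dummy one-hot labels for a two-speaker sample.
labels_2spk = np.eye(2, dtype=np.float32)[np.random.randint(2, size=N)]

# Pad Y to C_MAX with an all-zero column; Y Y^T (and hence the
# DC loss) is unchanged, so 2- and 3-speaker samples can be batched.
y = np.zeros((N, C_MAX), dtype=np.float32)
y[:, :2] = labels_2spk
assert np.allclose(y @ y.T, labels_2spk @ labels_2spk.T)
```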
@zhr1201 I hope you will be able to post your DANet code. It would speed things up to have that as a starting point.
@gauss256 We are not able to open source our implementation of DANet right away because we are now working together with a company. Sorry about that.
@gauss256 Did you go forward with the implementation of DANet? I'm getting interested in it, and it would be great to work on it together; some things are blocking me.
Yes, we have a rough implementation of DANet. Would be happy to collaborate. My email address is in my Github profile. Send me a message there.
@zhr1201 As you mentioned, DC is not currently suitable for practical use, so you moved to DANet. How about DANet's real-time performance on an embedded processor?
@LiJiongliang It is not a real-time model, because you need to feed in chunks of frames at once (approximately 0.8 s of data).
@zhr1201 In DANet, if you use the anchored attractor points, you can implement it in "real time": the main delay is that of the Fourier transform, so here 32 ms, way under the 0.8 s you suggest.
Just to explain where the 0.8 s comes from and why we can do better: the first training phase uses 100 frames of STFT as the input; each frame is 32 ms long, but they are computed every 8 ms, so 100 × 8 ms = 0.8 s. But the main point of the LSTM is that you can keep the cell state, so you can feed the 100 frames one by one and obtain exactly the same result. So the minimum delay would be "just" 32 ms.
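As a rough illustration, here is a minimal Keras sketch of frame-by-frame inference with a stateful LSTM (layer sizes and names are illustrative, not the trained model from this repository; note this only works with a unidirectional LSTM, since a BLSTM needs the whole chunk):
```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

N_FREQ, EMB = 129, 20  # illustrative: STFT bins, embedding size

# Stateful model: batch size 1, one 32 ms frame per call.
model = Sequential([
    LSTM(300, stateful=True, return_sequences=True,
         batch_input_shape=(1, 1, N_FREQ)),
    TimeDistributed(Dense(N_FREQ * EMB, activation='tanh')),
])

def stream(frames):
    """Feed STFT frames one at a time; the cell state is kept between
    calls, so the output matches feeding the whole 100-frame chunk."""
    model.reset_states()
    return [model.predict(f.reshape(1, 1, N_FREQ)) for f in frames]
```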