Open francqz31 opened 7 months ago
Thanks for your interest. I'm just doing one last clean of the data and rejigging the synthetic generation for a final run to see if I can improve the model. My notes are all in the readme. My biggest issue is how slow the data processing is at the moment, and I'm getting slightly distracted by solving that problem :)
My new mega moonshot is to run all the audio through a denoiser first before training. This can be seen as a kind of normalisation step and will hopefully mean that new data won't be so "out of domain".
I'll hopefully have some results from this at the end of the week.
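A rough sketch of the "denoise everything first" idea, not the actual nanodrz pipeline. Everything in it is an assumption for illustration: `noise_gate` is a hypothetical stand-in for a real denoiser, `peak_normalise` and `preprocess` are made-up helper names, and waveforms are plain lists of floats. The point is only that the exact same function runs on training clips and on new audio at inference time, so both land in the same "cleaned" domain.

```python
from typing import Callable, Iterable, List

Waveform = List[float]


def noise_gate(wave: Waveform, threshold: float = 0.02) -> Waveform:
    """Hypothetical stand-in for a real denoiser: zero samples below threshold."""
    return [s if abs(s) >= threshold else 0.0 for s in wave]


def peak_normalise(wave: Waveform, peak: float = 0.95) -> Waveform:
    """Scale the clip so its loudest sample has magnitude `peak`."""
    loudest = max((abs(s) for s in wave), default=0.0)
    if loudest == 0.0:
        return list(wave)  # silent or empty clip: nothing to scale
    scale = peak / loudest
    return [s * scale for s in wave]


def preprocess(wave: Waveform,
               denoise: Callable[[Waveform], Waveform] = noise_gate) -> Waveform:
    """Denoise, then normalise. Use this same function for training and inference."""
    return peak_normalise(denoise(wave))


def preprocess_dataset(clips: Iterable[Waveform],
                       denoise: Callable[[Waveform], Waveform] = noise_gate
                       ) -> List[Waveform]:
    """Apply the shared preprocessing step to every clip in a dataset."""
    return [preprocess(clip, denoise) for clip in clips]


if __name__ == "__main__":
    train_clips = [[0.01, 0.5, -0.25, 0.005], [0.0, 0.0]]
    print(preprocess_dataset(train_clips))
```

A real denoiser would be swapped in by passing it as the `denoise` argument, as long as it takes a waveform and returns a waveform.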
Compute-wise, if you have an A100 I can SSH into, that would definitely speed things up :)
Feel free to DM me on Signal.
1. Amazing. Well, I'm short of A100s right now (I used to have 9). I have an RTX 4090 and an RTX 3090; one is being trained on and one isn't, so I don't know if that would help? 2. I can recommend some of the best (SOTA) denoiser/speech enhancement algorithms if you want.
Ok wonderful. Also, once I have my 9 A100s back I will still offer them if you need them for any interesting project :) For denoising and enhancement, the best thing so far is https://github.com/yxlu-0102/MP-SENet :) Try it if you want and see if it is suitable for your usage; if not, I will recommend something else. In my use case it works the best.
There is also HiFi-GAN v2 (https://daps.cs.princeton.edu/projects/Su2021HiFi2/), but no code is available for it. Later I might try implementing it starting from https://github.com/rishikksh20/hifigan-denoiser (an unofficial implementation of v1) and add something more to it.
Hey Author, I really like the architecture and the technique. I was looking for something similar to this to diarize 1k+ hours of different speakers for TTS as accurately as possible. I'd like to see results of nanodrz in real use, for example on this video: https://streamable.com/m5xvgf
I would like to contribute compute or knowledge to scale this up so it can become the new SOTA, or be 99-100% accurate for an unknown number of speakers.
Thanks in advance