Hello, I'm a master's student at ITMO University in Saint Petersburg, Russia.
Could you please explain what exactly this model implementation does?
As I understand it (variant 1), it takes as input a mixture of person A's
voice and person B's voice, plus the clean voice of person A, containing the
very same utterance that appears in the mixture, and tries to extract it
from the mixture. (That seems really strange, because it would be useless.)
But the paper (variant 2) says it should take the mixture and a clean
reference utterance of the target speaker, NOT the same utterance that is in
the mixture! And that is the whole point.
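To make sure I'm describing variant 2 correctly, here is a toy sketch of the data flow as I understand it from the paper. All function names, shapes, and the "model" itself are made up for illustration (this is not the real implementation): the reference utterance is only used to build a fixed-size speaker embedding (d-vector), and it is a different recording from the target speech inside the mixture.

```python
import numpy as np

def toy_dvector(reference_spec):
    # Stand-in for the speaker encoder: average over time -> fixed-size vector.
    # The real model would use a trained d-vector network instead.
    return reference_spec.mean(axis=1)

def toy_separate(mixed_spec, dvec):
    # Stand-in for the separator: predict a soft mask conditioned on the
    # d-vector and apply it to the mixture spectrogram.
    score = mixed_spec * dvec[:, None]
    mask = 1.0 / (1.0 + np.exp(-score))  # sigmoid mask in [0, 1]
    return mask * mixed_spec

rng = np.random.default_rng(0)
mixed = rng.random((257, 100))      # mixture: freq bins x frames
reference = rng.random((257, 80))   # DIFFERENT utterance of the target speaker

dvec = toy_dvector(reference)       # fixed-size embedding, shape (257,)
estimate = toy_separate(mixed, dvec)
print(estimate.shape)               # estimate has the mixture's shape: (257, 100)
```

The key property I expect is that the reference can have any length: only its embedding enters the separator, so the output always matches the mixture's shape.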
When I looked at the training set produced by the generator, I found that for every **-mixed.wav the corresponding **-target.wav contains another voice! (but not another phrase by the target speaker, as I thought it should be)
I got it, I was just blind: the audio that should be used for computing the embedding is listed in the **-dvec.txt file, so everything is alright. Nevertheless, I still have problems with the inference...
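For anyone else confused by the generator's output, here is a small sketch of how I now check the file layout. I am assuming each training example N produces exactly three files, N-mixed.wav, N-target.wav, and N-dvec.txt; the directory path and suffixes are taken from what I saw on disk, not from the code.

```python
import collections
import glob
import os

SUFFIXES = ("-mixed.wav", "-target.wav", "-dvec.txt")

def group_examples(train_dir):
    # Group the generator's output files by example id, so each id maps to
    # the set of suffixes actually present for it.
    groups = collections.defaultdict(set)
    for path in glob.glob(os.path.join(train_dir, "*")):
        name = os.path.basename(path)
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                groups[name[: -len(suffix)]].add(suffix)
    return groups

def incomplete(train_dir):
    # Return {example_id: missing suffixes} for every incomplete triple.
    want = set(SUFFIXES)
    return {k: want - v for k, v in group_examples(train_dir).items() if v != want}
```

Running `incomplete("train")` on the generated directory should return an empty dict if every mixed/target pair really does come with its d-vector reference file.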
Am I right? Or what's going on here?
Looking forward to your answer, thank you!