Hello, I'm a master's student at ITMO University in Saint Petersburg, Russia.
Could you please explain what exactly this model implementation does?
As I understand it (variant 1), it takes as input a mixture of person A's
voice and person B's voice, plus the clean voice of person A, containing the
very same utterance that appears in the mixture, and tries to extract it
from the mixture. (That seems really strange, because it would be useless.)
But the paper (variant 2) says it should take the mixture and a clean
reference utterance of the target speaker, NOT the same utterance that is in
the mixture! And that is the whole point.
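To make sure I'm describing variant 2 correctly, here is a toy sketch of the data flow as I understand it from the paper. All function names, shapes, and the "model" itself are made up for illustration (this is not the real implementation): the reference utterance is only used to build a fixed-size speaker embedding (d-vector), and it is a different recording from the target speech inside the mixture.

```python
import numpy as np

def toy_dvector(reference_spec):
    # Stand-in for the speaker encoder: average over time -> fixed-size vector.
    # The real model would use a trained d-vector network instead.
    return reference_spec.mean(axis=1)

def toy_separate(mixed_spec, dvec):
    # Stand-in for the separator: predict a soft mask conditioned on the
    # d-vector and apply it to the mixture spectrogram.
    score = mixed_spec * dvec[:, None]
    mask = 1.0 / (1.0 + np.exp(-score))  # sigmoid mask in [0, 1]
    return mask * mixed_spec

rng = np.random.default_rng(0)
mixed = rng.random((257, 100))      # mixture: freq bins x frames
reference = rng.random((257, 80))   # DIFFERENT utterance of the target speaker

dvec = toy_dvector(reference)       # fixed-size embedding, shape (257,)
estimate = toy_separate(mixed, dvec)
print(estimate.shape)               # estimate has the mixture's shape: (257, 100)
```

The key property I expect is that the reference can have any length: only its embedding enters the separator, so the output always matches the mixture's shape.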
When I looked at the training set produced by the generator, I found that for every **-mixed.wav the corresponding **-target.wav contains another voice! (but not another phrase by the target speaker, as I thought it should be)
I got it, I was just blind: the audio that should be used for computing the embedding is listed in the **-dvec.txt file, so everything is alright. Nevertheless, I still have problems with the inference...
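For anyone else confused by the generator's output, here is a small sketch of how I now check the file layout. I am assuming each training example N produces exactly three files, N-mixed.wav, N-target.wav, and N-dvec.txt; the directory path and suffixes are taken from what I saw on disk, not from the code.

```python
import collections
import glob
import os

SUFFIXES = ("-mixed.wav", "-target.wav", "-dvec.txt")

def group_examples(train_dir):
    # Group the generator's output files by example id, so each id maps to
    # the set of suffixes actually present for it.
    groups = collections.defaultdict(set)
    for path in glob.glob(os.path.join(train_dir, "*")):
        name = os.path.basename(path)
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                groups[name[: -len(suffix)]].add(suffix)
    return groups

def incomplete(train_dir):
    # Return {example_id: missing suffixes} for every incomplete triple.
    want = set(SUFFIXES)
    return {k: want - v for k, v in group_examples(train_dir).items() if v != want}
```

Running `incomplete("train")` on the generated directory should return an empty dict if every mixed/target pair really does come with its d-vector reference file.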
Am I right? Or what's going on here?
Looking forward to your answer, thank you!