Questions about training my model

I'm trying to switch from the Google Colab version to running this repo locally so I can get output faster, but I'm not sure where to begin with training my own model. At the moment I'm using the base model that ships with every release.

So, what audio is required for training? Does it have to come from https://sigsep.github.io/datasets/dsd100.html? If not, I currently have two files to use as training material: a version of a song with vocals and an off-vocal (instrumental) version of the same song. To check whether the vocals can be extracted cleanly, I loaded both into Audacity (the process is essentially https://www.youtube.com/watch?v=_p5QC-4jGWA). TL;DR: you align the two tracks and invert one of them so the music cancels out, which in theory leaves only the vocals. The issue is that some other parts, e.g. a synth, are still audible in the result. If the tracks aren't perfectly aligned, or if leftovers like that remain, will the trained model misbehave and remove things other than the vocals?
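To make the invert-and-align step concrete, here is a rough sketch of what I understand Audacity to be doing, written with NumPy. The function names (`estimate_offset`, `cancel_instrumental`) and the `max_lag` search window are my own, not anything from this repo, and it assumes both tracks are already loaded as mono float arrays at the same sample rate and the same loudness.

```python
import numpy as np

def estimate_offset(mix, inst, max_lag):
    """Find the lag (in samples) where inst lines up best with mix.

    A positive lag means inst starts `lag` samples later inside mix.
    Brute-force search over [-max_lag, max_lag]; fine for a sketch,
    slow for full-length songs.
    """
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = mix[lag:], inst[:len(inst) - lag]
        else:
            a, b = mix[:len(mix) + lag], inst[-lag:]
        n = min(len(a), len(b))
        score = float(np.dot(a[:n], b[:n]))  # correlation at this lag
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def cancel_instrumental(mix, inst, max_lag=2000):
    """Shift inst by the estimated lag and subtract it from mix.

    Subtracting is the same as inverting one track and summing.
    Returns (residual, lag); the residual is the vocal estimate.
    """
    lag = estimate_offset(mix, inst, max_lag)
    shifted = np.zeros_like(mix)
    if lag >= 0:
        n = min(len(mix) - lag, len(inst))
        shifted[lag:lag + n] = inst[:n]
    else:
        n = min(len(mix), len(inst) + lag)
        shifted[:n] = inst[-lag:-lag + n]
    return mix - shifted, lag
```

This only cancels what is sample-identical in both files. If the off-vocal version was mastered differently, or has a slightly different synth level, that difference stays in the residual, which I suspect is what I'm hearing.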
Edit: I've looked at the datasets available online, and they include vocals and mixtures but no instrumental files. How do I generate those? ffmpeg does odd things, such as changing the gain (dB) when I use amix.
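For reference, this is the kind of thing I'm trying to get out of ffmpeg, sketched in NumPy instead: either subtract the vocals from the mixture, or sum the non-vocal stems without any rescaling. The function names are made up for illustration, and it assumes the audio is already decoded into float arrays of the same sample rate and channel layout. I believe amix rescales its inputs by default, which would explain the gain change, but I haven't confirmed that.

```python
import numpy as np

def instrumental_from_mixture(mixture, vocals):
    """Instrumental estimate: mixture minus vocals, sample by sample.

    Only exact if the mixture really is the plain sum of its stems.
    """
    n = min(len(mixture), len(vocals))
    return mixture[:n].astype(np.float64) - vocals[:n].astype(np.float64)

def mix_stems(stems):
    """Sum stems (e.g. bass, drums, other) with no gain change."""
    n = min(len(s) for s in stems)
    total = np.zeros(stems[0][:n].shape, dtype=np.float64)
    for s in stems:
        total += s[:n]
    return total

def peak_level(signal):
    """Largest absolute sample; above 1.0 would clip in a float WAV."""
    return float(np.max(np.abs(signal)))
```

If `peak_level` of the summed result goes above 1.0, I assume the whole track would need to be scaled down by one constant factor, and the matching mixture by the same factor, so the pair stays consistent.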
After running python3 train.py ..., I get # epoch 0 and the process doesn't exit. Does this mean I should just wait? How many epochs are there before it ends, and how can I make sure the result gets saved to a .pth file? (In other words, how do I exit training properly?)

Thank you for taking the time to read this...

tsurumeso / vocal-remover