starrytong / SCNet

Training targets other than the default BVOD? #13

Closed: ari-ruokamo closed this issue 2 months ago

ari-ruokamo commented 2 months ago

I see there's a closed ticket about training other targets than the default 4 stems; a follow-up question:

I scripted the dataset for OV (other, vocals, mixture) and changed both entries in config.yaml:

    data.sources: [other, vocals]
    model.sources: [other, vocals]

Is the solver index manipulation still required, as described here: https://github.com/starrytong/SCNet/issues/4

I'm asking because the training and evaluation results don't seem right; e.g. the total and per-instrument NSDR values are negative or very low.
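
For concreteness, a hypothetical preprocessing script along the lines described above, assuming a MUSDB-style folder of per-song stem WAVs (paths and names are illustrative; the maintainer advises against this premixing in the next comment):

    import soundfile as sf
    from pathlib import Path

    # Hypothetical sketch; directory layout and paths are assumptions.
    SRC, DST = Path("musdb/train"), Path("musdb_ov/train")
    for song in SRC.iterdir():
        stems, sr = {}, None
        for name in ("drums", "bass", "other", "vocals"):
            stems[name], sr = sf.read(str(song / f"{name}.wav"))
        out = DST / song.name
        out.mkdir(parents=True, exist_ok=True)
        # Premix everything that is not vocals into a single 'other' target.
        sf.write(str(out / "other.wav"),
                 stems["drums"] + stems["bass"] + stems["other"], sr)
        sf.write(str(out / "vocals.wav"), stems["vocals"], sr)
        sf.write(str(out / "mixture.wav"), sum(stems.values()), sr)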

starrytong commented 2 months ago

No additional changes are needed. Is your 'other' stem obtained by overlaying 'drums', 'bass', and 'other'? It's best not to do this, as it could weaken the effect of data augmentation.
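
The remix-style augmentation this refers to shuffles each stem independently across the batch before summing them into new mixtures. A minimal, self-contained sketch of the idea (hypothetical function name, not the repo's code):

    import torch

    def remix(sources: torch.Tensor) -> torch.Tensor:
        """Shuffle each stem independently across the batch.
        sources: [B, S, C, T] = (batch, stems, channels, time)."""
        B, S, _, _ = sources.shape
        out = torch.empty_like(sources)
        for s in range(S):
            out[:, s] = sources[torch.randperm(B), s]
        return out

    # With four independent stems, a batch of B songs can yield up to B**4
    # distinct remixed mixtures; premixing drums/bass/other into one stem
    # leaves only two shuffleable stems (B**2 mixtures), so the model sees
    # far less mix diversity.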

ari-ruokamo commented 2 months ago

> No additional changes are needed. Is your 'other' stem obtained by overlaying 'drums', 'bass', and 'other'? It's best not to do this, as it could weaken the effect of data augmentation.

Yes, it is; that seemed the logical thing to do, since there are two targets: the vocals and the mixture minus the vocals. Why is it not beneficial to do so? Isn't the augmentation done on a per-instrument basis? Should I then keep only the vocals if I want to train a model that extracts and outputs the vocals and the "rest"?

Thanks again!

starrytong commented 2 months ago

It’s better to achieve the separation of vocals and other through modifications in the solver rather than by overlaying the data, unless you only have audio for "vocals" and "other".

            sources = sources.to(self.device)
            if train:
                # Training batches hold the four stems from data.sources:
                # ['drums', 'bass', 'other', 'vocals'] -> indices 0..3.
                sources = self.augment(sources)
                mix = sources.sum(dim=1)

                other = sources[:, 0:3].sum(dim=1)       # drums + bass + other
                vocals = sources[:, 3].unsqueeze(dim=1)
                sources = torch.cat([other.unsqueeze(dim=1), vocals], dim=1)
            else:
                # Validation batches are prepended with the mixture, so the
                # stems shift to indices 1..4.
                mix = sources[:, 0]
                other = sources[:, 1:4].sum(dim=1)       # drums + bass + other
                vocals = sources[:, 4].unsqueeze(dim=1)
                sources = torch.cat([other.unsqueeze(dim=1), vocals], dim=1)
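
A note on shapes, as a sketch rather than the repo's exact code: after this merge both branches leave sources with two stems, so model.sources in the config must also list exactly two targets for the downstream shape check and loss to line up:

    # Assumed shapes, with B = batch, C = channels, T = samples:
    #   train: sources [B, 4, C, T] -> merged above to [B, 2, C, T]
    #   valid: sources [B, 5, C, T] (mixture + 4 stems) -> [B, 2, C, T]
    estimate = self.model(mix)               # [B, len(model.sources), C, T]
    assert estimate.shape == sources.shape   # needs len(model.sources) == 2
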
ari-ruokamo commented 2 months ago

> It’s better to achieve the separation of vocals and other through modifications in the solver rather than by overlaying the data, unless you only have audio for "vocals" and "other".

-- clip --

Hmmm, the assertion on the next line fires because the sources and estimates tensors differ in size, and the computation wouldn't pass the subsequent spec_rmse_loss(...) call.

ari-ruokamo commented 2 months ago

OK, I don't know what happened with the initial OV configuration; I checked and reset everything, restarted the training run, and this time the training took off differently: model convergence looks promising.

Thank you for all your help @starrytong.

ari-ruokamo commented 2 months ago

I may close this ticket now.