Job stem classifier - Githubissues

mxkrn commented 10 months ago

This PR contains the new stem classifier job, which classifies glucose-karaoke files into the stem groups defined by source separation.

[x] ClassifyAudioStem is a DoFn that accepts a ReadableFile and identifies which StemGroup it belongs to
[x] A number of utility methods which are unit-tested in tests/test_transforms.py
[x] copy_file first checks whether a file with that suffix already exists. If it does, it takes the latest stem enumeration, increments it, and writes a file with the new incremented suffix, e.g. other -> other-1 -> other-2.
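
For readers who want the suffix logic spelled out, here is a rough sketch of how the increment could work. The helper name `next_stem_suffix` and its signature are hypothetical and are not taken from this PR; it assumes the suffixes already present for a track are available as plain strings.

```python
import re
from typing import Iterable


def next_stem_suffix(stem: str, existing: Iterable[str]) -> str:
    """Return the next free suffix for `stem` (hypothetical helper).

    `existing` holds the suffixes already written for a track,
    e.g. ["other", "other-1", "vocals"].
    """
    # Match "other" or "other-<n>", but not "other-extra" or "others".
    pattern = re.compile(rf"^{re.escape(stem)}(?:-(\d+))?$")
    highest = None
    for name in existing:
        match = pattern.match(name)
        if match:
            # A bare "other" counts as enumeration 0.
            number = int(match.group(1)) if match.group(1) else 0
            highest = number if highest is None else max(highest, number)
    if highest is None:
        # Nothing with this suffix exists yet, so use it unchanged.
        return stem
    return f"{stem}-{highest + 1}"
```

Comparing the enumeration numerically rather than as a string keeps `other-10` ordered after `other-9`.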

TODO

[x] Custom SkipCompleted which checks if a track directory exists. The original SkipCompleted doesn't work because we're dynamically updating the suffix based on the classified stem group. Since we're also incrementing based on a file existing, one option we have left is to check the existence of the track directories. The files for the tracks that do not yet exist are put up for classification. The only assumption we're making here is that when a track directory exists, it's complete.
[x] Wildcard characters, i.e. ? and *, are problematic for parsing the filenames. We need to strip these out of the track name before writing.
[ ] Dockerfile
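
The directory-based skip and the wildcard stripping from the list above can be sketched roughly as follows. This is only an illustration under assumptions: `strip_wildcards` and `tracks_to_classify` are hypothetical names, the sketch uses the local filesystem rather than a Beam transform, and it assumes the track directory is named after the stripped track name.

```python
import os
import tempfile


def strip_wildcards(track_name: str) -> str:
    """Remove the glob wildcards ? and * from a track name (hypothetical helper)."""
    return track_name.replace("?", "").replace("*", "")


def tracks_to_classify(track_names, output_root):
    """Return the tracks whose output directory does not exist yet.

    Assumption carried over from the discussion: if a track directory
    exists, the track is treated as complete and is skipped.
    """
    pending = []
    for name in track_names:
        track_dir = os.path.join(output_root, strip_wildcards(name))
        if not os.path.isdir(track_dir):
            pending.append(name)
    return pending


if __name__ == "__main__":
    # Small local demonstration with a throwaway directory.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "finished track"))
        print(tracks_to_classify(["finished track", "new track?"], root))
```

Because the check is per directory and not per file, a partially written track directory would be skipped too, which is the assumption stated in the SkipCompleted item.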

mxkrn commented 10 months ago

@CharlesHolbrow As mentioned in a comment response, I would agree that this isn't parallelizable unless the additional file matching is built. I don't think that's worth our time right now.

This job is supposed to be reusable but, similar to source separation, I doubt we'll be re-using it very often. I imagine whenever we want to ingest new glucose-karaoke splits we'll want to re-use this job. I guess the main thing that's missing for it to be fully automatic is that the process for generating the stems_dict.json is currently done offline. This can be done in an online manner; it would just require a bit more engineering.

CharlesHolbrow commented 10 months ago

Yea, I think it's fine to compute the stems dictionary offline for now if it helps things go quicker.

CharlesHolbrow commented 10 months ago

No Dockerfile is needed, because we're just running this locally.

I'm moving job-specific .gitignore lines into job directories. This means that the job package directories are portable: we can copy them into a different repository or sub-dir in the future, and the .gitignored files will still be ignored.

klay-music / klay-beam

Job stem classifier #48
