Open mxkrn opened 11 months ago
This isn't going to fit into the initial release, but let's look at it again in #41. Note that jobs pin a klay_beam
version, we can more safely make breaking changes as long as klay_beam respects semver
. Below is the original description from #33
By convention, klay beam passes
(filename, a, sr)
tuples between mostDoFn
s where:
filename
is a stringa
is a pytorch Tensor (althoughLoadWithPytorch
does have the optionalnumpy=True
argument)sr
is an intThe exception is when passing an in-memory audio file,
(filename, io.BytesIO)
, which does not include a sample rate.Because this refactor involves changing the message passing convention mentioned above anyway (#31), we should consider #6 again. There some situations where it would be nice to include additional optional information in the objects passed between
DoFn
s. Right nowSkipCompleted
needs to "know" information about the target file extension. If a downstream transform (such asExtractChromaFeatures
) wants to set the file extension, it will make it very easy for the user to forget to update the SkipCompletednew_suffix
argument.This is just one example of when it would be nice to forward additional information beyond just
(fn, a, sr)
in messages.Perhaps a serializable
KlayBeamAudioMessage
class with named properties instead of just a tuple (or named tuple) would work well.
Adding to this, in each job's transforms.process
, we are hardcoding the audio_suffix
to remove from the filename.
Currently we're passing data around between pipeline steps using anonymous tuples. This is somewhat brittle and could be improved using at least named tuples, potentially even dataclasses if more involved data types and potential validation are required. These will also act as implicit documentation.
To avoid circular imports these named tuples should probably all live in one place i.e.
klay_beam/src/klay_beam/types.py