klay-music / klay-beam

Our Apache Beam Transforms and Pipelines
0 stars 0 forks source link

Pass data between pipeline steps using named tuples or dataclasses #41

Open mxkrn opened 11 months ago

mxkrn commented 11 months ago

Currently we're passing data around between pipeline steps using anonymous tuples. This is somewhat brittle and could be improved using at least named tuples, potentially even dataclasses if more involved data types and potential validation are required. These will also act as implicit documentation.

To avoid circular imports these named tuples should probably all live in one place i.e. klay_beam/src/klay_beam/types.py

CharlesHolbrow commented 9 months ago

This isn't going to fit into the initial release, but let's look at it again in #41. Note that jobs pin a klay_beam version, we can more safely make breaking changes as long as klay_beam respects semver. Below is the original description from #33

By convention, klay beam passes (filename, a, sr) tuples between most DoFns where:

  • filename is a string
  • a is a pytorch Tensor (although LoadWithPytorch does have the optional numpy=True argument)
  • sr is an int

The exception is when passing an in-memory audio file, (filename, io.BytesIO), which does not include a sample rate.

Because this refactor involves changing the message passing convention mentioned above anyway (#31), we should consider #6 again. There some situations where it would be nice to include additional optional information in the objects passed between DoFns. Right now SkipCompleted needs to "know" information about the target file extension. If a downstream transform (such as ExtractChromaFeatures) wants to set the file extension, it will make it very easy for the user to forget to update the SkipCompleted new_suffix argument.

This is just one example of when it would be nice to forward additional information beyond just (fn, a, sr) in messages.

Perhaps a serializable KlayBeamAudioMessage class with named properties instead of just a tuple (or named tuple) would work well.

cyrusvahidi commented 9 months ago

Adding to this, in each job's transforms.process, we are hardcoding the audio_suffix to remove from the filename.