lisad / phaser

The missing layer for complex data batch integration pipelines
MIT License

DataFramePhase must be subclassed ... seems annoying #80

Closed lisad closed 7 months ago

lisad commented 8 months ago

I went back to create a DataFramePhase and was surprised by something we'd done just a few weeks ago: we have to subclass DataFramePhase to override 'df_transform'.

We should have a way to initialize a DataFramePhase with a function passed to the constructor, which then gets run in df_transform:

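def explode_language_list(df, context=None):
    # Turn the comma-separated string into a list, one list per row
    df['languages'] = df['languages'].str.split(',')
    # Emit one row per language, then rename the column to the singular
    df = df.explode('languages')
    return df.rename(columns={'languages': 'language'})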

my_phase = DataFramePhase(step=explode_language_list)
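
To make the proposal concrete, here is a rough sketch of how a constructor-supplied step could be wired into df_transform. `SketchDataFramePhase` is a hypothetical stand-in written for illustration, not phaser's actual class, and the sample data is made up; the real DataFramePhase may handle context and validation differently.

```python
import pandas as pd


class SketchDataFramePhase:
    """Hypothetical stand-in for DataFramePhase; not phaser's real class."""

    def __init__(self, step=None):
        # `step` is the proposed constructor argument: a function that
        # takes (df, context) and returns a DataFrame.
        self.step = step

    def df_transform(self, df, context=None):
        # Subclasses could still override this method; without an override,
        # the function passed to the constructor is run instead.
        if self.step is None:
            raise NotImplementedError("Pass step= or override df_transform")
        return self.step(df, context=context)


def explode_language_list(df, context=None):
    df = df.copy()  # avoid mutating the caller's DataFrame
    df['languages'] = df['languages'].str.split(',')
    df = df.explode('languages')
    return df.rename(columns={'languages': 'language'})


my_phase = SketchDataFramePhase(step=explode_language_list)
result = my_phase.df_transform(
    pd.DataFrame({'name': ['Ana', 'Bo'], 'languages': ['en,fr', 'de']})
)
```

With this shape, subclassing stays available for anyone who needs it, but the common case needs no new class.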

I'm also bending somewhat on the idea that only one step is allowed. Somebody will often have a list of steps they want to run on DataFrames, especially if they are migrating from a pandas-oriented pipeline or trying to automate a Jupyter notebook's worth of work. If we allow only one step, I'm pretty sure people will just put all of their existing logical steps into that one step.

On discussion, we agreed we should go back in this direction: not only because DataFrame work might have multiple steps that get passed into the constructor, but also to allow more declarative coding rather than subclassing.
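
A minimal sketch of the multi-step, declarative style, assuming a `steps` list argument and a simple run-in-order policy. `MultiStepPhaseSketch` and the three step functions are hypothetical names for illustration only; how phaser actually names the argument or threads context through steps is not shown here.

```python
import pandas as pd


class MultiStepPhaseSketch:
    """Hypothetical sketch; phaser's real DataFramePhase may differ."""

    def __init__(self, steps=None):
        # Assumed argument: an ordered list of functions, each taking
        # (df, context) and returning a DataFrame.
        self.steps = list(steps or [])

    def df_transform(self, df, context=None):
        # Run each step in order, feeding its output to the next step.
        for step in self.steps:
            df = step(df, context=context)
        return df


def split_languages(df, context=None):
    return df.assign(languages=df['languages'].str.split(','))


def explode_languages(df, context=None):
    exploded = df.explode('languages', ignore_index=True)
    return exploded.rename(columns={'languages': 'language'})


def strip_whitespace(df, context=None):
    return df.assign(language=df['language'].str.strip())


phase = MultiStepPhaseSketch(
    steps=[split_languages, explode_languages, strip_whitespace]
)
cleaned = phase.df_transform(pd.DataFrame({'languages': ['en, fr', 'de']}))
```

Each notebook cell or pandas operation can become its own small function, so the phase reads as a list of named steps rather than one large method.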