lisad / phaser

library for batch-oriented complex data integration pipelines
MIT License

Add a builtin step to delete duplicate rows #146

Open lisad opened 2 weeks ago

lisad commented 2 weeks ago

This is a common need. The builtin would look something like this in builtin_steps:

def delete_duplicate_rows(columns=None):
    """
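    This step factory builds a step that deletes rows that are duplicates of each other, comparing either
    every value in the row or only the values in a given list of columns or column names.
    :param columns: optional list of columns or column names to compare; if None, every value is compared
    :return: a batch step that removes duplicate rows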
    """
    @batch_step(check_size=False)
    def delete_duplicate_rows_step(batch, context, **kwargs):
        # Need an algorithm that is reasonable for now, and doesn't depend on pandas
        return batch

    # The factory has to return the step it builds
    return delete_duplicate_rows_step
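
One possible pandas-free algorithm, as a minimal sketch rather than the final implementation: keep a set of keys already seen and keep only the first row for each key. This assumes a batch is a list of dict rows with hashable values, which may not match how phaser represents batches; `drop_duplicate_rows` is a hypothetical helper name, not an existing phaser function.

```python
def drop_duplicate_rows(batch, columns=None):
    """Return a new list keeping the first occurrence of each distinct row.

    Hypothetical helper: assumes `batch` is a list of dicts whose values
    are hashable (strings, numbers, None, tuples).
    """
    seen = set()
    unique_rows = []
    for row in batch:
        if columns is None:
            # Compare on every key/value pair; frozenset ignores key order.
            key = frozenset(row.items())
        else:
            # Compare only on the chosen columns; a missing column counts as None.
            key = tuple(row.get(col) for col in columns)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


rows = [
    {"id": 1, "name": "a"},
    {"id": 1, "name": "a"},
    {"id": 1, "name": "b"},
    {"id": 2, "name": "b"},
]
all_values = drop_duplicate_rows(rows)             # drops only the exact duplicate
by_id = drop_duplicate_rows(rows, columns=["id"])  # keeps the first row per id
```

This runs in a single pass and preserves row order. The inner step could return the result of a helper like this in place of `return batch`. Whether to keep the first or the last occurrence of a duplicate is a design choice left open here.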