This is a common need. The builtin would look something like this in builtin_steps:
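def delete_duplicate_rows(columns=None):
    """
    This step factory will build a step to delete rows that are duplicates of each other, based on every value
    or based on a list of columns or column names.
    :param columns: optional list of columns or column names to compare; if None, every value is compared
    :return: the step function
    """
    @batch_step(check_size=False)
    def delete_duplicate_rows_step(batch, context, **kwargs):
        # TODO: Need an algorithm that is reasonable for now, and doesn't depend on pandas
        # Placeholder: the batch is returned unchanged until that algorithm exists
        return batch
    return delete_duplicate_rows_step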
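
For the missing algorithm, one pandas-free option is to track a set of keys already seen and keep only the first row for each key. The sketch below is only an illustration: `dedupe_rows` is a hypothetical helper, not something that exists in builtin_steps, and it assumes a batch is a list of rows, where each row is either a sequence (with `columns` as indexes) or a dict (with `columns` as names), and that the compared values are hashable.

```python
def dedupe_rows(rows, columns=None):
    """Hypothetical helper: drop later duplicates, keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        if columns is not None:
            # Compare only the requested columns (indexes for sequences, names for dicts)
            key = tuple(row[c] for c in columns)
        elif isinstance(row, dict):
            # Sort items by key so that dict insertion order does not affect the comparison
            key = tuple(sorted(row.items()))
        else:
            # Compare every value in the row
            key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [[1, "a"], [1, "a"], [1, "b"], [2, "a"]]
print(dedupe_rows(rows))               # compare every value
print(dedupe_rows(rows, columns=[0]))  # compare only the first column
```

This runs in a single pass and preserves the original row order. If something like this were adopted, the step body could return `dedupe_rows(batch, columns)` in place of `return batch`, assuming the batch really is a plain list of rows.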