move-coop / parsons

A python library of connectors for the progressive community.
https://www.parsonsproject.org/

Add "deduplicate" helper method #820

Closed (shaunagm closed 1 year ago)

shaunagm commented 1 year ago

It would be good to expose basic functionality on the Parsons table that lets people deduplicate their data. Currently, a lot of folks convert the Parsons table to a Pandas dataframe to accomplish this, which seems wasteful in terms of developer time and could also hurt performance.

I'm imagining something that looks like this:

parsons_table.deduplicate(keys, sort)

Where you can give no keys, one key, or several to deduplicate by, and indicate whether the data is already sorted or should be sorted (basically a wrapper on petl's presorted parameter).

Essentially it would be a wrapper around the petl function that accomplishes this, i.e.:

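import petl as etl
from parsons import Table

# `people` is an existing Parsons Table; unwrap it to the underlying petl table
petl_table = people.to_petl()
# Drop rows that are exact duplicates of another row
result = etl.transform.dedup.distinct(petl_table)
# Wrap the petl result back into a Parsons Table
Table(result)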
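To make the proposed semantics concrete (no keys, one key, or several, plus a presorted flag), here is a minimal plain-Python sketch. It is not Parsons or petl code: the `deduplicate` function, its `keys` and `presorted` parameters, and the sample rows are all hypothetical. It assumes a sort-then-group approach that keeps the first row seen for each key, and that every row has the same columns in the same order.

```python
from itertools import groupby


def deduplicate(rows, keys=None, presorted=False):
    """Hypothetical stand-in for the proposed Table.deduplicate (illustration only).

    rows: list of dicts sharing the same columns.
    keys: None (compare whole rows), a column name, or a list of column names.
    presorted: if True, skip sorting and assume duplicates are already adjacent.
    """
    if keys is None:
        # No keys given: two rows are duplicates only if every column matches.
        def key_fn(row):
            return tuple(row.items())
    else:
        key_list = [keys] if isinstance(keys, str) else list(keys)

        def key_fn(row):
            return tuple(row[k] for k in key_list)

    if not presorted:
        # sorted() is stable, so the first occurrence of each key stays first.
        rows = sorted(rows, key=key_fn)

    # groupby only merges adjacent rows, hence the sort above.
    return [next(group) for _, group in groupby(rows, key=key_fn)]


# Made-up sample data
people = [
    {"id": 1, "name": "Ada", "city": "Paris"},
    {"id": 2, "name": "Bo", "city": "Lima"},
    {"id": 1, "name": "Ada", "city": "Paris"},
    {"id": 1, "name": "Ada", "city": "Oslo"},
]

whole_row = deduplicate(people)                      # exact duplicates removed
by_id = deduplicate(people, keys="id")               # one row per id
by_id_city = deduplicate(people, keys=["id", "city"])  # one row per (id, city)
```

With `presorted=True` the sort is skipped, so duplicates that are not adjacent would survive; that trade-off is the reason to expose the flag rather than always sorting.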

Medium priority: this seems like it would be very helpful, but it isn't blocking anything.

jafayer commented 1 year ago

Hi there! I just submitted a PR for this 🙂. Happy to respond to any feedback!