move-coop / parsons

A python library of connectors for the progressive community.
https://www.parsonsproject.org/
Other
261 stars 132 forks source link

Implement, test, and document Parsons Table deduplicate method #842

Closed jafayer closed 1 year ago

jafayer commented 1 year ago

This PR resolves #820.

Summary of changes:

parsons/etl/etl.py: adds a deduplicate method which optionally accepts a key or list of keys to deduplicate on, and an optional flag to indicate whether the method should first sort the table. docs/table.rst: adds deduplicate method to Table documentation. test/test_etl.py: adds several deduplicate unit tests for various scenarios involving zero, one, or multiple keys on both sorted and unsorted tables.

This method is essentially a wrapper around the built-in petl.transform.dedup.distinct method and extends the functionality natively to the Parsons Table class. This solution also heavily references the implementation provided in the original issue by @shaunagm!

One sticky point on terminology

As suggested in the original issue, this method accepts an optional sort parameter to instruct petl (imperatively) whether to first sort the table. However, this is more or less the opposite definition used internally by the petl library. In petl, the same argument is called presorted which (declaratively) describes a table as either presorted or not presorted — if it's not presorted, petl sorts it.

It's a small detail, but anyone who is used to the petl definition may be confused by this. A sort=True value (meaning, "this table is not yet sorted, sort it for me") is the exact opposite of a presorted=True value in petl (meaning, "this table is sorted, don't sort it"). In the code, this is represented by an explicit inversion (presorted=not sort) on line 1186 when it's passed into petl's method. I've also included a comment explaining this for anyone who might be confused by the definitions.

For one, I totally agree that "sort" makes more sense here to instruct the deduplicate method to sort the array, but I'm newer to the community and wanted to note here that it might run against what someone might expect if they're used to interacting with petl and Parsons Tables directly.

Happy to adjust any of the above!

shaunagm commented 1 year ago

Amazing, thank you Josh! I'm just about done for the day but will test this out early next week.

FYI, the build fail seems to be a linting error:

Screenshot from 2023-06-16 16-48-01

jafayer commented 1 year ago

Ahh apologies, it looks like I only ran the linter on the parsons directory. I've corrected the error locally and run the linters again but before I add it to the PR, just want to make sure I'm following whatever conventions work best for the community.

Would you prefer I rebase/squash the change into the initial commit or make another commit on this PR to preserve the history?

Thank you!!

shaunagm commented 1 year ago

Whatever you prefer. We're not particularly careful about the commit history.

jafayer commented 1 year ago

Just pushed a revised version that fixed one bug and updated the documentation to clarify the key behavior in the docstring!

Just flagging that in my initial implementation, I accidentally had the function returning the Parsons Table's internal petl table (i.e. return self.table) instead of the table itself (i.e. return self). I've fixed that in this push! Now you should be able to chain operators so you can do tbl.deduplicate('a').deduplicate('b') if deduplicating on the set of (a,b) isn't what you want.

Let me know if any of the docs or examples I've typed up are unclear, happy to adjust anything! :slightly_smiling_face: