jtablesaw / tablesaw

Java dataframe and visualization library
https://jtablesaw.github.io/tablesaw/
Apache License 2.0
3.53k stars 641 forks

Duplicate Rows May Remain After dropDuplicateRows Due to Early Return in isDuplicate #1248

Open vlevy-pci opened 7 months ago

vlevy-pci commented 7 months ago

Description: When using dropDuplicateRows to eliminate duplicate entries from a table, I observed that duplicates were still present in the output. Upon investigation, the root cause was identified in the isDuplicate function. This function is designed to iterate over rows that share a hash with the row being evaluated to determine if it is a duplicate. However, it incorrectly returns false (indicating the row is unique) during the first iteration if the first checked row does not match, without examining the remaining rows.

Expected Behavior: The isDuplicate function should only return false after all rows with the matching hash have been checked and none are found to be identical to the row being evaluated. This ensures that a row is only considered unique if it has been verified against all potential duplicates.

Actual Behavior: The function returns false prematurely after comparing with the first row that shares a hash, potentially leaving unexamined duplicates in the table.

Resolution: In my own project, I resolved the issue by modifying isDuplicate so that it completes its iteration over all rows with a matching hash before deciding that the row is not a duplicate. With this change, dropDuplicateRows correctly removed all duplicates from the table.
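For illustration, here is a minimal sketch of the corrected control flow. It uses plain Java collections rather than Tablesaw's own types, so the class name, method signatures, and row representation (a `List<Object>` per row) are all hypothetical and not the library's actual code. The example relies on the fact that the strings "Aa" and "BB" have the same `hashCode` in Java, which puts two different rows in the same hash bucket.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the table logic; not Tablesaw's implementation.
final class DedupSketch {

    // Returns true only if some previously kept row in the same hash bucket equals `row`.
    static boolean isDuplicate(
            List<List<Object>> kept, Map<Integer, List<Integer>> buckets, List<Object> row) {
        List<Integer> candidates = buckets.get(row.hashCode());
        if (candidates == null) {
            return false; // no kept row shares this hash
        }
        for (int index : candidates) {
            if (kept.get(index).equals(row)) {
                return true; // found an identical row
            }
            // A mismatch here is not conclusive: keep checking the rest of the bucket.
        }
        return false; // every candidate was checked and none matched
    }

    // Keeps the first occurrence of each distinct row, preserving order.
    static List<List<Object>> dropDuplicateRows(List<List<Object>> rows) {
        List<List<Object>> kept = new ArrayList<>();
        Map<Integer, List<Integer>> buckets = new HashMap<>();
        for (List<Object> row : rows) {
            if (!isDuplicate(kept, buckets, row)) {
                buckets.computeIfAbsent(row.hashCode(), h -> new ArrayList<>()).add(kept.size());
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "Aa" and "BB" collide, so all three rows land in one bucket.
        List<List<Object>> rows =
                List.of(List.of("Aa", 1), List.of("BB", 1), List.of("BB", 1));
        System.out.println(dropDuplicateRows(rows)); // prints [[Aa, 1], [BB, 1]]
    }
}
```

With the early return described above, the third row would be compared only against the first row in its bucket ("Aa", 1), judged unique, and kept, leaving the duplicate ("BB", 1) in the output.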

frankwondon commented 7 months ago

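This is an automatic vacation reply from QQ Mail. Hello, I am currently on vacation and unable to reply to your email personally. I will reply as soon as possible after my vacation ends.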

frankzengjj commented 3 months ago

Has this issue been taken? If not, I would like to work on it.

vlevy-pci commented 3 months ago

Hi Frank,

I wrote a fix for my project, but I have not submitted a PR for it. Please feel free to take it over. Hopefully it will be straightforward to work from my description, but if you want my version as a reference, you are welcome to it.

Best regards, Vic
