mixpanel / mixpanel-utils

Deduplicate People from CSV preserves record with oldest `last_seen` #27

Closed dylan-kinsa closed 3 years ago

dylan-kinsa commented 3 years ago

Expected Behavior

Per this docstring, deduplicate_people is expected to keep the most recently seen record when deduplicating and delete the others:

Determines duplicate profiles based on the value of a specified property. The profile with the latest $last_seen is kept and the others are deleted. Optionally adds any properties from the profiles to be deleted to the remaining profile using $set_once. Backup files are always created.

Actual Behavior

The opposite behavior occurs: the record with the oldest $last_seen is preserved and the most recent one is deleted. Using the following CSV data:

$distinct_id,$properties.$name,$properties.$email,$properties.$last_seen
id-2,undefined,undefined,2021-01-13T04:28:48
id-1,undefined,undefined,2021-01-13T04:28:34

Note that the line item with $distinct_id: id-2 should have been preserved, but its properties were merged into id-1 and it was then deleted.
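For reference, the selection rule described in the docstring can be sketched independently of mixpanel-utils. This is only an illustration, not the library's implementation: `pick_survivors` is a hypothetical helper, it assumes duplicates are grouped on a single CSV column (here `$properties.$email`), and it assumes `$last_seen` is an ISO 8601 timestamp as in the data above.

```python
import csv
import io
from datetime import datetime

# Same rows as in the report above.
CSV_DATA = """\
$distinct_id,$properties.$name,$properties.$email,$properties.$last_seen
id-2,undefined,undefined,2021-01-13T04:28:48
id-1,undefined,undefined,2021-01-13T04:28:34
"""

LAST_SEEN = "$properties.$last_seen"


def pick_survivors(rows, key, last_seen=LAST_SEEN):
    """Hypothetical helper: for each value of `key`, keep the row
    whose last_seen timestamp is the most recent."""
    survivors = {}
    for row in rows:
        seen = datetime.fromisoformat(row[last_seen])
        current = survivors.get(row[key])
        if current is None or seen > datetime.fromisoformat(current[last_seen]):
            survivors[row[key]] = row
    return survivors


rows = list(csv.DictReader(io.StringIO(CSV_DATA)))
survivors = pick_survivors(rows, key="$properties.$email")
kept_ids = [row["$distinct_id"] for row in survivors.values()]
print(kept_ids)
```

Under this rule the result does not depend on row order, so id-2 is the record expected to survive for either ordering of the two rows.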

dylan-kinsa commented 3 years ago

After more tests, when using a CSV as the input, it looks like it treats the first record as the duplicate to be removed. So in this configuration, id-2 would have been kept:

$distinct_id,$properties.$name,$properties.$email,$properties.$last_seen
id-1,undefined,undefined,2021-01-13T04:28:34
id-2,undefined,undefined,2021-01-13T04:28:48

This should probably be documented for the profiles param.
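Until that is documented, one possible workaround is to sort the CSV by $last_seen before passing it in, so that the most recent record comes last. The sketch below uses only the standard library; `sort_by_last_seen` is a hypothetical helper, and it relies on the ordering behavior observed above with two rows, which has not been verified for larger groups of duplicates.

```python
import csv
import io
from datetime import datetime

LAST_SEEN = "$properties.$last_seen"


def sort_by_last_seen(csv_text, last_seen=LAST_SEEN):
    """Hypothetical helper: return the CSV text with rows ordered
    from oldest to newest last_seen (most recent record last)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = sorted(reader, key=lambda r: datetime.fromisoformat(r[last_seen]))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


original = """\
$distinct_id,$properties.$name,$properties.$email,$properties.$last_seen
id-2,undefined,undefined,2021-01-13T04:28:48
id-1,undefined,undefined,2021-01-13T04:28:34
"""

sorted_csv = sort_by_last_seen(original)
print(sorted_csv)
```

The sorted text can then be written to a file and used as the CSV input in place of the original.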

jaredmixpanel commented 3 years ago

@dylan-kinsa I was not able to reproduce this behavior. I believe you were able to work through this issue with my colleague Sam on the support team, so I'm going to go ahead and close this, but feel free to re-open with a more detailed description of how exactly to reproduce if you continue to experience this problem.