Open turian opened 10 months ago
One other thing that isn't clear from the documentation:
If two items tie, e.g. have the same datestamp, is a tiebreak made? This would be logical, but a strict reading of the documentation would be that BOTH emails are selected.
Meaning, if 1A and 1B have identical timestamps, are BOTH selected and acted upon? Or just one, for actions that typically select one message?
Just to follow up, I still could not determine the behavior. I used GPT4 and plugged in each file, trying to find the code that would answer my question. However, I was unable to determine which code directly addresses the handling of unique emails in the deduplication process or the resolution of ties in duplicate selection.
So I am constructing a toy mbox to understand the behavior, but now I am more confused than ever:
From test@example.com Thu Jan 1 00:00:00 2021
Subject: Duplicate Email 1
Date: Thu, 1 Jan 2021 00:00:00 +0000
From: test@example.com
To: recipient@example.com
This is a duplicate email.
From test@example.com Thu Jan 1 00:00:00 2021
Subject: Duplicate Email 1
Date: Thu, 1 Jan 2021 00:00:00 +0000
From: test@example.com
To: recipient@example.com
This is a duplicate email.
From test@example.com Thu Jan 1 00:01:00 2021
Subject: Slightly Different Email
Date: Thu, 1 Jan 2021 00:01:00 +0000
From: test@example.com
To: recipient@example.com
This email is slightly different.
From test@example.com Thu Jan 1 00:02:00 2021
Subject: Unique Email
Date: Thu, 1 Jan 2021 00:02:00 +0000
From: test@example.com
To: recipient@example.com
This is a unique email.
Giving:
Step #5 - Report and statistics

Mails       Metric  Description
----------  ------  -----------
Found       4       Total number of mails encountered from all mail sources.
Rejected    0       Number of mails rejected individually because they were unparseable or did not have enough metadata to compute hashes.
Retained    4       Number of valid mails parsed and retained for deduplication.
Hashes      3       Number of unique hashes.
Unique      0       Number of unique mails (which where automatically added to selection).
Duplicates  4       Number of duplicate mails (sum of mails in all duplicate sets with at least 2 mails).
Skipped     4       Number of mails ignored in the selection step because the whole set they belong to was skipped.
Discarded   0       Number of mails discarded from the final selection.
Selected    0       Number of mails kept in the final selection on which the action will be performed.
Copied      0       Number of mails copied from their original mailbox to another.
Moved       0       Number of mails moved from their original mailbox to another.
Deleted     0       Number of mails deleted from their mailbox in-place.

Duplicate sets      Metric  Description
------------------  ------  -----------
Total               3       Total number of duplicate sets.
Single              0       Total number of sets containing only a single mail with no applicable strategy. They were automatically kept in the final selection.
Skipped - Encoding  0       Number of sets skipped from the selection process because they had encoding issues.
Skipped - Size      0       Number of sets skipped from the selection process because they were too dissimilar in size.
Skipped - Content   0       Number of sets skipped from the selection process because they were too dissimilar in content.
Skipped - Strategy  3       Number of sets skipped from the selection process because the strategy could not be applied.
Deduplicated        0       Number of valid sets on which the selection strategy was successfully applied.
This suggests:
Anyway, what is clear is that all emails are selected, and move-discarded thus moved none. So move-selected should move ALL of them, right? But when I run the same command with move-selected, nothing happens and the mbox is unchanged!
Step #3 - Select mails in each group
info: select-newest strategy will be applied on each duplicate set to select candidates.
info: 2 mails sharing hash 05a3285c1254315fa50966ae1bed99e47ab51a592d9e728a7a70e526
info: Check mail differences are below the thresholds.
info: Select all mails sharing the newest 1609459200 timestamp...
warning: Skip set: all 2 mails within were selected. The strategy criterion was not able to discard some.
info: Check mail differences are below the thresholds.
info: Select all mails sharing the newest 1609459260 timestamp...
warning: Skip set: all 1 mails within were selected. The strategy criterion was not able to discard some.
info: Check mail differences are below the thresholds.
info: Select all mails sharing the newest 1609459320 timestamp...
warning: Skip set: all 1 mails within were selected. The strategy criterion was not able to discard some.
Step #4 - Perform action on selected mails
info: Perform move-selected action...
warning: No mail selected to perform action on.
Step #5 - Report and statistics

Mails       Metric  Description
----------  ------  -----------
Found       4       Total number of mails encountered from all mail sources.
Rejected    0       Number of mails rejected individually because they were unparseable or did not have enough metadata to compute hashes.
Retained    4       Number of valid mails parsed and retained for deduplication.
Hashes      3       Number of unique hashes.
Unique      0       Number of unique mails (which where automatically added to selection).
Duplicates  4       Number of duplicate mails (sum of mails in all duplicate sets with at least 2 mails).
Skipped     4       Number of mails ignored in the selection step because the whole set they belong to was skipped.
Discarded   0       Number of mails discarded from the final selection.
Selected    0       Number of mails kept in the final selection on which the action will be performed.
Copied      0       Number of mails copied from their original mailbox to another.
Moved       0       Number of mails moved from their original mailbox to another.
Deleted     0       Number of mails deleted from their mailbox in-place.

Duplicate sets      Metric  Description
------------------  ------  -----------
Total               3       Total number of duplicate sets.
Single              0       Total number of sets containing only a single mail with no applicable strategy. They were automatically kept in the final selection.
Skipped - Encoding  0       Number of sets skipped from the selection process because they had encoding issues.
Skipped - Size      0       Number of sets skipped from the selection process because they were too dissimilar in size.
Skipped - Content   0       Number of sets skipped from the selection process because they were too dissimilar in content.
Skipped - Strategy  3       Number of sets skipped from the selection process because the strategy could not be applied.
Deduplicated        0       Number of valid sets on which the selection strategy was successfully applied.
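To make my reading of the move-selected log concrete, here is a minimal Python sketch of the behavior I think I am observing. This is inferred only from the log messages ("Select all mails sharing the newest ... timestamp" and "Skip set: all N mails within were selected"), not from the tool's source code, and the function and variable names are my own invention.

```python
def select_newest(duplicate_set):
    """Model of what the log seems to describe (an assumption, not the tool's code).

    duplicate_set is a list of (label, timestamp) pairs sharing one hash.
    Returns the mails the action would apply to, or [] if the set is skipped.
    """
    newest = max(timestamp for _, timestamp in duplicate_set)
    # Ties are not broken: every mail sharing the newest timestamp is selected.
    selected = [mail for mail in duplicate_set if mail[1] == newest]
    # If nothing could be discarded, the whole set is skipped.
    if len(selected) == len(duplicate_set):
        return []
    return selected


# The three sets from my toy mbox, with the timestamps shown in the log.
toy_sets = [
    [("duplicate-1a", 1609459200), ("duplicate-1b", 1609459200)],
    [("slightly-different", 1609459260)],
    [("unique", 1609459320)],
]

final_selection = [mail for s in toy_sets for mail in select_newest(s)]
print(final_selection)  # [] which matches "No mail selected to perform action on."
```

If this model is right, a tie is not broken at all: the tied set is skipped entirely, and so are single-mail sets, which would explain why both move-selected and move-discarded leave the mbox unchanged.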
Is your feature request related to a problem? Please describe.
I have several mailboxes, with many duplicates.
I want to create a new mailbox, with all de-duplicated mail from the old mailboxes, including non-duplicates.
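To be precise about the result I am after, here is a rough sketch using only Python's standard `mailbox` module, independent of this tool. The header list used for hashing, the choice of SHA-224, and the names `mail_key` and `merge_deduplicated` are my own assumptions for illustration; the first copy encountered wins, and unique mails are kept too.

```python
import hashlib
import mailbox

# Assumption: these headers are enough to identify a duplicate.
HASH_HEADERS = ("Date", "From", "To", "Subject")


def mail_key(message):
    """Build a hash from a fixed set of headers (hypothetical scheme)."""
    canonical = "\n".join(f"{h}: {message.get(h, '')}" for h in HASH_HEADERS)
    return hashlib.sha224(canonical.encode("utf-8")).hexdigest()


def merge_deduplicated(source_paths, target_path):
    """Copy one mail per hash from all sources into a new mbox.

    Unique mails are copied as well. Returns the number of mails written.
    """
    target = mailbox.mbox(target_path)
    seen = set()
    kept = 0
    try:
        for path in source_paths:
            source = mailbox.mbox(path, create=False)
            try:
                for message in source:
                    key = mail_key(message)
                    if key in seen:
                        continue  # later copies (including ties) are dropped
                    seen.add(key)
                    target.add(message)
                    kept += 1
            finally:
                source.close()
        target.flush()
    finally:
        target.close()
    return kept
```

In terms of my example below, this would write 1A (or 1B, whichever is read first) and 2 to the new mailbox.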
Documentation confusion
I'm puzzling over the documentation, because it is not really clear what "selected" and "discarded" mean.
Let's say there are emails 1A, 1B, and 2. (1A and 1B are duplicates in different mailboxes.)
Whatever strategy I choose, 1A and 1B are compared and one is selected and the other is discarded.
But what happens to 2?
a) It has no hash matches, so it is never compared or selected, and isn't copied to my new mailbox. Then I am stuck on how to solve my problem.
b) There is always a "selected" mail, even if it is unique and has no hash matches.
Can you please clarify? (I also think a documentation update would help. I read over the main docs and didn't understand, which is why I am posting.)