jorritfolmer / TA-dmarc

Add-on for ingesting DMARC aggregate reports into Splunk
https://splunkbase.splunk.com/app/3752
15 stars 8 forks source link

Re-indexing of entire POP3-mailbox after removing some messages #36

Closed COshmyan closed 3 years ago

COshmyan commented 3 years ago

It seems that I found once more bug (at least, in POP3 processing, I did not check the IMAP).

The pop2dir.py script uses the uidl() method to obtain UID's of every e-mail message. The filter_seen_messages() method uses these UID's to define: if these messages were processed before or not. The same UID's are used also in other methods (get_dmarc_message_bodies() to retrieve the content of message, save_reports_from_message_bodies() to save the attachment to the file, save_check_point() to save checkpoint for this message).

The problem is the following. Lines returned by uidl() method consist of 2 parts (separated by a space). The second part is really UID (uniq message identifier, constant for this message). However, the first part is just a message sequence number (starting from 1): this number is valid for the current POP3 session only. It can be used in this session only (to retrieve or delete some specific message). Really, the UIDL command is used to match the UID's (permanent for each message) to their numbers (specific for this session only).

In result, when some messages deleted in the mailbox (usually, the oldest), the UIDL command will return an absolute different result in the next POP3 session (as all the rest of e-mails in mailbox will be re-numbered again, starting from 1).

By the way, the get_dmarc_message_bodies() method uses the only first part (sequence number of message) calling the retr() method (it's OK). However, for making a decision about message uniqueness, it's necessary to take into account the second part of UIDL only (really message UID). The first part (specific for the current POP3-session) should be ignored.

It's strange for me that this evident problem was not discovered up to now. If you agree, I'll try to prepare a patch for this problem.

jorritfolmer commented 3 years ago

Ha! Thanks again! I think you are the first serious user of the POP3 input! I usually only test the happy flow, manually, once, after changes. I'd be happy to prepare a new version next weekend if you provide a patch.

COshmyan commented 3 years ago

screenshot-2021-05-27_02 This screenshot illustrates this problem. The e-mail with UID="13B05D596B7336601AB00E32A0B339DB" was a number 10 in previous POP3 sessions. However, when 9 oldest messages was deleted from the mailbox, all the rest of messages were re-numbered; and the same message became number 1. As the entire UIDL (i.e. number + UID) was used as a uniqueness characteristic, it has been processed again (together with all other messages in mailbox).

COshmyan commented 3 years ago

OK, thanks. Patch is ready (2 lines to fix); but I need to test it. Probably, I'll publish it tomorrow.

COshmyan commented 3 years ago

OK, thanks. Patch is ready (2 lines to fix); but I need to test it. Probably, I'll publish it tomorrow.

Tested successfully (the pull request is here: https://github.com/jorritfolmer/TA-dmarc/pull/37). Unfortunately, I did not found how to edit my previous pull request; but this patch is very simple, so I'm sure there will be no problem to consolidate both patches together.

jorritfolmer commented 3 years ago

Merged!