quentinsf / IMAPdedup

IMAP message de-duplicator
https://quentinsf.com/software/imapdedup
GNU General Public License v2.0

Fails on large message count #41

Open vortek opened 6 years ago

vortek commented 6 years ago

It worked for all the folders. Then I did the dry run for the INBOX folder and it found 113000 duplicates. When I remove the -n option, it fails. If I try the dry run again now, it also fails.

$ ./imapdedup.py -s mail.server.com -u user@mail.com -x l
Password:
Spam
Drafts
Deleted Items
Sent
INBOX
$ ./imapdedup.py -s mail.server.com -u user@mail.com -x INBOX
Password: 
There are 170714 messages in INBOX.
No message(s) currently marked as deleted in INBOX
170714 others in INBOX
Traceback (most recent call last):
  File "./imapdedup.py", line 324, in <module>
    main(sys.argv[1:])
  File "./imapdedup.py", line 321, in main
    process(options, mboxes)
  File "./imapdedup.py", line 248, in process
    ms = check_response(server.fetch(message_ids, '(RFC822.HEADER)'))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/imaplib.py", line 456, in fetch
    typ, dat = self._simple_command(name, message_set, message_parts)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/imaplib.py", line 1088, in _simple_command
    return self._command_complete(name, self._command(name, *args))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/imaplib.py", line 912, in _command_complete
    raise self.abort('command: %s => %s' % (name, val))
imaplib.abort: command: FETCH => socket error: EOF

quentinsf commented 6 years ago

Hi João,

It looks as if you may have hit some limit on your server, or maybe it's timing out. I'd need to look more carefully at this and I'm afraid I'm not likely to manage that in the near future.

If you need a temporary fix, can I suggest splitting your inbox into folders, e.g. by year, running the program against each folder, and then (if you really want an inbox that large!) recombining them again?

Best, Quentin

vortek commented 6 years ago

Hello Quentin, How do you suggest that I split the inbox? Thanks!

quentinsf commented 6 years ago

Well, there are ways you could script it, but I would just use an email program to create a new folder, select all the messages in one year, and move them over. Then do the next year...

Depending on your email client, you may be able to do something clever with smart mailboxes to make the selection process easier...
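For anyone who would rather script the split, here is a minimal sketch using Python 3's standard imaplib. The helper names (`year_search_criteria`, `move_year`), the batch size of 500, and the destination folder name are assumptions made up for illustration and are not part of IMAPdedup. Note that SINCE/BEFORE match the server's internal date (usually the arrival time), not the Date header.

```python
import imaplib


def year_search_criteria(year):
    """Build an IMAP SEARCH criterion for messages within one calendar year."""
    # IMAP dates are DD-Mon-YYYY; SINCE is inclusive, BEFORE is exclusive.
    return '(SINCE 01-Jan-%d BEFORE 01-Jan-%d)' % (year, year + 1)


def move_year(server, year, dest_folder, source_folder='INBOX', batch_size=500):
    """Copy one year's messages to dest_folder and delete the originals.

    `server` is assumed to be a logged-in imaplib.IMAP4 / IMAP4_SSL object.
    `dest_folder` is assumed to be a simple name that needs no quoting.
    Returns the number of messages moved.
    """
    server.create(dest_folder)  # the server answers NO if it already exists
    server.select(source_folder)
    typ, data = server.search(None, year_search_criteria(year))
    if typ != 'OK' or not data or not data[0]:
        return 0
    ids = data[0].split()
    # Small batches keep each command short for a server that is struggling.
    for i in range(0, len(ids), batch_size):
        batch = b','.join(ids[i:i + batch_size]).decode('ascii')
        typ, _ = server.copy(batch, dest_folder)
        if typ != 'OK':
            raise RuntimeError('COPY failed for batch starting at %d' % i)
        server.store(batch, '+FLAGS', '\\Deleted')
    # Expunge only once at the end so sequence numbers stay valid in the loop.
    server.expunge()
    return len(ids)
```

Usage would look something like `move_year(server, 2015, 'Archive-2015')` after `server = imaplib.IMAP4_SSL(host)` and `server.login(user, password)`. Try it on a test mailbox first, since it deletes the originals once the copy succeeds.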

vortek commented 6 years ago

Thanks for the tips!

Bill48105 commented 6 years ago

I ran into this as well on an inbox with 300K+ messages. (Don't ask...) The first run was great: it deleted 100K dupes and I was excited, but dupes were still showing up in Roundcube, so I ran it again and got that EOF error on the same fetch-headers line. I changed (RFC822.HEADER) to (BODY.PEEK[HEADER]) and it worked again, for one run. Then came the dreaded EOF error on every run after. So I changed (BODY.PEEK[HEADER]) back to (RFC822.HEADER) and it worked... for one run. After I let it sit a while it worked again, for one run, then EOF. By then it was clear something funky was up, so I dug deeper to narrow it down. I tried many things, including adjusting MAXLINE and wrapping the IMAP commands in try/except in the hope it would continue (it doesn't), but it wasn't until I enabled debugging with imaplib.Debug and m.debug = True that I finally got a big clue as to what was going on:

35:55.56 BYE response: Server shutting down.

So it seems the remote server is shutting down mid-session? That would explain why it works after editing (the time that passed let the server come back online). Note that it also happened on a folder with only 39 messages; I had switched to a folder with fewer messages to try to narrow down the issue. I thought it was a fluke, but I was able to reproduce the shutdown multiple times.

There are many guesses as to what is up, from corrupt messages on the server, to overloading the server, to a bug in Python's imaplib, to who knows what. Clearly there's an issue, but I can't say it's in IMAPdedup (in fact it isn't, in that I get similar issues with other programs/scripts), beyond that it might be helpful if it handled and recovered from this better.

Btw not sure about OP but in my case this is all on InMotion shared business hosting which is Dovecot:

EDIT: Ok seems maybe that's syslog rate limiting in that post so maybe unrelated & weird coincidence.. Little searching & maybe it's rate limiting: "server dovecot: imap(account@tld.com): Server shutting down. in=7140 out=70598" https://www.howtoforge.com/community/threads/server-dovecot-imap-account-tld-com-server-shutting-down-in-7140-out-70598.74887/

If that's the case, maybe we need an option to limit the max # of messages it does at a time and/or add sleeps in the loop to help?
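A rough sketch of that idea in Python 3: fetch headers in fixed-size chunks and pause between chunks. The function names, the default chunk size of 100, and the one-second pause are assumptions for illustration, not existing IMAPdedup options, and whether a pause actually avoids the server shutdown is untested.

```python
import time


def chunks(items, size):
    """Yield successive slices of `items`, each at most `size` long."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def fetch_headers_throttled(server, message_ids, chunk_size=100, pause=1.0,
                            sleep=time.sleep):
    """Fetch headers chunk by chunk, sleeping `pause` seconds between chunks.

    `server` is assumed to be a logged-in imaplib connection with a mailbox
    selected; `message_ids` is a list of message numbers as strings.
    `sleep` is injectable so the pause can be skipped in tests.
    """
    results = []
    for n, chunk in enumerate(chunks(message_ids, chunk_size)):
        if n:
            sleep(pause)  # no pause before the very first request
        typ, data = server.fetch(','.join(chunk), '(BODY.PEEK[HEADER])')
        if typ != 'OK':
            raise RuntimeError('FETCH failed: %r' % (data,))
        results.extend(data)
    return results
```

A smaller `chunk_size` means more round trips but shorter commands, and a longer `pause` trades run time for less load on the server.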

shubhammatta commented 6 years ago

1150749 others in INBOX
  30:37.28 > LCJD5 FETCH 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100 (RFC822.HEADER)
  30:38.69 last 0 IMAP4 interactions:
  30:38.69 > LCJD6 LOGOUT
  30:38.69 last 0 IMAP4 interactions:
Traceback (most recent call last):
  File "imapdedup.py", line 324, in <module>
    main(sys.argv[1:])
  File "imapdedup.py", line 321, in main
    process(options, mboxes)
  File "imapdedup.py", line 248, in process
    ms = check_response(server.fetch(message_ids, '(RFC822.HEADER)'))
  File "/root/daily_build/64_23/4.3.4/SysUtil/Python-2.7.5-cross/install_path_full/lib/python2.7/imaplib.py", line 443, in fetch
  File "/root/daily_build/64_23/4.3.4/SysUtil/Python-2.7.5-cross/install_path_full/lib/python2.7/imaplib.py", line 1070, in _simple_command
  File "/root/daily_build/64_23/4.3.4/SysUtil/Python-2.7.5-cross/install_path_full/lib/python2.7/imaplib.py", line 899, in _command_complete
imaplib.abort: command: FETCH => socket error: EOF

I turned imaplib debugging on. I get that the INBOX has a huge number of mails, but even fetching the first 100 results in a socket error: EOF. Does anyone have any insights?

quentinsf commented 6 years ago

Mmm. Do you have access to the server logs?

The imaplib source says that '"abort" exceptions imply the connection should be reset, and the command re-tried.'

So perhaps that's what we should do (if anyone who can test this would like to submit a pull request!)
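A minimal, untested-against-a-real-server sketch of that reset-and-retry idea in Python 3. `connect` is a hypothetical callable that returns a freshly logged-in imaplib connection; the retry count and delay are arbitrary assumptions. Only `imaplib.IMAP4.abort` is caught, since that is the exception the imaplib comment refers to.

```python
import imaplib
import time


def fetch_with_retry(connect, mailbox, message_set, parts='(RFC822.HEADER)',
                     server=None, retries=3, delay=5.0, sleep=time.sleep):
    """Run FETCH, reconnecting and retrying if the connection is aborted.

    Returns (server, typ, data) so the caller can keep using the
    possibly-new connection for later commands.
    """
    for attempt in range(retries + 1):
        try:
            if server is None:
                # Fresh connection: log in again and re-select the mailbox.
                server = connect()
                server.select(mailbox)
            typ, data = server.fetch(message_set, parts)
            return server, typ, data
        except imaplib.IMAP4.abort:
            server = None  # the old connection is dead; drop it
            if attempt == retries:
                raise  # give up after the last allowed attempt
            sleep(delay)  # give the server a moment before reconnecting
```

One caveat: message sequence numbers are only stable within a session if nothing is expunged, so a retry after deletions may need to re-run the search rather than reuse the old numbers.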

I guess your mail server may be very heavily loaded and timing out trying to do this even for 100 messages. However, you may be asking for problems with any IMAP server if you keep more than a million messages in a single mailbox! Not to mention using a lot of RAM on your local machine if you do manage to download even their headers...

shubhammatta commented 6 years ago

Thanks for the info. I reduced the chunk size to 1 and the script ran. If it aborts again, I will try adding the reconnect logic to the script, and will comment if that works. I hope it does not abort, though; I have been at it for quite some time now.