rtucker / imap2maildir

Backs up an IMAP mailbox to a maildir. Useful for backing up mail stored on free webmail providers, etc.
http://blog.hoopycat.com/index.php/2009/07/04/imap2maildir-a-tool-for-mirroring-imap-t
MIT License

Failure for ~200k messages #6

Open viric opened 13 years ago

viric commented 13 years ago

Hello,

After four or five interrupted runs, I have only managed to download about 47k of my 200k messages:
$ ls gmailbackup/new/ | wc -l
43815

I run this command to grab all the messages:
$ python imap2maildir -u xxxxxx -r "[Gmail]/Tots els missatges" -s ALL --create -v -d gmailbackup

and on every run I'm asked for the password, and then it goes:

Opening sqlite3 database 'gmailbackup/.imap2maildir.sqlite'
Synchronizing 199663 messages from imap.gmail.com:[Gmail]/Tots els missatges to /home/llbatlle/tmp/rtucker-imap2maildir-fa0abe3/gmailbackup...
TURBO MODE ENGAGED!
Exception!  Clearing locks and safing database.
Traceback (most recent call last):
  File "imap2maildir", line 495, in <module>
    main()
  File "imap2maildir", line 476, in main
    search=options.search)
  File "imap2maildir", line 396, in copy_messages_by_folder
    for i in folder.Summaries(search=search):
  File "/home/llbatlle/tmp/rtucker-imap2maildir-fa0abe3/simpleimap.py", line 357, in Summaries
    summ = self.__parent.get_summary_by_uid(u)
  File "/home/llbatlle/tmp/rtucker-imap2maildir-fa0abe3/simpleimap.py", line 256, in get_summary_by_uid
    '(UID ENVELOPE RFC822.SIZE INTERNALDATE)')
  File "/nix/store/qlmlvbsgb3q8iqlhkc7j8m6f9z71sbd6-python-2.6.5/lib/python2.6/imaplib.py", line 753, in uid
    typ, dat = self._simple_command(name, command, *args)
  File "/nix/store/qlmlvbsgb3q8iqlhkc7j8m6f9z71sbd6-python-2.6.5/lib/python2.6/imaplib.py", line 1060, in _simple_command
    return self._command_complete(name, self._command(name, *args))
  File "/nix/store/qlmlvbsgb3q8iqlhkc7j8m6f9z71sbd6-python-2.6.5/lib/python2.6/imaplib.py", line 890, in _command_complete
    raise self.abort('command: %s => %s' % (name, val))
imaplib.abort: command: UID => socket error: unterminated line

I cannot download anymore. It takes quite a lot of time until the error appears. Can it be that gmail disconnects due to an inactivity timeout?

viric commented 13 years ago

I notice that in checkmessage() turbo mode runs an SQL SELECT query for every candidate message to check whether it is already there. That is a lot of work; I think it would be far better to load the list into memory in an appropriate searchable structure and do the check there.
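A minimal sketch of that idea, not imap2maildir's actual code: read every already-stored UID from the sqlite database once and keep them in a Python set, so each check becomes an in-memory lookup. The table name `seen_messages` and its `uid` column are hypothetical; the real schema may well differ.

```python
import sqlite3


def load_seen_uids(conn, table="seen_messages"):
    """Read all stored UIDs with a single query and return them as a set.

    `table` and its `uid` column are hypothetical names used for illustration.
    """
    cursor = conn.execute("SELECT uid FROM %s" % table)
    return set(row[0] for row in cursor)


def unseen(uids, seen):
    """Return the server UIDs that are not yet in the local cache, in order."""
    return [u for u in uids if u not in seen]
```

A set of a few hundred thousand integers should fit comfortably in memory, and it replaces one query per UID with a single query up front.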

rtucker commented 13 years ago

I've run into a couple cases where a specific message is "corrupted" on gmail's end, and trying to fetch it via IMAP fails. In simpleimap.py, putting a try/except around the get_summary_by_uid should find the IMAP UID that is choking it:

try:
    summ = self.__parent.get_summary_by_uid(u)
except:
    # Report the UID that failed, then re-raise so the traceback is kept.
    print "uid", u
    raise

Once you have that, it should be possible to delete the offending message.

imap2maildir should be doing a better job of handling errors like these. And yes, it is doing a SQL query for each UID... I don't remember why I did it that way, but I think memory consumption was a concern. On second thought, it shouldn't take THAT much memory, and it would likely improve performance a lot. :-) Good catch.

viric commented 13 years ago

Gmail simply closes the socket due to that much inactivity during the first stage of the TURBO MODE.

Once the list of UIDs is in memory and the check is done there instead of with one SQL query per UID, I think turbo mode will work great.

I'm trying without turbo mode, but Gmail disconnects me before I reach even 15% of my mail.

rtucker commented 13 years ago

Well.

On my Gmail mailbox of ~145,000 messages:
Last night's run: about 3.75 hours
With a cache: 7 minutes, 22 seconds

Pull in the latest HEAD and let me know how that works for you.

viric commented 13 years ago

I just tried it. With turbo mode, and with the old maildir directory that already held some messages, I got:

Exception!  Clearing locks and safing database.
Traceback (most recent call last):
  File "./imap2maildir", line 536, in <module>
    main()
  File "./imap2maildir", line 517, in main
    seencache=seencache)
  File "./imap2maildir", line 435, in copy_messages_by_folder
    for i in folder.Summaries(search=search):
  File "/home/llbatlle/tmp/imap2maildir/simpleimap.py", line 357, in Summaries
    summ = self.__parent.get_summary_by_uid(u)
  File "/home/llbatlle/tmp/imap2maildir/simpleimap.py", line 256, in get_summary_by_uid
    '(UID ENVELOPE RFC822.SIZE INTERNALDATE)')
  File "/nix/store/hd089201zv5fb1lqdxscv194snnynplj-python-2.7/lib/python2.7/imaplib.py", line 753, in uid
    typ, dat = self._simple_command(name, command, *args)
  File "/nix/store/hd089201zv5fb1lqdxscv194snnynplj-python-2.7/lib/python2.7/imaplib.py", line 1060, in _simple_command
    return self._command_complete(name, self._command(name, *args))
  File "/nix/store/hd089201zv5fb1lqdxscv194snnynplj-python-2.7/lib/python2.7/imaplib.py", line 890, in _command_complete
    raise self.abort('command: %s => %s' % (name, val))
imaplib.abort: command: UID => socket error: unterminated line

I am not very good at Python, so sorry if I can't go into more detail about the code. :) I will try again with a newly created maildir.

rtucker commented 13 years ago

Well, at least it should be faster to test :-)

I just pushed a patch that will spit out the UID it choked on. Once you have that UID, you can try firing up Python and seeing if you can figure out what's wrong with the message:

import simpleimap
server = simpleimap.Server(hostname='imap.gmail.com', username='rtucker@gmail.com', password='blah').Get()
server.select('[Gmail]/All Mail')
server.uid('FETCH', 376544, '(RFC822)')

... would spit out message uid 376544. Try the neighboring messages (presumably 376543 and 376545) as well. You can also try:

    server.uid('FETCH', 376544, '(UID ENVELOPE RFC822.SIZE INTERNALDATE)')

to see what that does, since that's what it is trying to do when it crashes.

imap2maildir could easily ignore this exception and have it continue on, but I think understanding why it is happening will be a very good thing.
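As a rough sketch of what "ignore and continue" could look like (an illustration only, not the patch that was pushed): catch imaplib's error around each per-UID fetch and record the UIDs that failed so they can be inspected later. `fetch_one` is a hypothetical stand-in for get_summary_by_uid.

```python
import imaplib


def fetch_all(uids, fetch_one):
    """Fetch a summary for each UID, recording failures instead of stopping.

    `fetch_one` is a hypothetical callable standing in for
    get_summary_by_uid. After imaplib raises `abort`, the connection
    usually has to be reopened before more commands are sent; that
    reconnect step is left out of this sketch.
    """
    summaries, failed = [], []
    for uid in uids:
        try:
            summaries.append((uid, fetch_one(uid)))
        except imaplib.IMAP4.error as exc:  # IMAP4.abort is a subclass of error
            failed.append((uid, str(exc)))
    return summaries, failed
```

Keeping the failed UIDs together with the error text means the problem messages can still be examined by hand, as described above.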

Thanks! -rt

viric commented 13 years ago

Here you have it:

>>> server.uid('FETCH', 165982, '(RFC822)')
('OK', [('43816 (UID 165982 RFC822 {5523}', 'Delivered-To: viriketo@gmail.com\r\nReceived: by 10.142.169.1 with SMTP id r1cs178792wfe;\r\n        Sun, 28 Sep 2008 07:49:53 -0700 (PDT)\r\nReceived: by 10.115.23.19 with SMTP id a19mr4311058waj.133.1222613393492;\r\n        Sun, 28 Sep 2008 07:49:53 -0700 (PDT)\r\nReturn-Path: \r\nReceived: from n16a.bullet.sp1.yahoo.com (n16a.bullet.sp1.yahoo.com [69.147.64.121])\r\n        by mx.google.com with SMTP id t1si2136057poh.13.2008.09.28.07.49.52;\r\n        Sun, 28 Sep 2008 07:49:52 -0700 (PDT)\r\nReceived-SPF: pass (google.com: domain of sentto-9862331-5848-1222613385-viriketo=gmail.com@returns.groups.yahoo.com designates 69.147.64.121 as permitted sender) client-ip=69.147.64.121;\r\nDomainKey-Status: good\r\nAuthentication-Results: mx.google.com; spf=pass (google.com: domain of sentto-9862331-5848-1222613385-viriketo=gmail.com@returns.groups.yahoo.com designates 69.147.64.121 as permitted sender) smtp.mail=sentto-9862331-5848-1222613385-viriketo=gmail.com@returns.groups.yahoo.com; domainkeys=pass header.From=tradukado@yahoogroups.com\r\nComment: DomainKeys? 
See http://antispam.yahoo.com/domainkeys\r\nDomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=lima; d=yahoogroups.com;\r\n\tb=LSlgVDUGFtooqe064kt32c5atqJ2pBA+7kklkoqGGl95lG8xCcl8wjfXI6G5C61jPvg4vE0TWl1f2ZdNkYh5Xeade6B9I0le2BqDz8bMtZLINLIKi8XRYyp1pFTQEyGw;\r\nReceived: from [69.147.65.171] by n16.bullet.sp1.yahoo.com with NNFMP; 28 Sep 2008 14:49:45 -0000\r\nReceived: from [66.218.67.109] by t13.bullet.mail.sp1.yahoo.com with NNFMP; 28 Sep 2008 14:49:45 -0000\r\nX-Yahoo-Newman-Id: 9862331-m5848\r\nX-Sender: jorgos@aliceadsl.fr\r\nX-Apparently-To: tradukado@yahoogroups.com\r\nX-Received: (qmail 68424 invoked from network); 28 Sep 2008 14:49:42 -0000\r\nX-Received: from unknown (66.218.67.96)\r\n  by m45.grp.scd.yahoo.com with QMQP; 28 Sep 2008 14:49:42 -0000\r\nX-Received: from unknown (HELO mail.libertysurf.net) (213.36.80.105)\r\n  by mta17.grp.scd.yahoo.com with SMTP; 28 Sep 2008 14:49:42 -0000\r\nX-Received: from aliceadsl.fr (192.168.10.57) by mail.libertysurf.net (8.0.015)\r\n        id 482DC6AA00F031DC for tradukado@yahoogroups.com; Sun, 28 Sep 2008 16:49:42 +0200\r\nMessage-Id: \r\nX-Sensitivity: 3\r\nTo: "=?iso-8859-1?Q?tradukado?=" \r\nX-XaM3-API-Version: 3.2 R18 (B34 pl1)\r\nX-type: 0\r\nX-SenderIP: 91.171.195.43\r\nX-Originating-IP: 213.36.80.105\r\nX-eGroups-Msg-Info: 1:12:0:0:0\r\nFrom: "=?iso-8859-1?Q?jorgos@aliceadsl.fr?=" \r\nX-Yahoo-Profile: jorgos_esperanto\r\nSender: tradukado@yahoogroups.com\r\nMIME-Version: 1.0\r\nMailing-List: list tradukado@yahoogroups.com; contact tradukado-owner@yahoogroups.com\r\nDelivered-To: mailing list tradukado@yahoogroups.com\r\nList-Id: \r\nPrecedence: bulk\r\nList-Unsubscribe: \r\nDate: Sun, 28 Sep 2008 16:49:42 +0200\r\nSubject: =?iso-8859-1?Q?Re:[tradukado]_verboj_por_tabulaj_sportoj_(surftabulo,\r\n\t_negxtabulo,_rultabulo,_ktp)?=\r\nReply-To: tradukado@yahoogroups.com\r\nX-Yahoo-Newman-Property: groups-email-tradt-m\r\nContent-Type: text/plain; charset=ISO-8859-1\r\nContent-Transfer-Encoding: 
quoted-printable\r\n\r\nOni jam delonge neplu biciklumas au gitarludas sed biciklas=0D\r\nkaj gitaras (kvankam ne mem estas biciklo au gitaro) kaj=0D\r\npraktikas bicikladon kaj gitaradon, ^cu ne ? ; nu kial ne ? =0D\r\n=0D\r\n^Ciu elektu mem kaj la popolo decidos tion, kion akcepti...=0D\r\n=0D\r\nJs.=0D\r\n=0D\r\ntradukado, 28 Sep 2008 : verboj por tabulaj sportoj=0D\r\n(surftabulo, negxtabulo, rultabulo, ktp)=0D\r\n=0D\r\nSaluton,=0D\r\nkiel vi verbe esprimus la diversajn X-tabulan sportojn, ekz=0D\r\nuzon de=0D\r\nsurftabulo, negxtabulo, rultabulo, ktp?=0D\r\n1. simple verbigu la substantivon, kompreneble!=0D\r\nsurftabuli, negxtabuli, rultabuli, ...  Do "Li X-tabulas."=0D\r\n2. ne ne, tia verba formo de "tabul-" sensencas aux sugestas=0D\r\nke la=0D\r\nsubjekto ESTAS tia tabulo, do necesas aldoni -um al la=0D\r\nsubstantivo:=0D\r\nsurftabulumi, negxtabulumi, rultabulumi, ... Do "Li X-tabulumas"=0D\r\n3. ne eblas verbigi tiel, oni bezonas uzi ian verbon kun la=0D\r\nsubstantivo: rajdi surftabulon, gliti sur negxtabulo, veturi=0D\r\nper rultabulo, ... Do "Li iras per X-tabulo" aux "Li iras=0D\r\nX-tabule" ktp=0D\r\n4. io alia...?=0D\r\nKiel oni nomu la agadojn substantive?=0D\r\n1. surftabulado, negxtabulado, rultabulado, ...=0D\r\n2. surftabulumado, negxtabulumado, rultabulumado, ...=0D\r\n3. surftabulrajdado, negxtabulglitado, rultabulveturado, ...=0D\r\n4. io alia...?=0D\r\ndankon,    russ=0D\r\n\r\n\r\n\r\n---------------------- ALICE N=B01 de la RELATION CLIENT 2008*-------------=\r\n-------\r\nD=E9couvrez vite l\'offre exclusive ALICE BOX! En cliquant ici http://abonne=\r\nment.aliceadsl.fr Offre soumise =E0 conditions.*Source : TNS SOFRES / BEARI=\r\nNG POINT. Secteur Fournisseur d.Acc=E8s Internet\r\n\r\n\r\n\r\n------------------------------------\r\n\r\nYahoo! 
Groups Links\r\n\r\n<*> To visit your group on the web, go to:\r\n    http://groups.yahoo.com/group/tradukado/\r\n\r\n<*> Your email settings:\r\n    Individual Email | Traditional\r\n\r\n<*> To change settings online go to:\r\n    http://groups.yahoo.com/group/tradukado/join\r\n    (Yahoo! ID required)\r\n\r\n<*> To change settings via email:\r\n    mailto:tradukado-digest@yahoogroups.com=20\r\n    mailto:tradukado-fullfeatured@yahoogroups.com\r\n\r\n<*> To unsubscribe from this group, send an email to:\r\n    tradukado-unsubscribe@yahoogroups.com\r\n\r\n<*> Your use of Yahoo! Groups is subject to:\r\n    http://docs.yahoo.com/info/terms/\r\n\r\n'), ' FLAGS (\\Seen))'])

The main problem looks to be the Subject: line, which has a \r\n\t in the middle.

The relevant information from RFC 2822 is in section 2.2.3. In short:

""" The process of moving from this folded multiple-line representation of a header field to its single line representation is called "unfolding". Unfolding is accomplished by simply removing any CRLF that is immediately followed by WSP. Each header field should be treated in its unfolded form for further syntactic and semantic evaluation. """ (I took this reference from http://bugs.python.org/issue504152 )
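The unfolding rule quoted above can be sketched in a few lines of Python. This is only an illustration of the RFC text, not code from imaplib or imap2maildir, and the `unfold` name is made up for the example.

```python
import re

# A CRLF that is immediately followed by a space or tab marks a fold.
_FOLD = re.compile(r"\r\n(?=[ \t])")


def unfold(header_block):
    """Remove every CRLF that is directly followed by WSP (RFC 2822, 2.2.3).

    The whitespace itself is kept; only the CRLF is dropped.
    """
    return _FOLD.sub("", header_block)
```

Applied to the Subject header above, the `\r\n` before the tab disappears, while the `\r\n` that ends the header field stays in place.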

viric commented 13 years ago

Sorry, I now see it is a problem in imaplib, still present in Python 2.7 and Python 3. I'll have to work around it somehow.

viric commented 13 years ago

I had the chance to investigate the issue further. My mailbox has messages from one specific person whose long Subject lines violate RFC 2822: instead of being folded with CRLF + WSP, the subject is folded with only LF + WSP. That breaks parsing of the ENVELOPE response, because imaplib works with readline(), and readline() treats both \n and \r\n as end of line. I wrote a patch for imaplib so I can keep on downloading: when it finds a line ending in \n (not \r\n), it concatenates the next line and removes the \n\t sequence.
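The patch itself is not shown here, so what follows is only a guess at its shape: a small wrapper around readline() that glues a line ending in a bare LF to the line after it. The function name and the use of io.BytesIO in place of the real socket file are assumptions made for the example.

```python
import io


def read_logical_line(readline):
    """Read one line, joining continuations that follow a bare LF.

    `readline` is any callable that returns bytes, such as the readline
    method of a file-like object. A line ending in LF but not CRLF is
    treated as a broken fold: the LF and the leading tab or space of the
    next line are dropped and the two lines are concatenated.
    """
    line = readline()
    while line.endswith(b"\n") and not line.endswith(b"\r\n"):
        nxt = readline()
        if not nxt:  # end of stream: nothing left to join
            break
        line = line[:-1] + nxt.lstrip(b" \t")
    return line
```

For example, with `stream = io.BytesIO(b"Subject: long\n\tsubject\r\n")`, calling `read_logical_line(stream.readline)` returns the whole subject as a single CRLF-terminated line.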

rtucker commented 13 years ago

Cool! I, unfortunately, haven't had a chance to look at this yet but that's probably where I was headed.

I am not opposed to working around bugs in imaplib.py using simpleimap.py... see the SimpleImapSSL class for an example of this. The process of getting a bug fixed in the Python library is very slow, and then it has to actually make it onto people's systems via Debian/Ubuntu/RHEL/CentOS/. And yes, there are more than a few such bugs.

viric commented 13 years ago

Once I succeed in getting all my Gmail mail, I'll try to write something worth sending for that bug.

viric commented 13 years ago

Ouch - my quick hack worked for the case I had, but I hit a new one that is harder to defeat, and it also fails in the Python library, not in your code:

Date: Sat, 12 Aug 2006 21:07:54 +0400
Subject: [EK-MASI] =?koi8-r?B?IkFydG8ga2FqIGFrdGl2ZWNvIg0KDQojRWtvdG9waW8gMjAwNiBaYWplanhv?= =?koi8-r?B?dmEgU2xvdmFraW8j?=
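One way to see what is unusual about this Subject is to decode its encoded words with the standard email.header module. The sketch below does only that; whether the line breaks hidden in the base64 payload are what actually trips up imaplib is an assumption, not something confirmed in this thread. The `decode_subject` helper is made up for the example.

```python
from email.header import decode_header

RAW_SUBJECT = (
    "[EK-MASI] "
    "=?koi8-r?B?IkFydG8ga2FqIGFrdGl2ZWNvIg0KDQojRWtvdG9waW8gMjAwNiBaYWplanhv?= "
    "=?koi8-r?B?dmEgU2xvdmFraW8j?="
)


def decode_subject(raw):
    """Decode the RFC 2047 encoded words in a header value to text."""
    parts = []
    for data, charset in decode_header(raw):
        if isinstance(data, bytes):
            data = data.decode(charset or "ascii", errors="replace")
        parts.append(data)
    return "".join(parts)


# repr() makes the control characters inside the decoded text visible.
print(repr(decode_subject(RAW_SUBJECT)))
```

The decoded text contains two CRLF pairs in the middle of the subject, so the line breaks are inside the encoded payload itself rather than in the header folding.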

rtucker commented 13 years ago

Niiiice!

See my comment on Issue #10 -- having the "raw" response from the IMAP server helps with testing the weird ones.