Open GoogleCodeExporter opened 8 years ago
Another solution is to shorten all file names to a certain 'reasonable' limit,
e.g. 128 characters. The full, correct subject of the mail is still saved
inside the file, so no information is lost by doing that.
The advantage of shortening the file name to something reasonable like 128
characters is also that the backed up files will then be easier to handle by
other programs. Even if Gmail Backup could write the files with long file names
(full path > 255 characters), other file handling software might not be able to
handle those files, e.g. if you want to move them, or open them in another
program.
Also, even Gmail Backup itself might have problems reading files with long file
names when it needs to restore those mails, which is an important part of its
functionality. So if Gmail Backup is changed so it can write long file names it
must also be changed so it can read those files. Right now, it would probably
fail at that. But if the file names are shortened before writing, and the
shortened file names are stored correctly in the index files "ids.txt" and
"labels.txt", I am pretty sure Gmail Backup can also restore those mails
without further change in the software.
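
The shortening idea above can be sketched roughly as follows. This is only an illustration in Python 3, not code from Gmail Backup: the function name shorten_file_name and the constant MAX_NAME_LENGTH are made up for the example, and 128 is simply the 'reasonable' limit proposed above.

```python
import os

# Hypothetical limit taken from the suggestion above; not a Gmail Backup setting.
MAX_NAME_LENGTH = 128


def shorten_file_name(file_name, max_length=MAX_NAME_LENGTH):
    """Truncate the stem of file_name so the whole name fits in max_length characters."""
    stem, extension = os.path.splitext(file_name)
    # Keep the extension intact and cut only the stem (at least one character stays).
    keep = max(1, max_length - len(extension))
    return stem[:keep] + extension
```

Names that are already short pass through unchanged, so only the mails with very long subjects get different file names.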
Original comment by jesper.h...@gmail.com
on 21 May 2013 at 3:17
I can see in gmb.py, class DirectoryStorage, that all file names are "cleaned"
with a call to _cleanFilename(), and that this function actually truncates
*part of* (but only a part of) the file path to a maximum of 240 characters:
    def _cleanFilename(self, fn):
        '''Cleans the filename - removes diacritics and other filesystem special characters
        '''
        ...
        ret = ret[:240]
        return ret
The part that is shortened to 240 characters is this:
"2011/06/20110603-084202-breakingnews@mail_cnn_com-CNN_Breaking_News"
That is, it includes two directories too (year and month), but not the backup
folder (the path leading up to those two directories, for instance
"D:\Backup\Gmail Backup\"), and not the final part "-1.eml", where "1" could
potentially be a larger number but most often is just a one digit number.
As "-1.eml" takes 6 characters there are only 255-240-6 = 9 characters left for
the backup folder path.
So, according to these calculations:
1) If the backup folder path is bigger than 9 characters, Gmail Backup fails
when there is a long subject in a mail.
2) If the backup folder path is 9 characters or less, Gmail Backup never fails,
even when there is an extremely long subject in a mail.
A backup folder path like "D:\Backup" (9 characters) would do fine, but any
path longer than that will fail for mails with long subject lines.
So, the problem seems to be that only *part* of the file path is being
truncated and that the backup folder is not included in this truncation.
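
The arithmetic above can be written as a small Python 3 helper. It is a sketch only: name_budget and its parameters are hypothetical names, and it follows the same simplified calculation as above (255 characters in total, 6 for "-1.eml", no extra allowance for a path separator).

```python
PATH_LIMIT = 255                 # overall limit used in the calculation above
SUFFIX_LENGTH = len("-1.eml")    # 6 characters, assuming a one-digit number


def name_budget(backup_folder, path_limit=PATH_LIMIT, suffix_length=SUFFIX_LENGTH):
    """Return how many characters remain for the 'year/month/...subject' part."""
    return path_limit - len(backup_folder) - suffix_length
```

With "D:\Backup" the budget is exactly the 240 characters that _cleanFilename() allows; with "D:\Backup\Gmail Backup\" (23 characters) it drops to 226, so a name truncated to 240 characters is 14 characters too long.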
Another problem is that even if Gmail Backup fails to write to a file (for
instance because the path is too long), the email is still being marked as
having been backed up in ids.txt:
    self.message_iid2fn[msg_iid] = msg_fn_num
    self.message_fn2iid[msg_fn_num] = msg_iid
    fw = file(full_fn_num, 'wb')
    try:
        fw.write(msg)
    finally:
        fw.close()
This should have read:
    fw = file(full_fn_num, 'wb')
    try:
        fw.write(msg)
    finally:
        fw.close()
    self.message_iid2fn[msg_iid] = msg_fn_num
    self.message_fn2iid[msg_fn_num] = msg_iid
so that the last two lines are only executed if the file was written
successfully.
Because of this extra bug, even if the original bug is solved, Gmail Backup
will think that the mails that previously had a too long path to be backed up
*have* already been backed up - even though they haven't. The simple way to
solve this is to delete the existing backup (or move it to a safe place
temporarily and delete it later) and make a new backup from scratch - after
these two bugs have been fixed, that is.
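
As a standalone illustration of the "write first, register afterwards" ordering, here is a minimal Python 3 sketch. The function store_message and its arguments are invented for the example; the two dictionaries merely stand in for message_iid2fn and message_fn2iid.

```python
def store_message(full_path, msg_id, file_name, message_bytes, id_to_name, name_to_id):
    """Write the message to disk and register it only if the write succeeded."""
    with open(full_path, "wb") as out:
        out.write(message_bytes)
    # These lines are reached only if open() and write() raised no exception,
    # so a failed write never leaves an entry behind in the index.
    id_to_name[msg_id] = file_name
    name_to_id[file_name] = msg_id
```

If the path is too long (or the write fails for any other reason), the exception propagates and both dictionaries stay untouched, so the mail is tried again at the next backup.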
Original comment by jesper.h...@gmail.com
on 21 May 2013 at 5:13
Actually, I personally have 411 mails that have not been backed up but occur
in the ids.txt file. I thought it was just a few, but it wasn't.
I ran this Python 2.7 script placed in the root of the backup folder to check
which files actually exist:
    import os.path, os
    def fileDoesNotExist(x): return not os.path.isfile(x)
    idsList=file("ids.txt", "r").readlines() # Get the lines.
    filenameList=map(lambda x: x.split("\t")[0], idsList) # Get the file names.
    # Get the file names of non-existing files.
    nonExistingFilenameList=filter(fileDoesNotExist, filenameList)
    print len(nonExistingFilenameList) # Print the number of non-existing files.
    outfilename="nonexisting.txt"
    file(outfilename, "w").write("\n".join(nonExistingFilenameList))
    os.startfile(outfilename)
and it gave me a list of 411 files that the "database" file ids.txt claims
exist, but that don't. That means that 411/77205 * 100 % = 0.5 % of all my
mails have not been backed up because of this bug. Note that I am not saying
this as a criticism to the programmers (after all, they gave the software away
for free and it did backup 99.5 % of my mails) but just as a fact and as
information for others to be aware of.
Of course I will return if I can find a good solution to the problem.
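
For anyone who wants to repeat the check with Python 3 (where file() and the print statement no longer exist), a rough equivalent of the script above could look like this. It assumes, like the script above, that each line of ids.txt is tab-separated with the file name first; the UTF-8 encoding and the function name missing_files are assumptions of this sketch.

```python
import os


def missing_files(ids_path, backup_root):
    """Return the file names listed in ids.txt that do not exist under backup_root."""
    missing = []
    # The encoding is an assumption; adjust it if ids.txt uses another one.
    with open(ids_path, "r", encoding="utf-8") as ids_file:
        for line in ids_file:
            name = line.rstrip("\n").split("\t")[0]
            # Skip empty lines, then test the path relative to the backup folder.
            if name and not os.path.isfile(os.path.join(backup_root, name)):
                missing.append(name)
    return missing
```

Calling missing_files("ids.txt", ".") from the root of the backup folder gives the list whose length corresponds to the 411 reported above.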
Original comment by jesper.h...@gmail.com
on 21 May 2013 at 11:13
Hi
Same problem with me for some time. I back up the e-mails to a local dir and
copy it to a server, where the problem occurs.
A small update addressing this, as described by the other users, would be very
much appreciated ... ;-))
Very best regards,
Markus
Original comment by Markus.G...@gmail.com
on 1 Jul 2013 at 12:22
Okay, I've had a look at the code and I am able to run it with some simple
bugfixes that address both this problem and Issue 17
(https://code.google.com/p/gmail-backup-com/issues/detail?id=17). I am testing
it right now on my own Gmail account. It will run for many hours as I am
creating a whole new backup. (As I mentioned earlier, it is necessary to create
a new backup from scratch when using the new version of the program.) Until now
the test looks promising with 4000 mails backed up without a single failure.
It also seems like I am able to compile the program and create a setup program
for Windows.
So if my test turns out successfully, I plan to (somehow) release a setup
program for Windows with the bugfixed version. I also plan to release a diff
file to show exactly what I changed in the source code.
Original comment by jesper.h...@gmail.com
on 2 Jul 2013 at 2:17
Thank you so much! Hope it all worked fine and you'll be able to provide a new
setup routine.
Original comment by Markus.G...@gmail.com
on 2 Jul 2013 at 5:58
Well, I tested it with around 67,000 emails now.
And it does look better: It saved 67,061 emails as .eml files, and all the
files have different content (I tested that), but in the database file ids.txt
only 67,034 emails were registered, which is a bit strange given that no errors
were reported. So, 27 emails out of 67,061 emails were not registered in the
database; that is 0.04%. So 99.96% of the emails were backed up and registered
correctly. That is to be compared with the 99.5% of the previous version.
And all the emails in the ids.txt database are actually present as files, which
is a big improvement. :o)
So while the program still does have bugs, it backs up more emails in "my"
version of it than in the previous version.
If anyone wants to try "my" new version, be sure to start an entirely new
backup with the new version. This is because it will cut the subject lines
shorter (140 characters shorter for the longest subject lines) and therefore
the old database files and old .eml files do not match this version. (This
version might still be able to restore mails backed up by an earlier version
of the program. It just can't continue a backup made by an earlier version.)
As it seems like the original authors are not listening in here, I am supplying
a new Windows installer and a diff to show what I changed. They are both
attached to this comment. I probably ought to fork the new version as a new
project, but as I haven't done that before I will see if I can find the energy
for that later. I guess the most important thing is to just get a (more)
working version out there. Then we can see about forking into a new project if
the original authors really don't want to continue this project. Otherwise they
can simply implement my changes into their own project.
I have given "my" version the revision number 10000 to distinguish it clearly
from the proper versioning. I would have preferred to call it "20a", but the
program crashes if the revision is not an integer :o).
I am also attaching the source code.
Original comment by jesper.h...@gmail.com
on 3 Jul 2013 at 12:41
saloveju20@gmail.com
Original comment by 222sa...@gmail.com
on 29 Sep 2013 at 5:07
@jesper.h...@gmail.com:
Thank you so much! It's quite strange: first I installed your version, but in
the end I had the same problem with file name length while copying. Later I did
the same again and then it worked fine!
Thank you very much!
Original comment by Markus.G...@gmail.com
on 30 Sep 2013 at 6:55
You're very welcome!
Maybe the first time you tried my version, you didn't do a clean new backup to
an empty folder or delete the message files created by the previous version
first? Because in that case, there would still be old files with too long file
names lying around, as my version isn't clever enough to try to rename the
existing old long file names to the new, shorter ones.
It would be quite cool if the program was that clever, though. Because then the
maximum file name lengths could be adjusted at any time as a user setting, and
the program could then automagically adapt the file name lengths of the
existing files to the new maximum length. I'll put that on my list of Potential
Cool Improvements. But the most important thing for me is to make sure that
there are no major bugs in the source code, i.e. that the created backup is
both correct and complete.
I have been working on the source code for some time in the summer to try to
get an overview of it and clean it up a bit (called "refactoring"). During
that process I also found some very small bugs and fixed them. None of those
bugs had anything to do with the core backup or restore process, though.
I have been thinking of releasing the most recent version of the source code
after my refactoring with the minor bugs fixed. I would just like to test it
thoroughly first to prove (or rather make it probable) that I did not introduce
any new bugs by refactoring the source code (I don't think I did, but I would
like to prove it to a reasonable extent). I am still thinking about a good way
to do that testing, as there are some challenges connected to that.
Original comment by jesper.h...@gmail.com
on 1 Oct 2013 at 11:30
Sounds pretty good!
Unfortunately I can't remember exactly what I did before starting with your
version, but since you advised to start with a new backup I assume I did
accordingly, but am not completely sure ...
Anyway, it's working perfectly for me - that's all that counts ... ;-)
Very best regards,
Markus
Original comment by Markus.G...@gmail.com
on 1 Oct 2013 at 9:51
I am very happy to hear that, Markus!
Very best regards to you too,
Jesper
Original comment by jesper.h...@gmail.com
on 1 Oct 2013 at 11:07
I am so crazy to seek my friends
Original comment by 222sa...@gmail.com
on 15 Oct 2013 at 5:30
Hi Jesper
Could your 27 missed (unregistered) messages by chance have the 140 characters
of the name identical to those of some other message(s) after the shortening?
I am not yet trained enough to see if you have addressed that possibility in
your re-naming procedure.
- Just an idea!
Original comment by ener...@gmail.com
on 19 Nov 2013 at 8:25
Hi ener...,
Thank you for your suggestion! I don't think that is the case, though, because
all filenames get a dash and a number added ("-1", "-2", etc.) *after* the
shortening of the name to make sure the resulting file name is unique. The
program simply increments the number by one again and again until the resulting
file name is unique. And it is the resulting unique file name that is put in
the ids.txt file.
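
The numbering scheme just described can be sketched like this in Python 3. It is not the program's actual code: unique_file_name and the set taken are hypothetical, and the real program checks against its own index rather than a plain set.

```python
def unique_file_name(base_name, taken, extension=".eml"):
    """Append -1, -2, ... to base_name until the result is not in taken."""
    number = 1
    while True:
        candidate = "%s-%d%s" % (base_name, number, extension)
        if candidate not in taken:
            return candidate
        # Name already used: try the next number.
        number += 1
```

Because the number is added after the truncation, two mails whose shortened names are identical still end up in different files.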
I am more inclined to believe that maybe the supposed "unique" IDs of some
emails are not always unique. And there is no check that all the IDs are
actually unique.
Also, there is a function in the program which contains a bug that potentially
could make two different IDs identical if one or both of the IDs contain
non-ascii characters.
Jesper
Original comment by jesper.h...@gmail.com
on 20 Nov 2013 at 4:25
Hi Jesper....
Thank you for your efforts, I am presently backing up my 60,000+ messages, and
it certainly takes a while.
Could it be an option inside the backup2000 to actually:
1) Test for the uniqueness you mention, and, if necessary, make them unique.
2) Slightly different subject: Maybe it is worth to test for the actual length
of the sub-directory that the backup shall store to, and shorten the filenames
accordingly, instead of simply nominating 140 characters or some other magic
limit, which in some cases could result in exceeding the length limit of 255.
BTW: Could the words Vojens, and Ebeltoft, by chance have any special
significance for you? .. Just curious!
Regards,
Ole Knudsen
Original comment by ener...@gmail.com
on 20 Nov 2013 at 8:28
Hi Ole,
Thank you for your suggestions!
> 1) Test for the uniqueness you mention, and, if necessary, make them unique.
Yes. If there really is a problem with conflicting IDs, we do need to find a
way to make really unique IDs.
But we can't just make up our own IDs to make the IDs unique in that way.
Because to determine whether to backup a specific email or not, the program
reads the ID of the email from the Gmail server and compares it to the IDs of
emails that have already been downloaded. If it matches an already downloaded
message, the message will not be downloaded again. But if we change the ID of a
downloaded message, we cannot compare that new local ID to the online ID
anymore, as they will for sure be different, and this would trigger a new
download of the message at every backup (and even adding the downloaded file as
a new file next to the already downloaded file).
A better option would perhaps be to always create an artificial ID composed of
several of the metadata fields of an email, including its stated ID and its
date and time stamp. That would reduce the risk of duplicate IDs.
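
One possible shape for such an artificial ID, as a Python 3 sketch only: the function composite_id is hypothetical, and the choice of fields (the stated message ID, the date header and, as an extra assumption of this example, the sender) as well as the use of SHA-256 are illustrative, not something the program does today.

```python
import hashlib


def composite_id(message_id, date_header, sender):
    """Build an artificial ID from several metadata fields of an email."""
    # Join with a control character that is very unlikely to occur inside a
    # header value, so that ("ab", "c", ...) and ("a", "bc", ...) stay different.
    joined = "\x1f".join((message_id, date_header, sender))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

Two mails that share a stated ID but differ in date or sender then get different artificial IDs, although a uniqueness check is still needed.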
But of course, the program should still check that the resulting artificial ID
really is unique. And if not, the program should handle that situation in a
sensible way. I haven't quite figured out what sensible way that could be.
Maybe it could mark the emails that have duplicate IDs in the database in a
special way and in the future *always* download these exceptional emails at
*every* backup – simply because it cannot determine whether that email has
been backed up before. The problem with that is that it would again have to
create a new file at every backup and in that way pile up files with identical
messages.
A special problem is that every time you download the same message it can be
binary different. This happens if you have installed an antivirus scanner (such
as Avast) that inserts a special message header saying that the message was
scanned and at what date and time it was scanned. As the time is new at every
download, every downloaded message will be different – unless you filter out
headers with often used antivirus header names (such as the "X-Antivirus"
header). So in that case you can't even check to see if the downloaded file is
identical to an already downloaded file, as it never will be.
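
A minimal Python 3 sketch of the header filtering mentioned above, using the standard email package. The function content_fingerprint and the tuple VOLATILE_HEADERS are hypothetical; only "X-Antivirus" comes from the text above, and other scanners may use other header names.

```python
import email
import hashlib

# Headers that change at every download; extend the tuple for other scanners.
VOLATILE_HEADERS = ("X-Antivirus",)


def content_fingerprint(raw_message):
    """Hash a raw message (bytes) after removing headers that vary per download."""
    msg = email.message_from_bytes(raw_message)
    for name in VOLATILE_HEADERS:
        # Deleting a header removes all its occurrences and is a no-op if absent.
        del msg[name]
    return hashlib.sha256(msg.as_bytes()).hexdigest()
```

Two downloads of the same mail that differ only in the scan time stamp then get the same fingerprint, so they can be recognised as duplicates.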
Another way to solve the problem could be to perhaps use some Gmail extension
to the IMAP protocol to get a guaranteed unique ID. Because I believe Gmail
internally must have unique IDs on all emails. But I don't know if those IDs
are available..... Let me see.... Well, they are!
https://developers.google.com/gmail/imap_extensions#access_to_the_gmail_unique_message_id_x-gm-msgid
This would of course lock the program even more to Gmail
(making it slightly more difficult to generalize it to handle any IMAP server),
but as it is already somewhat locked to Gmail, that wouldn't be a big problem.
So, a solution could be to switch entirely to use Gmail's internal IDs instead.
The only problem I can see with that is that people already using the program
would have to start new backups again from scratch. Unless I can come up with a
clever way of automatically updating the IDs without downloading all messages
again. Maybe I can, but it would make the change more complex to program. And
the first backup using the new version of the program would take a looooong
time, as all IDs would have to be updated. But I guess it would take an even
longer time to actually download all messages again.
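
Reading the Gmail ID could look roughly like this with Python 3's imaplib. This is an untested sketch: the two function names are made up, the connection is assumed to be an already logged-in imaplib.IMAP4_SSL with a mailbox selected, and the exact layout of the server's response line is an assumption, which is why the parser only searches for the X-GM-MSGID item.

```python
import re

_MSGID_PATTERN = re.compile(rb"X-GM-MSGID (\d+)")


def parse_gmail_message_id(fetch_response_line):
    """Extract the numeric X-GM-MSGID from one FETCH response line (bytes)."""
    match = _MSGID_PATTERN.search(fetch_response_line)
    return int(match.group(1)) if match else None


def fetch_gmail_message_id(connection, uid):
    """Ask the server for the X-GM-MSGID of the message with the given UID."""
    status, data = connection.uid("FETCH", str(uid), "(X-GM-MSGID)")
    if status != "OK" or not data or data[0] is None:
        return None
    return parse_gmail_message_id(data[0])
```

Since only the ID is fetched and not the message body, updating the IDs of an existing backup this way should be much cheaper than downloading every message again.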
> 2) Slightly different subject: Maybe it is worth to test for the actual
> length of the sub-directory that the backup shall store to, and shorten the
> filenames accordingly, instead of simply nominating 140 characters or some
> other magic limit, which in some cases could result in exceeding the length
> limit of 255.
It is a good idea, and I did think a lot about doing that too. But some do copy
the entire backup to another folder, maybe even to another disk or file system
(see Markus' comment #4 above). And if we use the longest paths possible, the
risk that the file names won't fit on another file system or even in another
folder increases. So I still think it is best to simply keep the file names
relatively short and not assume anything about the destination file system.
> BTW: Could the words Vojens, and Ebeltoft, by chance have any special
> significance for you? .. Just curious!
Well, I recognize them as towns here in Denmark, and my brain associates Vojens
with speedway and Ole Olsen, and Ebeltoft with the beautiful place near which I
have often arrived in Jutland (or left it) by car :o). But otherwise I have no
personal connection to these towns.
Regards,
Jesper
Original comment by jesper.h...@gmail.com
on 23 Nov 2013 at 6:09
Thanks for your detailed reply.
Re 1):
I see no harm in implementing a version that will play with Gmail only, as long
as it is clearly marked as such. There is no need to continue maintaining the
general IMAP version; that can be left to others. I even tried to locate the
officially latest version 20, but to no avail, so the original developers may
already have dropped the ball.
So yes, if you are game to implement a version using Gmail's internal IDs, then
I'm ready to try it out, even if it takes more than 24 hours to back up my
74,000 messages.
Re 2):
I do not know what IMAP or Gmail reads in a filename, but if it had been in my
day, I would have implemented something that made each name unique right up
front, and then maybe either used the very first part of the name as a lookup
key into a separate database, or simply recorded the original complete filename
somewhere in the file's header.
Re BTW): You have a namesake in Ebeltoft, who was born in Vojens :) - - A UNI
year one mate of mine.
Original comment by ener...@gmail.com
on 23 Nov 2013 at 8:53
Original issue reported on code.google.com by
adriano....@gmail.com
on 30 Apr 2013 at 2:52