vsespb / mt-aws-glacier

Perl Multithreaded Multipart sync to Amazon Glacier
http://mt-aws.com/
GNU General Public License v3.0

Cleaning up duplicate files #95

smcgivern closed this issue 9 years ago

smcgivern commented 9 years ago

I'm not sure how, but my vault has a lot of duplicate files (not all files have been duplicated, and not all have been duplicated the same number of times):

$ cat Music.journal | cut -f 5,6,7,8 | sort | uniq -c | sort -nr | head -n 1
      7 43807793    1214763861  40206cb83dfdf929900013b839d69c2618b68fa6c47716bd65ec28b6176fcf8e    Björk/Volta/Björk-02-Wanderlust.flac

So all seven of these files in the archive have the same size, mtime, treehash, and filename. I don't see a way to delete files by archive ID, but if I do that manually and remove the entries from the journal, is that a safe operation?

(I won't be deleting the files for a little while as it's still < 3 months since I put them there.)

vsespb commented 9 years ago

> but if I do that manually and remove the entries from the journal, is that a safe operation?

Yes. You can extract the files to delete into a new journal and run purge-vault with that new journal. Then wait 24h and run retrieve-inventory followed by download-inventory to rebuild the journal: https://github.com/vsespb/mt-aws-glacier#restoring-journal

I am not sure why you have duplicates, so make sure those are real duplicates on the server side, i.e. they have different archive IDs, and that you keep at least one archive for each filename.
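As a rough sketch of the extraction step described above (not part of mtglacier itself), the shell function below copies every redundant entry into a separate journal that could then be reviewed and passed to purge-vault. It assumes the journal is tab-separated with the field order seen in this thread (archive ID in field 4; size, mtime, treehash, and filename in fields 5 to 8) and that there are no DELETED entries for the affected files. The name `split_duplicates` and the file names are hypothetical.

```bash
# Sketch only: split_duplicates is a hypothetical helper, not an mtglacier command.
# Assumes tab-separated journal lines laid out as in this thread:
#   B <time> CREATED <archive_id> <size> <mtime> <treehash> <filename>
split_duplicates() {
  journal="$1"      # existing journal, only read
  purge_list="$2"   # new journal listing only the extra copies
  awk -F '\t' '
    $3 != "CREATED" { next }                 # ignore any other record types
    ids[$4]++ { next }                       # archive ID already listed: not a server-side duplicate
    seen[$5 FS $6 FS $7 FS $8]++ { print }   # 2nd and later copy of the same size/mtime/treehash/filename
  ' "$journal" > "$purge_list"
}
```

For example, `split_duplicates Music.journal Music.dupes.journal` leaves `Music.journal` untouched and keeps the first archive of each file out of the purge list, so at least one copy per filename survives. It is worth inspecting the resulting file by hand before purging anything.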

> my vault has a lot of duplicate files

Possible ways to run into this:

1) Use always-positive

2) Drop your journal file, then start a backup to a new journal. Then download-inventory. Repeat 7 times.

If you can remember in detail how you worked with the journal, I can try to investigate and find out why this happened.

smcgivern commented 9 years ago

These definitely appear to be duplicates on the server side. I retrieved a new journal file to check, and the vault size is bigger than expected (although not so big that it's really expensive, just annoying).

I'm sorry that I don't have more details on how this happened - I only noticed once I got my bill. My best guess at the moment is that I set up cron wrongly, and I was running multiple mtglacier instances against the same journal at once. The duplicate files are close to each other in the journal file, which might indicate that (I don't know if mtglacier locks the journal file).

Thanks so much for the quick response, let me know if there's anything else you want me to do to help debug this.

$ grep -n 'Björk/Volta/Björk-02-Wanderlust.flac' Music.journal
10007:B 1410599059  CREATED 0b7GVI-Lxjp_jVyYiMizZZb084aLN-uktjP6IbmG7iLlJxd289C6CHfKWwEP8IF_TDFbHI7KWZPr24paLrKPTzAIH8qYzAUcpDTjanSNBAxhjfcNbst4zMsPPm2edP3i5AODZibJBQ  43807793    1214763861  40206cb83dfdf929900013b839d69c2618b68fa6c47716bd65ec28b6176fcf8e    Björk/Volta/Björk-02-Wanderlust.flac
10016:B 1410599806  CREATED QkwdbLDjEeaS8D32x1wk0WIBcXfuSKd6kizW7rNEuAzu85f-eNO52XXQqS7i98RRNB8sDLRoLEnFmwpZ6d9NnYKR-JyfbxciUYbpcW-HKBLrdAtMtrPUtNNCoHJdjpq11L8s__ONNQ  43807793    1214763861  40206cb83dfdf929900013b839d69c2618b68fa6c47716bd65ec28b6176fcf8e    Björk/Volta/Björk-02-Wanderlust.flac
10018:B 1410600114  CREATED o3see3QYM4ikqWFEpgZyRT7T0OYAemlk2sxJvCIn7fJtxxGkNwNivg_G30_m4WXF1i81vZoNJzV0uKb_m1INOC42jQM-pfJ1lx17tdMpNol5b6qbqIRjGdiHX5U-O-h1zYA4CyGpdg  43807793    1214763861  40206cb83dfdf929900013b839d69c2618b68fa6c47716bd65ec28b6176fcf8e    Björk/Volta/Björk-02-Wanderlust.flac
10023:B 1410600344  CREATED ir7_cl3xLY3NhkLbCx6Ei69lg28kqj6cQeDcrTwQQnqKz-1i2n1aDJ78H8rCnm08g_fDiSD4kjxiL7GfT8sh5RKKt20t9-bopY1g4lMDHn_NrXCtVF6PbK15-hrJi6DtngeafNHUsA  43807793    1214763861  40206cb83dfdf929900013b839d69c2618b68fa6c47716bd65ec28b6176fcf8e    Björk/Volta/Björk-02-Wanderlust.flac
10033:B 1410601229  CREATED PVYLZscYQi_5mbPH-xDNklbXzXLHDIG3-SeONzUcGJioychyWWhoidvScZUtVTi3t0cuE-Q79TjQXGjM39TqCjvvt66iXl4ohdeEOQkdk8g3m4Q1KvYiwwGZ8aMCtTflX_MppnokoA  43807793    1214763861  40206cb83dfdf929900013b839d69c2618b68fa6c47716bd65ec28b6176fcf8e    Björk/Volta/Björk-02-Wanderlust.flac
10045:B 1410602363  CREATED Ckp7h9pP3oyM-cZLGfdXGyQ9-wY6U0qnDCDhXx0C2nwpvFEsY8vXFJjaLm3XyugENdfkuYWVj1w95Hab3LIKCBoOYK3jKk4ALekYMgS7z5Ez1WcmZpF6hlU0esD1nDFPtKsL67HZ8A  43807793    1214763861  40206cb83dfdf929900013b839d69c2618b68fa6c47716bd65ec28b6176fcf8e    Björk/Volta/Björk-02-Wanderlust.flac
10073:B 1410604494  CREATED NbrmfrGWwPkV_lS9_nFPbuzahskcvmoTvkoTGWgAq_K1pqu88SignldoWR-4hEj_m0L1LKJpWJNJjE0i3MkXVZzjR_tNPfS2-9LWa-Zv6gsJL3FLTdI7uHgJj5n2H6RYAAhOzXajXA  43807793    1214763861  40206cb83dfdf929900013b839d69c2618b68fa6c47716bd65ec28b6176fcf8e    Björk/Volta/Björk-02-Wanderlust.flac

vsespb commented 9 years ago

> I was running multiple mtglacier instances against the same journal at once

Yes, that could be the reason.

> I don't know if mtglacier locks the journal file

No :( I opened issue #96 as an enhancement. In the meantime you could use flock in situations where you're not sure whether concurrent processes run against the same journal (I use flock for my own backups with mtglacier).
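A minimal sketch of the flock approach, assuming the `flock` utility from util-linux is installed. `run_locked` is a hypothetical wrapper name, the lock file path is arbitrary, and the mtglacier arguments in the usage comment are placeholders rather than a tested invocation.

```bash
# Sketch only: run_locked is a hypothetical wrapper around flock(1) from util-linux.
# Runs the given command while holding an exclusive lock on the lock file.
# With -n, flock fails immediately instead of waiting if the lock is already held.
run_locked() {
  lock_file="$1"
  shift
  flock -n "$lock_file" "$@"
}

# Possible use from a cron script (paths and options are placeholders):
#   run_locked /tmp/Music.journal.lock mtglacier sync --config glacier.cfg \
#     --vault Music --journal Music.journal --dir /data/Music
```

With a wrapper like this, a cron run that overlaps a still-running one exits with a non-zero status instead of working on the same journal concurrently.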

> The duplicate files are close to each other in the journal file

1410604494 - 1410599059 is a range of about 90 minutes. That could be true if the whole backup process takes longer than 90 minutes and you had 7 mtglacier instances running at once.
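The arithmetic can be checked directly in the shell. This is only an illustration; `span_minutes` is a hypothetical helper, and the two timestamps are the first and last journal entries for the file quoted earlier in the thread.

```bash
# Sketch only: span_minutes is a hypothetical helper.
# Prints the whole number of minutes between two Unix timestamps.
span_minutes() {
  echo $(( ($2 - $1) / 60 ))
}

span_minutes 1410599059 1410604494   # prints 90 (the difference is 5435 seconds)
```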

> let me know if there's anything else you want me to do to help debug this

Probably not; concurrent access to the journal explains this.

smcgivern commented 9 years ago

Thanks! I'll be more careful in future :smiley: