mholt / photobak

Back up your content from Google Photos - DEPRECATED: use Timeliner
https://github.com/mholt/timeliner

OOM killed due to memory leak with NEF/TIFF files #2

Closed: mpx closed this issue 5 years ago

mpx commented 7 years ago

I've found photobak is regularly OOM killed by the Linux kernel:

kernel: Out of memory: Kill process 9042 (photobak) score 695 or sacrifice child
kernel: Killed process 9042 (photobak) total-vm:30599888kB, anon-rss:13104776kB, file-rss:0kB, shmem-rss:0kB
kernel: oom_reaper: reaped process 9042 (photobak), now anon-rss:0kB, file-rss:4kB, shmem-rss:0kB

The above instance was built with go1.8rc1. I retested with go1.7.4:

RSS doesn't grow while it rescans already-downloaded folders, but it appears to increase by roughly the size of each photo downloaded. I haven't waited for it to get OOM killed again, but that seems likely since it's already up to 1.4G RSS and growing.

mholt commented 7 years ago

What command are you running photobak with? What kind of device are you running it on? I've watched memory very carefully and haven't seen any problems.

mpx commented 7 years ago

photobak -googlephotos USERNAME@gmail.com -repo DIR -v

I spoke too soon. The rerun with go1.7.4 peaked at 1.4G, then the growth stopped and it successfully downloaded the remainder of the photos (~6GB, mostly the final "Auto Backup" album).

That said, 1.4GB RSS feels a bit high for 5 concurrent download sessions.

I'll try again with fresh folders for go1.8rc1 and go1.7.4 to see if there is a difference.

mholt commented 7 years ago

Interesting, I never saw more than about 50-100 MB usage, even for my entire photo library (~16 GB) which was downloaded in a single run and with 20 concurrent downloads. How big is your library in total (either item count or size on disk)? Thanks for helping look into this.

Anyway, each item is streamed directly to disk, so I think it is actually quite memory-efficient (I would expect -prune to use more memory, however).
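
Roughly like this, ignoring the EXIF handling (a simplified sketch with illustrative names, not the exact photobak code):

```go
import (
	"io"
	"os"
)

// saveItem: simplified sketch of streaming a download straight to disk.
func saveItem(body io.Reader, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// io.Copy moves the data through a small fixed-size buffer, so memory
	// per item stays roughly constant no matter how large the file is.
	_, err = io.Copy(f, body)
	return err
}
```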

mpx commented 7 years ago

The execution above that completed successfully (no OOM) resulted in a 57GB archive on disk. I restarted it, and it found more photos to download. I probably have ~5GB of photos from Auto Backup, and potentially up to 300G more of uploads from Google Drive. I'm not sure if the Picasa API will find all the uploaded photos, or just the ones that Google Photos has yanked into automatic albums.

The next run on a fresh repo with go1.7.4 resulted in an OOM:

[100977.545441] Out of memory: Kill process 30812 (photobak) score 664 or sacrifice child
[100977.545453] Killed process 30812 (photobak) total-vm:29462316kB, anon-rss:13448096kB, file-rss:0kB, shmem-rss:0kB
[100978.377399] oom_reaper: reaped process 30812 (photobak), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Command terminated by signal 9
	Command being timed: "photobak -googlephotos USER@gmail.com -repo DIR -v"
	User time (seconds): 126.74
	System time (seconds): 76.51
	Percent of CPU this job got: 7%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 45:06.84
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 13477016
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 42420
	Minor (reclaiming a frame) page faults: 5631370
	Voluntary context switches: 14437963
	Involuntary context switches: 1655684
	Swaps: 0
	File system inputs: 1875664
	File system outputs: 46509672
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

I had a brief read through the code; nothing immediately stood out that would eat memory in photobak itself. I'd probably look at the interactions with "exif" next. If exif.Decode doesn't return, the memory (and goroutine) will stay pinned. I have a lot of NEFs, which would decode as TIFF within exif.Decode; that might be a different code path compared to your experience. It would also explain why memory didn't increase when I hit "Auto Backup".

mholt commented 7 years ago

@mpx Thanks, this is immensely helpful. Indeed, I don't have NEF files. Dang, I would also suspect the exif package at first. Are you familiar with building Go programs, then? Can you run a memory profile (it doesn't have to be for all your GBs of data, just enough to see memory going up)? I'm swamped with some other things right now, but if you want to get to the bottom of this, it would really help. :+1: Once we know where the leak is I can devote some minutes/hours to fixing it.
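
One low-effort way to collect those profiles, assuming nothing is built in for this, is to expose the standard net/http/pprof endpoints in a test build (sketch only, not part of photobak):

```go
import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func init() {
	// Serve the profiling endpoints on localhost while the backup runs.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}
```

Then, while RSS is climbing, something like `go tool pprof -pdf ./photobak http://localhost:6060/debug/pprof/heap > heap.pdf` (and the same with /debug/pprof/goroutine) should show where the memory and goroutines are sitting.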

mpx commented 7 years ago

Just had a look at the exif code:

exif.Decode has two main paths which can read the entire stream into memory (until EOF):

1) TIFF: ioutil.ReadAll before processing further
2) non-TIFF: scans until it reaches a JPEG APP1 header

The call to exif.Decode processes an io.Pipe reader, and it looks like the pipe reader is never closed before exif.Decode returns. I suspect (a rough sketch of what I mean is below):

1) TIFF files will block indefinitely unless they generate an error
2) non-TIFF files that don't contain something resembling an APP1 header will block as well

Perhaps large video files are an extreme case of (2)?
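
To make case (1) concrete, here is a tiny standalone illustration of the blocking (not photobak's code; the Write stands in for the tee of a downloaded item, and case (2) would behave similarly once the decoder keeps reading past the available data):

```go
package main

import (
	"io"
	"io/ioutil"
	"log"
	"time"
)

func main() {
	pr, pw := io.Pipe()

	// Stands in for the exif.Decode goroutine: on the TIFF path it
	// effectively does ioutil.ReadAll(pr), buffering until EOF or error.
	go func() {
		buf, err := ioutil.ReadAll(pr)
		log.Printf("decoder returned: %d bytes, err=%v", len(buf), err)
	}()

	// Stands in for the tee of a downloaded item into the pipe.
	pw.Write(make([]byte, 20<<20)) // a ~20 MB "photo"

	// Without pw.Close() here, ReadAll never sees EOF: the goroutine and
	// the ~20 MB it buffered stay pinned for the life of the process.
	// pw.Close()

	time.Sleep(2 * time.Second)
	log.Println("main exiting; the decoder goroutine never returned")
}
```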

Also, it looks like there may be a race in downloadAndSaveItem(): "var x *exif.Exif" is used across two goroutines (the "x, _ = exif.Decode" goroutine and the main path) without any explicit synchronisation. Perhaps it happens implicitly; I haven't thought it through or tried the race detector.

The following profiles show the goroutine/memory leaks:

- photobak-goroutines.pdf
- photobak-inuse-space.pdf

mholt commented 7 years ago

@mpx Thanks for the profiles! Indeed the leak occurs in exif.Decode.

I have run photobak with the race detector dozens of times and never found a race condition; the synchronization on x is at least implicit, because of the blocking I/O.

Unfortunately I have a critical issue in another project occupying my time right at the moment. This one is next on my list, but if you want to look into a fix before I get around to it, I'd be happy and able to review a pull request!

(Note: That exif library seems abandoned; if you find a fix by altering that library, we can look into vendoring it.)
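
For what it's worth, one direction that wouldn't require touching the exif library might be to close the pipe from the download side once the copy finishes and keep draining it after Decode returns, so the decoder goroutine is always able to exit. A rough, untested sketch; the tee structure is my guess, only the io.Pipe usage is confirmed from the code, and the import path assumes the goexif package:

```go
import (
	"io"
	"io/ioutil"

	"github.com/rwcarlsen/goexif/exif" // assumed exif package
)

// saveWithExif: rough, untested sketch with illustrative names; the exact
// shape of downloadAndSaveItem may differ.
func saveWithExif(body io.Reader, out io.Writer) (*exif.Exif, error) {
	pr, pw := io.Pipe()

	exifCh := make(chan *exif.Exif, 1)
	go func() {
		x, _ := exif.Decode(pr)
		// Keep draining so the tee writer never blocks if Decode
		// returned before consuming the whole stream.
		io.Copy(ioutil.Discard, pr)
		exifCh <- x
	}()

	// Stream the item to disk while teeing a copy into the pipe.
	_, err := io.Copy(out, io.TeeReader(body, pw))

	// Signal EOF so the decoder's ReadAll (TIFF path) and the drain above
	// can finish; the goroutine is then guaranteed to exit.
	pw.Close()

	// Receiving on the channel also makes the hand-off of x explicit.
	return <-exifCh, err
}
```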

mholt commented 7 years ago

The buffer allocated in the exif library could perhaps be obtained from a pool using sync.Pool; maybe that would improve performance?
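
Something along these lines, maybe (an untested, generic sketch of the pattern rather than the exif library's actual internals; it would only reduce allocation churn, and only once the buffers are actually being released rather than held by blocked goroutines):

```go
import (
	"bytes"
	"io"
	"sync"
)

var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// processItem: borrow a buffer from the pool, fill it from the reader, do
// all the parsing while it is held, then return it for the next item.
func processItem(r io.Reader, parse func([]byte) error) error {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)

	if _, err := buf.ReadFrom(r); err != nil {
		return err
	}
	// parse must not retain buf.Bytes() after returning, since the buffer
	// goes back into the pool and may be reused immediately.
	return parse(buf.Bytes())
}
```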

mholt commented 5 years ago

This project has been deprecated in favor of Timeliner. Thank you!