saimn / sigal

yet another simple static gallery generator
http://sigal.saimon.org/
MIT License
878 stars 169 forks source link

[feature request] multiprocessing when collecting albums #353

Open thorsummoner opened 5 years ago

thorsummoner commented 5 years ago

I notice the collecting albums phase is done with a single threaded for os.walk() read every file loop; and on systems with a single hard drive that's likely a reasonable thing to do, though on systems that can benefit from threaded reads, this is something of a painpoint/bottleneck (for me, waiting to have the gallery generated in minutes, not hours)

I'm willing to work on a codechange, would such a code change be welcome?

Edit: a deferred (lazy loaded) file read would be a possibly excellent option, if the metadata acquired by the file read is only used in already-multiprocessing section of code, ~I did not evaluate very much of the code to make an informed architecture proposal like that~

Edit2: Looks like a similar read-all-the-files loop happens at the end of the multiprocessing phase too, I might suggest that all operations performed with the image data are done at once and within multiprocessing; as an aside, I am stumped on a easy way to execute sigal for only sub-portions of the content (given I am aware only of a negative exclude filter, seems like i'd have to split up the work into prefixes before executing sigal in parallel would help, and I'm just not eager to do that, if I had an include based filter I would be happy to ls | parallel sigal this for my own special needs, just not sure thats an option without writing a wrapper to generate a config, prepare a prefix, do the work, and reconcile the result afterward

saimn commented 5 years ago

I never thought that the collecting albums step would be a bottleneck, but sigal was build for small/medium galleries, and as I'm quite happy with the current performance so I never really tried to optimize more. Do you have some numbers to get an idea of the performance in your case?

Also, as far as I remember, the collecting albums step is not reading the images. Almost all the processing is done in the multiprocessing loop (resizing images, creating thumbnails), and then the writing step can trigger some reads if some information is missing depending on the theme and settings.

So I certainly welcome efforts to improve the performance, though this must be balanced with the code complexity and speedup gain.

thorsummoner commented 5 years ago

i totally respect the scope of a small to medium gallery, i don't want to push anything outside of the ambitions of this project. my frustrations led me to try profiling the code and all that really suggested is that the master thread which is the only one not sleeping was spending all it's time reading from disk. I'll admit that system stats when i collected them show 100% active time on the virtual disk, I'm going to try to identify more about that, it's a network volume so it's possible that moving the workload to a system with bear metal oips might help me a bit

meanwhile, i can say with confidence that gnu find can list the directory almost instantly, so there must be something more than just filesystem enumeration at play. by the strace logs the main thread was opening every file, reading 4096 bytes, and continuing to the next file. (as determined by the heuristics of me making educated guesses)

this leads me to suspect that something about the image library under your project is collecting data from the head of each file kind of thing; which is why i hypothetisize making the gallery collection mutiprocessed work could be useful for me.

i will play around some more, if i do propose any changes, I'll make an effort to gate any specialized code i write to require intentional activation via the argument parser. I find myself preferring a different code style, which makes me hesitant to get my hands dirty, i will also try to be a good contributor with adherence to make-like like-like for any changes proposed. i would really prefer not to make a change set drastic enough to require my own fork due to internation concerns, but i blather on

Thanks for the feedback! and thanks for the welcoming words!

On Nov 16, 2018 3:50 PM, "Simon Conseil" notifications@github.com wrote:

I never thought that the collecting albums step would be a bottleneck, but sigal was build for small/medium galleries, and as I'm quite happy with the current performance so I never really tried to optimize more. Do you have some numbers to get an idea of the performance in your case?

Also, as far as I remember, the collecting albums step is not reading the images. Almost all the processing is done in the multiprocessing loop (resizing images, creating thumbnails), and then the writing step can trigger some reads if some information is missing depending on the theme and settings.

So I certainly welcome efforts to improve the performance, though this must be balanced with the code complexity and speedup gain.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/saimn/sigal/issues/353#issuecomment-439563059, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQlkLPx9vZ-tpv4sCw3zlaV9ZOD4VO3ks5uv09AgaJpZM4YfY2E .

thorsummoner commented 5 years ago

just a note from the last successful run for me

Done. Processed 1 images (434390 skipped) and 0 videos in 9990.87 seconds (~3 hours)

time find Pictures/ >/dev/null real 0m0.422s user 0m0.188s sys 0m0.232s

On Fri, Nov 16, 2018, 5:01 PM Thorsummoner0 <thorsummoner0@gmail.com wrote:

i totally respect the scope of a small to medium gallery, i don't want to push anything outside of the ambitions of this project. my frustrations led me to try profiling the code and all that really suggested is that the master thread which is the only one not sleeping was spending all it's time reading from disk. I'll admit that system stats when i collected them show 100% active time on the virtual disk, I'm going to try to identify more about that, it's a network volume so it's possible that moving the workload to a system with bear metal oips might help me a bit

meanwhile, i can say with confidence that gnu find can list the directory almost instantly, so there must be something more than just filesystem enumeration at play. by the strace logs the main thread was opening every file, reading 4096 bytes, and continuing to the next file. (as determined by the heuristics of me making educated guesses)

this leads me to suspect that something about the image library under your project is collecting data from the head of each file kind of thing; which is why i hypothetisize making the gallery collection mutiprocessed work could be useful for me.

i will play around some more, if i do propose any changes, I'll make an effort to gate any specialized code i write to require intentional activation via the argument parser. I find myself preferring a different code style, which makes me hesitant to get my hands dirty, i will also try to be a good contributor with adherence to make-like like-like for any changes proposed. i would really prefer not to make a change set drastic enough to require my own fork due to internation concerns, but i blather on

Thanks for the feedback! and thanks for the welcoming words!

On Nov 16, 2018 3:50 PM, "Simon Conseil" notifications@github.com wrote:

I never thought that the collecting albums step would be a bottleneck, but sigal was build for small/medium galleries, and as I'm quite happy with the current performance so I never really tried to optimize more. Do you have some numbers to get an idea of the performance in your case?

Also, as far as I remember, the collecting albums step is not reading the images. Almost all the processing is done in the multiprocessing loop (resizing images, creating thumbnails), and then the writing step can trigger some reads if some information is missing depending on the theme and settings.

So I certainly welcome efforts to improve the performance, though this must be balanced with the code complexity and speedup gain.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/saimn/sigal/issues/353#issuecomment-439563059, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQlkLPx9vZ-tpv4sCw3zlaV9ZOD4VO3ks5uv09AgaJpZM4YfY2E .

Glandos commented 5 years ago

I met this issue. The collection of metadata is heavy. See https://github.com/saimn/sigal/blob/master/sigal/gallery.py#L192

@thorsummoner Try to put a return just before the try block. You will miss every metadata, but it should be way faster.

If you have a .md file for each media, you shouldn't experienced this. But otherwise, it reads IPTC data, and Pillow read the entire image in memory to do that, which is really expensive.

I tried to find a lightweight parser, but didn't find any for now. Maybe sigal should provide a switch for skipping IPTC data in the meantime.

Note that the performance hit could also be caused by something else…

saimn commented 5 years ago

Good catch @Glandos , I forgot about this change. We could probably just add a setting to deactivate this.