gographics / imagick

Go binding to ImageMagick's MagickWand C API
https://godoc.org/github.com/gographics/imagick/imagick

Issues with ReadImageBlob hanging #96

Closed amlwwalker closed 7 years ago

amlwwalker commented 7 years ago

I've already posted this on Stack Overflow, so I'd rather not repost it all here, but I wondered if anyone had any advice: http://stackoverflow.com/questions/39644215/imagemagick-go-api-http-hangs-on-readimageblob

Thanks

justinfx commented 7 years ago

Don't init and terminate ImageMagick in every request handler. Do it once for your whole app, as stated in the ImageMagick official API docs, and previous issues:

#88, #64

eriken commented 7 years ago

I am having a very similar problem. The API call to reading an image blob hangs indefinitely:

0x525cac gopkg.in/gographics/imagick.v2/imagick._Cfunc_MagickReadImageBlob+0x4c ??:0

0x549866 gopkg.in/gographics/imagick.v2/imagick.(*MagickWand).ReadImageBlob+0x166 /work/src/gopkg.in/gographics/imagick.v2/imagick/magick_wand_image.go:2202

I am only initializing imagick once, in the main function, and I let the GC handle the destruction of MagickWands; I only create them with imagick.NewMagickWand().

The program processes images concurrently, but as far as I have read, the MagickWand API is supposed to be thread-safe?

What makes this very confusing is that it hangs randomly; it is not tied to certain images.

I have tried to compile with and without OpenMP, without any effect.

Here is my Dockerfile: https://gist.github.com/Eriken/21f2af55d9f86da85c350e5b4d6f8c41

Any help at all would be greatly appreciated.

justinfx commented 7 years ago

Are you tracking #64? Recent updates have suggested it is happening with a particular format, and within a Docker image. Are you also seeing this only for a particular format, versus another that never hangs? It makes me wonder if it is related to a particular delegate plugin.

As an example, in my own usage of these bindings, I only ever generated jpg/png outputs and never saw a hang or crash like this. What formats are you using?

Are you able to use strace on the process to find out what it is doing at the point when you see it hanging? Do you have a debugger you can use, like gdb or Delve, to see what it is up to?

justinfx commented 7 years ago

A similar issue was posted here: http://imagemagick.org/discourse-server/viewtopic.php?t=23365

Apparently you can set MAGICK_DEBUG to trace and see debug output from ImageMagick. I wonder if it is hanging like in that link, or if it's looping on something in the C library.

eriken commented 7 years ago

This is a production environment that processes roughly 2-3 thousand images a minute on average. We process PNG and JPEG.

I have two different branches running at the moment: one where I let the Go GC collect the MagickWands, and one where I destroy them explicitly. The former has not hung at all for almost a week now, while the latter has been restarted 3-4 times. I am not sure whether this is a coincidence. Memory usage is also smaller on the former. I cannot really explain why this would have an effect, but it is the current state nonetheless.

As far as delegates go, I think you have a valid point. I have spent some time making sure we do not feed ImageMagick any image that does not yield the correct MIME type (i.e. JPEG and PNG). Before applying this sanity check, I had huge memory issues due to some delegate kicking in for what was not an image file after all. I have also set up policy.xml with as few permissions as possible.

Regards, Erik

eriken commented 7 years ago

Sadly, the hanging occurred on the former branch as well. Debugging this is very difficult, as reproducing it is almost impossible because the wands do not hang consistently on the same files.

Our distributed image processing service is deployed as Docker containers; there are no environment differences apart from the hardware.

justinfx commented 7 years ago

When these hangs occur, what information are you able to gather? Are you running debug output? Can you strace a hung process to see what it is doing or use gdb?

rogierlommers commented 7 years ago

I'm still working on this strange behaviour. Please see below the moment it crashes. The first piece is the output from our Go application; the second piece is the strace -ffTv output from the process at the same time.

fatal error: unexpected signal during runtime execution
[signal 0xb code=0x1 addr=0x40 pc=0x7f13d82cd890]
runtime stack:
runtime.throw(0xaaefc0, 0x2a)
    /usr/local/go/src/runtime/panic.go:547 +0x90
runtime.sigpanic()
    /usr/local/go/src/runtime/sigpanic_unix.go:12 +0x5a
goroutine 97 [syscall, locked to thread]:
runtime.cgocall(0x8355f0, 0xc820138c80, 0xc800000000)
    /usr/local/go/src/runtime/cgocall.go:123 +0x11b fp=0xc820138c30 sp=0xc820138c00
stash.tools.bol.com/mir/escher/vendor/gopkg.in/gographics/imagick.v2/imagick._Cfunc_MagickReadImageBlob(0x7f13cc000b50, 0xc820932000, 0x1a7a, 0x0)
    ??:0 +0x41 fp=0xc820138c80 sp=0xc820138c30
stash.tools.bol.com/mir/escher/vendor/gopkg.in/gographics/imagick.v2/imagick.(*MagickWand).ReadImageBlob(0xc8201eb560, 0xc820932000, 0x1a7a, 0x1a7c, 0x0, 0x0)
[pid 29108] futex(0xc820020908, FUTEX_WAKE, 1 <unfinished ...>
[pid 29105] <... select resumed> )      = 0 (Timeout) <0.000632>
[pid 29108] <... futex resumed> )       = 1 <0.000090>
[pid 29106] <... futex resumed> )       = 0 <0.017200>
[pid 29105] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 29108] write(2, "time=\"2016-11-11T08:17:29-05:00\""..., 136 <unfinished ...>
[pid 29106] epoll_wait(5,  <unfinished ...>
[pid 29105] <... select resumed> )      = 0 (Timeout) <0.001861>
[pid 29108] <... write resumed> )       = 136 <0.000358>
[pid 29106] <... epoll_wait resumed> {}, 128, 0) = 0 <0.000352>
[pid 29105] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 29108] write(2, "time=\"2016-11-11T08:17:29-05:00\""..., 77 <unfinished ...>
[pid 29106] epoll_wait(5,  <unfinished ...>
[pid 29108] <... write resumed> )       = 77 <0.000103>
[pid 29105] <... select resumed> )      = 0 (Timeout) <0.000938>
[pid 29108] write(2, "time=\"2016-11-11T08:17:29-05:00\""..., 76 <unfinished ...>
[pid 29105] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 29108] <... write resumed> )       = 76 <0.000946>
[pid 29108] write(2, "time=\"2016-11-11T08:17:29-05:00\""..., 73 <unfinished ...>
[pid 29105] <... select resumed> )      = 0 (Timeout) <0.001043>
[pid 29108] <... write resumed> )       = 73 <0.000153>
[pid 29105] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 29108] write(2, "time=\"2016-11-11T08:17:29-05:00\""..., 98 <unfinished ...>
[pid 29105] <... select resumed> )      = 0 (Timeout) <0.000176>
[pid 29108] <... write resumed> )       = 98 <0.000245>
[pid 29105] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 29108] times({tms_utime=224, tms_stime=49, tms_cutime=0, tms_cstime=0}) = 1243115686 <0.000017>
[pid 29105] <... select resumed> )      = 0 (Timeout) <0.000263>
[pid 29108] times( <unfinished ...>
[pid 29105] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 29108] <... times resumed> {tms_utime=224, tms_stime=49, tms_cutime=0, tms_cstime=0}) = 1243115686 <0.000027>
[pid 29108] times( <unfinished ...>
[pid 29105] <... select resumed> )      = 0 (Timeout) <0.000133>
[pid 29108] <... times resumed> {tms_utime=225, tms_stime=49, tms_cutime=0, tms_cstime=0}) = 1243115686 <0.000023>
[pid 29105] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 29108] times({tms_utime=225, tms_stime=49, tms_cutime=0, tms_cstime=0}) = 1243115686 <0.000024>
[pid 29105] <... select resumed> )      = 0 (Timeout) <0.000099>
[pid 29108] times( <unfinished ...>
[pid 29105] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 29108] <... times resumed> {tms_utime=225, tms_stime=49, tms_cutime=0, tms_cstime=0}) = 1243115686 <0.000028>
[pid 29108] times({tms_utime=225, tms_stime=49, tms_cutime=0, tms_cstime=0}) = 1243115686 <0.000024>
[pid 29108] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x40} ---
[pid 29105] <... select resumed> )      = 0 (Timeout) <0.000471>
[pid 29108] rt_sigreturn( <unfinished ...>
[pid 29105] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 29108] <... rt_sigreturn resumed> ) = 139723207923856 <0.000037>
[pid 29108] write(2, "fatal error: ", 13) = 13 <0.000087>
[pid 29105] <... select resumed> )      = 0 (Timeout) <0.000164>
[pid 29108] write(2, "unexpected signal during runtime"..., 42 <unfinished ...>
[pid 29105] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 29108] <... write resumed> )       = 42 <0.000072>
[pid 29108] write(2, "\n", 1 <unfinished ...>
[pid 29105] <... select resumed> )      = 0 (Timeout) <0.000098>
[pid 29108] <... write resumed> )       = 1 <0.000114>
[pid 29105] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 29108] select(0, NULL, NULL, NULL, {0, 1000} <unfinished ...>
[pid 29105] <... select resumed> )      = 0 (Timeout) <0.000508>
[pid 29105] futex(0xef9f30, FUTEX_WAIT, 0, {60, 0} <unfinished ...>
[pid 29108] <... select resumed> )      = 0 (Timeout) <0.001170>
[pid 29108] select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) <0.001120>
[pid 29108] write(2, "[signal ", 8)     = 8 <0.000099>
[pid 29108] write(2, "0xb", 3)          = 3 <0.000114>
[pid 29108] write(2, " code=", 6)       = 6 <0.000138>
[pid 29108] write(2, "0x1", 3)          = 3 <0.000065>
[pid 29108] write(2, " addr=", 6)       = 6 <0.000114>
[pid 29108] write(2, "0x40", 4)         = 4 <0.000095>
[pid 29108] write(2, " pc=", 4)         = 4 <0.000125>
[pid 29108] write(2, "0x7f13d82cd890", 14) = 14 <0.000099>
[pid 29108] write(2, "]\n", 2)          = 2 <0.000227>
[pid 29108] write(2, "\nruntime stack:\n", 16) = 16 <0.000976>
[pid 29108] write(2, "runtime.throw", 13) = 13 <0.000440>
[pid 29108] write(2, "(", 1)            = 1 <0.000109>
[pid 29108] write(2, "0xaaefc0", 8)     = 8 <0.000120>
[pid 29108] write(2, ", ", 2)           = 2 <0.000209>
[pid 29108] write(2, "0x2a", 4)         = 4 <0.000129>
[pid 29108] write(2, ")\n", 2)          = 2 <0.000762>
[pid 29108] write(2, "\t", 1)           = 1 <0.000376>
[pid 29108] write(2, "/usr/local/go/src/runtime/panic."..., 34) = 34 <0.000154>
[pid 29108] write(2, ":", 1)            = 1 <0.000198>
[pid 29108] write(2, "547", 3)          = 3 <0.000162>
[pid 29108] write(2, " +", 2)           = 2 <0.000109>
[pid 29108] write(2, "0x90", 4)         = 4 <0.000122>
[pid 29108] write(2, "\n", 1)           = 1 <0.000181>
[pid 29108] write(2, "runtime.sigpanic", 16) = 16 <0.000736>
[pid 29108] write(2, "(", 1)            = 1 <0.000184>
[pid 29108] write(2, ")\n", 2)          = 2 <0.000629>
[pid 29108] write(2, "\t", 1)           = 1 <0.000204>
[pid 29108] write(2, "/usr/local/go/src/runtime/sigpan"..., 42) = 42 <0.000109>
[pid 29108] write(2, ":", 1)            = 1 <0.000106>
[pid 29108] write(2, "12", 2)           = 2 <0.000102>
[pid 29108] write(2, " +", 2)           = 2 <0.000108>
[pid 29108] write(2, "0x5a", 4)         = 4 <0.000173>
[pid 29108] write(2, "\n", 1)           = 1 <0.000187>
[pid 29108] write(2, "\n", 1)           = 1 <0.000795>
[pid 29108] write(2, "goroutine ", 10)  = 10 <0.000028>
[pid 29108] write(2, "97", 2)           = 2 <0.000133>
[pid 29108] write(2, " [", 2)           = 2 <0.000164>
[pid 29108] write(2, "syscall", 7)      = 7 <0.000121>
[pid 29108] write(2, ", locked to thread", 18) = 18 <0.000172>
[pid 29108] write(2, "]:\n", 3)         = 3 <0.000136>
[pid 29108] write(2, "runtime.cgocall", 15) = 15 <0.000707>
[pid 29108] write(2, "(", 1)            = 1 <0.000105>
[pid 29108] write(2, "0x8355f0", 8)     = 8 <0.000101>
[pid 29108] write(2, ", ", 2)           = 2 <0.000115>
[pid 29108] write(2, "0xc820138c80", 12) = 12 <0.000151>
[pid 29108] write(2, ", ", 2)           = 2 <0.000172>
[pid 29108] write(2, "0xc800000000", 12) = 12 <0.000122>
[pid 29108] write(2, ")\n", 2)          = 2 <0.000652>
[pid 29108] write(2, "\t", 1)           = 1 <0.000306>
[pid 29108] write(2, "/usr/local/go/src/runtime/cgocal"..., 36) = 36 <0.000162>
[pid 29108] write(2, ":", 1)            = 1 <0.000111>

rogierlommers commented 7 years ago

Now this indicates something isn't quite thread-safe:

[pid 29108] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x40} --- <-- seg fault

Also, please note that in our case the crashes occur randomly. However, I have a feeling that working with PDF and JPEG files in parallel reduces the time until the crash.

justinfx commented 7 years ago

I wish we could see into the C layer of the crash. This output only manages to show it up to the cgo layer. I wonder if Delve or gdb could show where it really crashes within ImageMagick?

As for your implementation that has the problem, does it create wands per HTTP request? Or is this boiled down to a single goroutine processing images serially? In my own usage of the bindings, my image processing server used a fixed pool of workers that fulfilled image requests over a channel. I was also going to ask whether runtime.LockOSThread() in the worker goroutines helps.

justinfx commented 7 years ago

Actually, I guess LockOSThread won't help, since this only happens on a single call into C.

rogierlommers commented 7 years ago

Could this be related to the fact that you're using the unsafe package? https://github.com/gographics/imagick/blob/master/imagick/magick_wand_image.go#L2202

I will get back on your other questions.

justinfx commented 7 years ago

You kind of need the unsafe package when using cgo. It's the only way to pass pointers into C, to allocate and free memory, or to create and free C strings.

The line you refer to is passing the address of the first element of the byte slice. The cgo documentation says that this pointer refers to the backing array for the whole slice: https://golang.org/cmd/cgo/#hdr-Passing_pointers

Another example here https://www.google.co.nz/amp/s/amp.reddit.com/r/golang/comments/29w5ru/passing_arraysslices_as_pointers_to_c_functions/

Now whether ImageMagick is holding onto that pointer and trying to access it again later... I hope not.

rogierlommers commented 7 years ago

Please have a look at how we use the MagickWand. A request can have multiple operations which need to be applied to the original image, for example StripAlphaChannel, MergeLayers, Format, and Trim. For each request, all operations run in series: when StripAlphaChannel is finished, MergeLayers starts.

A typical operation call looks like this:

err = j.Format(opt)
if err != nil {
    logrus.Error(err)
    return
}

This is what the operation "Format" looks like:

func (j *jpeg) Format(opt RenderOptions) error {
    wand, err := j.toWand(opt.DebugInfo)
    defer j.doneWithMagickWand(wand)
    if err != nil {
        logrus.Errorf("Error while setting format: %s", err)
        return err
    }

    wand.SetImageFormat("JPEG")
    return nil
}

Getting a wand for a specific operation:

func (i *Image) toWand(debugInfo string) (w *imagick.MagickWand, err error) {
    w = imagick.NewMagickWand()
    err = w.ReadImageBlob(i.blob)
    if err != nil {
        logrus.Errorf("Error while reading blob to wand: %s", err)
        return
    }

    w.ResetIterator()
    return
}

When the operation is finished, the defer will call this:

func (i *Image) doneWithMagickWand(w *imagick.MagickWand) {
    w.ResetIterator()
    i.blob = w.GetImagesBlob()
    w.Destroy()
    return
}

rogierlommers commented 7 years ago

I'm now working on a very small Go project that illustrates the problem, together with a Dockerfile, so you can easily run it on your machine.

justinfx commented 7 years ago

Cool. And are we still at the conclusion that it only happens in a Docker environment and not in a standard host environment?

rogierlommers commented 7 years ago

Well, not entirely sure. I haven't seen it crash outside a Docker container, but I'm also not convinced that Docker alone is to blame here :). Anyhow, this Monday I will provide an update with an easy way to reproduce it.

eriken commented 7 years ago

Any updates here? I encountered an issue due to an old libjpeg library: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=768369

I had hopes this might have been related to the hanging, but sadly no.

eriken commented 7 years ago

In my former Dockerfile I used the package libmagickwand-dev. This is quite a bloated package, packed with many outdated libraries, yet most MagickWand bindings recommend installing it.

I made a new Dockerfile using Ubuntu yakkety: https://gist.github.com/Eriken/85b85cc4fdbe74e4369e994d20b00bc1

As I only decode JPEG and PNG, I only really need libpng and libjpeg. I assume MagickWand uses these when it reads the blobs?

The ImageMagick version that I compile has all the files I need to build the bindings.

All our nodes are running this image now, and the bug I referred to in my former post is gone. I have also had no hanging/stuck (ReadImageBlob) nodes within the last 48 hours, during which we have processed roughly 5 million images.

rogierlommers commented 7 years ago

Thanks; useful information. May I ask why you are installing Go on the server where your binary runs?

eriken commented 7 years ago

We have not set up a build server for Go as of yet. The Dockerfile I linked to is the base image for the service; on my service image I then build the binary by fetching the source files through a git tag. Works quite well.

How are you guys progressing?

rogierlommers commented 7 years ago

Well, our application crashes while running in a Docker container. I now have a very simplified version of my app in a GitHub repository, with the purpose of quickly reproducing the problem. The last thing before pushing it to GitHub (and posting here) is to have a quick look with a colleague of mine. I'll keep you posted here.

rogierlommers commented 7 years ago

Well, I have created a reproduction. Here it is:

To reproduce the problem, first run a container and forward port 8080:

docker run -p 8080:8080 rogierlommers/imagick-issue

output should be something like this:

rlommers@macbook-rlommers [~]$ docker run -p 8080:8080 rogierlommers/imagick-issue
time="2016-12-08T21:22:46Z" level=info msg="listening on port 8080"

The next thing is to run the Python test script, which can be found in the "test" directory of the repository. Simply run python test.py. This will fire requests at the running webservice, which uses imagick. You can find the sources in the GitHub repository mentioned above.

While this process is running, it is useful to enter the running container. Use docker ps to find the CONTAINER ID and enter the container by typing docker exec -it e6e... /bin/bash. You'll get a bash session inside the container. After a while, you'll notice that everything inside the container becomes laggy and slow. The output of top shows that a gs process is deadlocked with another gs instance, all spawned by ImageMagick, which was spawned by imagick.

[screenshot: eventually it dies]

I really hope you guys can help me with this.

justinfx commented 7 years ago

I have a theory. I assume you have 8 cores in this machine and GOMAXPROCS is ending up set to 8 (judging by the number of gs processes; I see 9 of them).

Each cgo call consumes an OS thread. If ghostscript takes a while and you get enough PDF requests, all 8 of your available OS threads end up busy. Then a 9th wants to come in, and you lock up because no more goroutines are allowed to start while the others wait.

I had no idea how the PDF delegate worked until you showed me this; I see it basically relies on calling ghostscript as an external process. I also see that you are running an unbounded number of image processes, if I am reading your code correctly: you stack up and try to queue as many image conversions as there are requests. Have you tried putting a semaphore around the actual ImageMagick processing? Maybe try setting it to GOMAXPROCS-1, with a minimum of 1? That way you can be sure that too many image processors never try to start.

Again... all a theory.

rogierlommers commented 7 years ago

It seems GhostScript isn't thread-safe either. Please see this: http://stackoverflow.com/questions/15079448/where-to-get-a-thread-safe-ghostscript-compilation. I have now added a limiter on the number of simultaneously handled requests. Please see my repo.

When I set the number of workers to 1, it still crashes, and I can't figure out why, since I'm handling one request at a time.

justinfx commented 7 years ago

I can repro this in your docker container. It takes well over 10 minutes before I see it.

I really am feeling like this is specific to GhostScript, as an external delegate in ImageMagick. And I don't know how thread safety in ghostscript would really play into this, since we are talking about external processes being launched by ImageMagick. The only way thread safety could come into play here would be at the ImageMagick layer, which is responsible for spawning the external gs process. If that delegate uses any kind of shared state, then that could be a thread-safety issue, but I haven't tried looking at the PostScript delegate code in ImageMagick.

Now, my previous theory about exhausting hardware threads still stands. Based on how cgo works, if the call into C (ImageMagick) ends up hanging for a while (while ImageMagick is blocked on io, waiting for the spawned gs process to finish) and you are running unbounded goroutines that are doing image processing, then you could hang up your app and not be able to service any more goroutines until something from ImageMagick finishes.

In your example you have wrapped a semaphore around the entire http request. I would suggest as a real solution to move that down to the level of where you are actually doing the image processes, such as around the Render(). That will let the rest of your operations still happen concurrently but just require a limit around image processing.

Also, the approach you have looks really inefficient to me: you constantly read the blob into a wand and back again for every image operation you want to do. I may be wrong in my understanding of what ImageMagick is doing, but the ReadImageBlob that you do in each step most likely triggers an external ghostscript process, so I would imagine your Render has to call gs 6 times. I haven't checked this, though.

justinfx commented 7 years ago

I think I have narrowed down the problem. When I strace the Go app after it is in a broken state, I find that it is spitting out ENOMEM. I checked the app's memory usage and it had grown to the limit of the container, to the point where not enough could be allocated to spawn more ghostscript PDF processes. About 1.5G was in use by the Go app, and it was starting to swap:

[pid    47] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fa431ffb9d0) = -1 ENOMEM (Cannot allocate memory) <0.008386>
[pid    47] rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7fa43d3a8860}, {0x560c98d39910, ~[KILL STOP], SA_RESTORER|SA_STACK|SA_RESTART|SA_SIGINFO, 0x7fa43d3a8860}, 8) = 0 <0.000020>
[pid    47] rt_sigaction(SIGQUIT, {SIG_IGN, [], SA_RESTORER, 0x7fa43d3a8860}, {0x560c98d39910, ~[KILL STOP], SA_RESTORER|SA_STACK|SA_RESTART|SA_SIGINFO, 0x7fa43d3a8860}, 8) = 0 <0.000439>
[pid    47] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 <0.000252>
[pid    47] clone(child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD, parent_tidptr=0x7fa431fed8fc) = -1 ENOMEM (Cannot allocate memory) <0.001421>

When I tried sending the JPEG payload instead, the app would peak/sit around 450MB and never fail, even after two runs. Also, I don't think the JPEG delegate spawns processes, since it all goes through libjpeg.

So, somewhere in either the user code of the app, imagick, or the ImageMagick/ghostscript delegate there might be a memory leak?

justinfx commented 7 years ago

I've played a bit more with your source code, and after expanding all the image processing steps inline within the Render() method, and getting rid of all the toWand()/doneWithMagickWand() logic, I find that the test runs without the app growing much beyond 100MB, dropping back to 75MB and growing again.

I would recommend you take a closer look at your current implementation and see if you can confirm the same. Something in the current approach is causing a leak, but it goes away when you read the blob once, do all the processing on the wand, and get the final blob at the end. Maybe you want to refactor your code to pass one wand through all your steps instead of each step creating a new wand and destroying it?

rogierlommers commented 7 years ago

I've played a bit more with your source code, and after expanding all the image processing steps to be inline, within the Render() method, and getting rid of all the toWand()/doneWithMagickWand() logic, I am finding that the test runs without the app growing much more than 100mb, dropping back to 75mb and growing up again. I would recommend you take a closer look at you current implementation and see if you can confirm the same. Something is causing a leak in the current approach, but it goes away when you read the blob once and do all the processing on the wand, and get the final blob at the end. Maybe you want to refactor your code to pass a wand to all your steps instead of each step creating a new wand and destroying it?

Well, it looks like we are finally stable. Let me explain our findings, so others can benefit as well. We had a couple of problems, not all related to each other.

  1. At our company, we use Red Hat. Unfortunately, RHEL comes with very old versions of ImageMagick and its dependencies. In our case, the service using gographics converts PDF files to JPEG files. This all started when we saw segmentation faults during the PDF-->JPEG conversion. Some investigation showed us that the PDF we were trying to convert contained a PNG. What happened was that gographics was running ImageMagick, which delegates into libpng, but this version of libpng (1.5.13) was not thread-safe. Everything crashed/panicked and our complete binary went down.
  2. After upgrading to a newer version (1.6.26), this problem was solved. So far so good.
  3. But then we had GhostScript creating deadlocks. See my post here. @justinfx then gave us some tips on how to optimise this. In detail: each request for a rendition resulted in a couple of operations; e.g. a request typically contains a scale(), format(), mergeLayers(), etc. Each of these operations was performing readBytesToWand, doOperation() and closeWand() (in this order). This resulted in lots of delegations to the gs process, which caused some deadlocks. This was solved by simply removing that logic: now we create one MagickWand, do all the image processing, and then close the MagickWand.
  4. The next problem was making everything work on the distribution (RHEL) we use. It turned out that when you compile ImageMagick yourself, you need to explicitly tell the build to skip gslib; see the snippet below. This flag prevents ImageMagick from trying to use both gs and gslib.
  5. We introduced a rate limiter: we now only accept n requests at a time, using a worker pattern.
    tar xf ImageMagick-6.8.9-10.tar.xz && \
    cd ImageMagick-6.8.9-10 && \
        CXXFLAGS="-Wformat -Wformat-security -Wall -Wuninitialized -ffast-math -m64 -Ofast -funroll-loops -g -pthread" \
        ./configure --prefix=/usr --without-x --without-modules --without-gslib --with-magick_plus_plus --with-perl --with-webp && \
        make -j"$(nproc)" && \
        make install

So, two conclusions:

justinfx commented 7 years ago

Sounds good. I guess the moral of the story is to use updated libs and build ImageMagick from source if you have an old version in your package manager :-)

sarovin commented 4 years ago

I've played a bit more with Docker Alpine, and this is my conclusion:

This does not work:

FROM alpine:latest

ARG MAGICK_VERSION=7.0.9-19

RUN apk --update add jpeg-dev libpng-dev librsvg-dev
RUN apk add --no-cache --virtual build-dependencies autoconf wget g++ gcc make tar file

RUN cd /tmp && wget https://github.com/ImageMagick/ImageMagick/archive/${MAGICK_VERSION}.tar.gz -O imagemagick.tar.gz && \
  tar -zxf imagemagick.tar.gz && \
  cd ImageMagick-${MAGICK_VERSION} && \
  ./configure \
      --without-magick-plus-plus \
      --without-perl \
      --without-jbig \
      --without-lcms \
      --without-tiff \
      --without-wmf \
      --without-xml \
      --without-zlib \
      --without-x \
      --with-gvc=no \
      --disable-openmp \
      --disable-docs && \
  make -j$(nproc) && make install && \
  ldconfig /usr/local/lib && \
  cd .. && rm -f imagemagick.tar.gz && rm -rf ImageMagick-${MAGICK_VERSION}

RUN apk del build-dependencies

This works fine:

FROM alpine:latest

ARG MAGICK_VERSION=7.0.9-19

RUN apk --update add jpeg-dev libpng-dev librsvg-dev
RUN apk add --no-cache --virtual build-dependencies autoconf wget g++ gcc make tar file

RUN cd /tmp && wget https://github.com/ImageMagick/ImageMagick/archive/${MAGICK_VERSION}.tar.gz -O imagemagick.tar.gz && \
  tar -zxf imagemagick.tar.gz && \
  cd ImageMagick-${MAGICK_VERSION} && \
  ./configure \
      --without-magick-plus-plus \
      --without-perl \
      --without-jbig \
      --without-lcms \
      --without-tiff \
      --without-wmf \
      --without-xml \
      --without-zlib \
      --without-x \
      --with-gvc=no \
      --disable-docs && \
  make -j$(nproc) && make install && \
  ldconfig /usr/local/lib && \
  cd .. && rm -f imagemagick.tar.gz && rm -rf ImageMagick-${MAGICK_VERSION}

RUN apk del build-dependencies

I have tried with this code:

...

err = mw.ThumbnailImage(w, h)
if err != nil {
  fmt.Println("IMAGICK THUMB FILE", err.Error())
  return err
}

err = mw.SetImageCompressionQuality(100)
if err != nil {
  fmt.Println("COMPRESSION FILE", err.Error())
  return err
}

err = mw.SetFormat(fileObj.MimeType)
if err != nil {
  fmt.Println("FORMAT FILE", err.Error())
  return err
}

err = mw.WriteImagesFile(inputFile) // ISSUES HANGING
if err != nil {
  fmt.Println("WRITE FILE", err.Error())
  return err
}

...

The problem is --disable-openmp on the Alpine image.

I have used this image: https://user-images.githubusercontent.com/5980398/73350097-b3f11f80-428c-11ea-8caa-960ee7a9af1c.png

justinfx commented 4 years ago

There have been other reported issues associated with using an Alpine Docker image and the musl compiler (like #168), although you are installing gcc. I wonder if it is using glibc or musl libc at runtime? I use an Ubuntu image for the Travis tests.

sarovin commented 4 years ago

@justinfx

I wonder if it is using glibc or musl libc at runtime?

How can I check this? For now I have switched to Debian, but the container image is bigger than the Alpine one, and I want to find a solution with Alpine.

justinfx commented 4 years ago

Running the ldd command against the compiled binary file would tell you what the runtime linker would be picking up.

sarovin commented 4 years ago

@justinfx This is the result:

Debian

ldd /usr/local/bin/magick
  linux-vdso.so.1 (0x00007ffe08bf9000)
  libMagickCore-7.Q16HDRI.so.7 => /usr/local/lib/libMagickCore-7.Q16HDRI.so.7 (0x00007f48f178c000)
  libMagickWand-7.Q16HDRI.so.7 => /usr/local/lib/libMagickWand-7.Q16HDRI.so.7 (0x00007f48f1681000)
  libjpeg.so.62 => /usr/lib/x86_64-linux-gnu/libjpeg.so.62 (0x00007f48f1418000)
  libpng16.so.16 => /usr/lib/x86_64-linux-gnu/libpng16.so.16 (0x00007f48f13df000)
  libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f48f11c1000)
  libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f48f103e000)
  libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f48f101b000)
  libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f48f0e5a000)
  /lib64/ld-linux-x86-64.so.2 (0x00007f48f1b0c000)

Alpine

ldd /usr/local/bin/magick
  /lib/ld-musl-x86_64.so.1 (0x7f8423694000)
  libMagickCore-7.Q16HDRI.so.7 => /usr/local/lib/libMagickCore-7.Q16HDRI.so.7 (0x7f84232fa000)
  libMagickWand-7.Q16HDRI.so.7 => /usr/local/lib/libMagickWand-7.Q16HDRI.so.7 (0x7f84231f2000)
  libc.musl-x86_64.so.1 => /lib/ld-musl-x86_64.so.1 (0x7f8423694000)
  libjpeg.so.8 => /usr/lib/libjpeg.so.8 (0x7f842315a000)
  libpng16.so.16 => /usr/lib/libpng16.so.16 (0x7f842312a000)
  libz.so.1 => /lib/libz.so.1 (0x7f8423110000)
  libgomp.so.1 => /usr/lib/libgomp.so.1 (0x7f84230e5000)
  libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x7f84230d1000)

Alpine with --disable-openmp

ldd /usr/local/bin/magick
  /lib/ld-musl-x86_64.so.1 (0x7f986871b000)
  libMagickCore-7.Q16HDRI.so.7 => /usr/local/lib/libMagickCore-7.Q16HDRI.so.7 (0x7f986839b000)
  libMagickWand-7.Q16HDRI.so.7 => /usr/local/lib/libMagickWand-7.Q16HDRI.so.7 (0x7f9868294000)
  libc.musl-x86_64.so.1 => /lib/ld-musl-x86_64.so.1 (0x7f986871b000)
  libjpeg.so.8 => /usr/lib/libjpeg.so.8 (0x7f98681fc000)
  libpng16.so.16 => /usr/lib/libpng16.so.16 (0x7f98681cc000)
  libz.so.1 => /lib/libz.so.1 (0x7f98681b2000)

justinfx commented 4 years ago

OK, so it's using the musl libc runtime on Alpine, which has proven to be hit-and-miss with the ImageMagick C libs. I guess you found compiler flags to work around your current hang.