piotrmurach / tty-command

Execute shell commands with pretty output logging and capture stdout, stderr and exit status.
https://ttytoolkit.org
MIT License
400 stars 34 forks source link

fatal: No live threads left. Deadlock? #37

Closed jrochkind closed 6 years ago

jrochkind commented 6 years ago

Does TTY::Command use threads in some way? It looks like it does, and as odd as it seems, I think it may be responsible for some very strange and hard to reproduce fatal: No live threads left. Deadlock? exceptions I've started getting in my app since introducing TTY::Command.

Could it be causing a deadlock somehow? Any ideas?

The backtrace looks like:

[GEM_ROOT]/gems/tty-command-0.6.0/lib/tty/command/process_runner.rb:128 :in `join`
 [GEM_ROOT]/gems/tty-command-0.6.0/lib/tty/command/process_runner.rb:128 :in `block in read_streams`
 [GEM_ROOT]/gems/tty-command-0.6.0/lib/tty/command/process_runner.rb:127 :in `each`
 [GEM_ROOT]/gems/tty-command-0.6.0/lib/tty/command/process_runner.rb:127 :in `read_streams`
 [GEM_ROOT]/gems/tty-command-0.6.0/lib/tty/command/process_runner.rb:45 :in `run!`
 [GEM_ROOT]/gems/tty-command-0.6.0/lib/tty/command.rb:174 :in `block in execute_command`
 [GEM_ROOT]/gems/tty-command-0.6.0/lib/tty/command.rb:174 :in `synchronize`
 [GEM_ROOT]/gems/tty-command-0.6.0/lib/tty/command.rb:174 :in `execute_command`
 [GEM_ROOT]/gems/tty-command-0.6.0/lib/tty/command.rb:94 :in `run`
[PROJECT_ROOT]/app/services/chf/create_dzi_service.rb:123 :in `block in create_dzi!`
jrochkind commented 6 years ago

All my TTY::Commands are created with TTY::Command.new(printer: :null), in case that could be related. I think there is somehow a deadlock on those threads having to do with stdout/stderr/stdin.

piotrmurach commented 6 years ago

Hey Jonathan!

Thanks for reporting this. Correct, reading both the stdin and stdout output from child process is performed in threads. However, recently I've made a change to synchronize data writing which I would bank my money on is causing the issue. Do you have any examples I could run to reproduce the issue? Or do you fancy removing the lock and seeing if that fixes the issue?

jrochkind commented 6 years ago

I don't have any examples, because I can't reproduce it on demand myself. It happens non-predictably, I've seen it from both places I am using TTY::Command, I am not doing anything special with it.

I think you're probably right that change is doing it somehow though. I do know that dealing with deadlocks on stderr/stdout/stdin can be very tricky, because of the way unix does buffering of them. Unix will only let a process write so many bytes to stdout/stderr buffer before the buffer is read somewhere, if it fills up unix actually stops the process until it has room to write more to stdout/stderr. Don't know if something related to that is going on here, I was hoping I wouldn't have to worry about such things and could just leave them to TTY::Command. :)

I suspect other people are gonna run into this too though. I can't reproduce it on demand, but I've seen it in both of the only places I am using TTY::Command, i don't think there's anything special about the invocations, one to GraphicsMagick gm convert command line and one to vips command line. It is very mysterious, and not immediately even obviously attributable to TTY::Command.

jrochkind commented 6 years ago

Oh, I am using TTY::Command concurrently from multiple threads at once!

I wonder if that is related. I do not share TTY::Command instances between threads (at least not on purpose, I'll double check), but create a new TTY::Command.new instance in each thread. But I wonder if that synchronize lock is not actually reliably threadsafe when you have multiple TTY::Commands in multiple threads?

What version of TTY::Command should I downgrade to to avoid that lock?

I won't immediately know if it's fixed it though, it'll be like, wait a week, if I haven't seen the error happen again it probably fixed it, unless it's just luck I haven't run into it yet. Concurrency!

piotrmurach commented 6 years ago

Reproducing deadlocks is hard but I quite often I get to create some examples that if run 1000 times have high probably of triggering the problem. The closer the example to real scenario the better.

I will have a look into this over the weekend and release new version if I fix it - with a level of confidence. I don't think this lock is necessary as both threads don't share any data items. If you wish you can downgrade to v0.4.0 but this version is significantly different. I would prefer to solve the issue - let's have some confidence in the future of tty-comand :0

Yeah dealing with processes is interesting and I hope that this library can capture the best practice and provide a solid solution. That's why I haven't released v1.0 :)

jrochkind commented 6 years ago

Problem is I've got this in production, and need to get rid of the errors! Although maybe I can just wrap it in a rescue and retry, since it probably won't happen again when repeated.

I honestly don't know how to/have time to extract my code into a reproduction, but it's open source. Here's one place I got two fatal: No live threads left. Deadlock? from today:

https://github.com/chemheritage/chf-sufia/blob/master/app/services/chf/create_dzi_service.rb#L123

That's run in an ActiveJob, using resque.

And actually, here's the complete backtrace from honeybadger: https://app.honeybadger.io/fault/53196/d75cdd5f5a0c789ccfdb403da266db10

Hope that helps.

jrochkind commented 6 years ago

Ah, I see you removed the locks. Any chance for a release with that commit? I continue to get the deadlock error occasionally, about once a week. I need to either upgrade to a version that won't have it, remove TTY-command, or wrap in a rescue/retry.

piotrmurach commented 6 years ago

As as side effect of tty components getting more popular, I get sidetracked by important issues. As much as I wanted to finish work on this last week, I needed to fix tty-screen that is used by kontena.io in their CLI tool among other tty components. I also found out that fastlane is using this component as well. So I felt I needed to prioritise that component over this. I feel super stoked that people are using tty libs more and more! (that was elaborate excuse, don't you think? šŸ˜„ )

I've stripped all the thread synchronization out. There is no need for it, as there should not be any contention for resources on Ruby level. I thought about what you have indicated about buffers filling up on the system level, and as a solution will try to make sure that the writes are non-blocking to potentially prevent readers waiting. Once happy, which will definitely be this week, I will release a new version for you to try out. Thanks for your patience.

jrochkind commented 6 years ago

It's also quite possible my problem is not TTY::Command at all, this is a mysterious error. Just trying to rule everything out!

jrochkind commented 6 years ago

@piotrmurach Is it possible the threads created in read_stream could sometimes get 'leaked' and stay alive forever?

What if the line = stream.gets ends up blocking forever, or somehow reading forever, even after the run is done -- that thread will still be around. Would it make sense to make sure to stop your threads when the run completes, in an ensure?

This might actually be a worse problem, if that's what's going on. But I'm just brainstorming here.

I am actually using TTY::Command in a long-running web server process, not a rake script. A CLI utility that will exit when it's done, "leaked" threads don't matter, when the process exits all is forgiven. But for me they'd be a big problem!

piotrmurach commented 6 years ago

Cool, I'm happy to explore the problem to make sure we're covered from all angles. I don't pretend to know it all here and happy to get proven wrong.

I'm not sure that's possible to have a 'leaked' thread, however, I may be wrong here but I will try to elaborate my understanding.

The way I see it is that, when you run your external program, I spawn it into a separate process and hook 3 pipes into it. Now, two of these pipes are for reading and get placed into separate threads. The thread will be alive as long as the stream is open and readable. So what would keep the child process alive forever? Well, if the child process keeps outputting to the pipe and keeps it open. If for any reason it closes the pipe, including some abrupt stuff like kill signals, then the pipe is closed. When pipe gets closed then the line = stream.gets will definitely finish. Therefore the only way I can see the thread being alive forever is when the child process is aliave forever, like running tail on a log file etc.... but then I think that's up to the developer to manage what they run externally with systemd or some other monitoring things to avoid zombie processes.

piotrmurach commented 6 years ago

@jrochkind sorry, this is embarrassing. Got my ass kicked by flu bug and haven't looked at this code for a while. I'm really hoping to release version this weekend. Let's hope heavens will provide šŸ˜„

piotrmurach commented 6 years ago

Released v0.7.0 that should fix the deadlocking issue. Please could you give it a try and let me know how it performs?

jrochkind commented 6 years ago

I will update in my app. I won't really know if it's solved the issue unless I don't see the deadlock again for... some time. As I still can't reproduce on demand. I am not actually completely sure TTY::Command is responsible for the deadlock error. (I am relatively sure my app is not actually exhausting the total number of threads available to it, from OS-level look at how many threads it's using, and it is a deadlock for some reason).

piotrmurach commented 6 years ago

Thanks for updating the gem. I will leave this ticket open for couple of weeks to give enough time for the issue to surface in your app. Unless I hear otherwise I will assume all is good and close the ticket.