renaissance-benchmarks / renaissance

The Renaissance Benchmark Suite
https://renaissance.dev
GNU General Public License v3.0

Possible finagle-chirper crash #106

Open axel22 opened 5 years ago

axel22 commented 5 years ago

Some users reported seeing this on some machines:

Exception in thread "Thread-938" Failure(connection timed out: localhost/127.0.0.1:51834 at remote address: localhost/127.0.0.1:51834. Remote Info: Not Available, flags=0x08) with RemoteInfo -> Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: localhost/127.0.0.1:51834, Downstream label: :51834, Trace Id: 3b8504004e328e76.3b8504004e328e76<:3b8504004e328e76 with Service -> :51834
Caused by: com.twitter.finagle.ConnectionFailedException: connection timed out: localhost/127.0.0.1:51834 at remote address: localhost/127.0.0.1:51834. Remote Info: Not Available
    at com.twitter.finagle.netty4.ConnectionBuilder$$anon$1.operationComplete(ConnectionBuilder.scala:99)
    at com.twitter.finagle.netty4.ConnectionBuilder$$anon$1.operationComplete(ConnectionBuilder.scala:78)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:570)
    at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:127)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:335)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at com.twitter.finagle.util.BlockingTimeTrackingThreadFactory$$anon$1.run(BlockingTimeTrackingThreadFactory.scala:23)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.ConnectTimeoutException: connection timed out: localhost/127.0.0.1:51834
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:568)
    ... 11 more
axel22 commented 5 years ago

cc @Fithos

Fithos commented 5 years ago

I wasn't able to reproduce this error on Linux or macOS. Could you please post more details on the hardware configuration and installed OS of the machines where finagle-chirper fails?

axel22 commented 5 years ago

Unfortunately, neither was I able to reproduce it on my machine. I am in contact with Tom Rodriguez, who reported it and sees it happening on some cluster machines (apparently the issue occurs both on HotSpot and on GraalVM). cc @tkrodriguez

tkrodriguez commented 5 years ago

I've really only seen this on a fairly large machine: an Oracle X5-2 with 72 cores running Oracle Linux Server release 6.8. I've also seen problems with running out of file descriptors, though that was with an older version of this benchmark. The limit is 65536, so it would have to be leaking fds pretty badly for this to occur. Anything I can try to help with debugging?
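
One way to check for a descriptor leak from inside the JVM is to sample the open file descriptor count between iterations. A minimal sketch, not part of the harness; it relies on the com.sun.management UnixOperatingSystemMXBean, which is only available on Unix-style JDKs, and the logging call at the bottom is purely illustrative:

    import java.lang.management.ManagementFactory
    import com.sun.management.UnixOperatingSystemMXBean

    object FdMonitor {
      // Returns the number of file descriptors currently open by this JVM process,
      // or -1 if the platform MXBean does not expose the counter.
      def openFdCount(): Long =
        ManagementFactory.getOperatingSystemMXBean match {
          case unix: UnixOperatingSystemMXBean => unix.getOpenFileDescriptorCount
          case _                               => -1L
        }
    }

    // Example: print the count after every benchmark iteration to spot a leak.
    // println(s"open fds after iteration $i: ${FdMonitor.openFdCount()}")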

ceresek commented 5 years ago

The benchmark has a leaking file descriptor issue. Example exception dump:

Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedConstructorAccessor3.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:38)
    ... 172 more
Caused by: io.netty.channel.ChannelException: io.netty.channel.unix.Errors$NativeIoException: newSocketStream(..) failed: Too many open files
    at io.netty.channel.unix.Socket.newSocketStream0(Socket.java:439)
    at io.netty.channel.epoll.LinuxSocket.newSocketStream(LinuxSocket.java:184)
    at io.netty.channel.epoll.EpollSocketChannel.<init>(EpollSocketChannel.java:45)
    ... 176 more
Caused by: io.netty.channel.unix.Errors$NativeIoException: newSocketStream(..) failed: Too many open files
    at io.netty.channel.unix.Errors.newIOException(Errors.java:122)
    ... 179 more

ceresek commented 5 years ago

Seen more instances of this on regular machines (e.g. a measly 8-core box with 64 GB RAM). At the time of the crash, the benchmark has quite a lot of TCP sockets open. Sometimes this makes the benchmark hang instead of crash, which complicates measurements.

ceresek commented 5 years ago

... also seeing a constantly increasing number of threads (right now ~300 threads at iteration 400). If it helps, most thread instances are named UnboundedFuturePool-something.
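
A quick way to confirm this kind of growth from inside the benchmark is to group live threads by name prefix. A minimal sketch, assuming Scala 2.13-style collection converters; none of these names come from the harness:

    import scala.jdk.CollectionConverters._

    object ThreadCensus {
      // Groups live threads by their name with a trailing "-<number>" stripped,
      // so "UnboundedFuturePool-17" and "UnboundedFuturePool-203" land in one bucket.
      def byPrefix(): Map[String, Int] =
        Thread.getAllStackTraces.keySet.asScala.toSeq
          .groupBy(_.getName.replaceAll("-\\d+$", ""))
          .map { case (prefix, threads) => prefix -> threads.size }

      // Print the largest groups first, e.g. after each benchmark iteration.
      def report(): Unit =
        byPrefix().toSeq.sortBy(-_._2).foreach { case (prefix, n) => println(s"$n\t$prefix") }
    }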

axel22 commented 5 years ago

@Fithos Can you confirm the file descriptor issue and the thread growth?

ceresek commented 5 years ago

[chart attachment: finagle-chirper thread count over benchmark iterations]

axel22 commented 5 years ago

It's somewhat surprising that the thread count seems to be converging.

But it also drops at the very end, and tearDownAfterAll is the likely reason, I think. Perhaps we should change the benchmark so that it kills the server instance completely between iterations, as sketched below.
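
A rough sketch of what killing the server instance between iterations could look like, assuming a plain Finagle HTTP server; the helper name and the port are hypothetical and not taken from FinagleChirper:

    import com.twitter.finagle.{Http, ListeningServer, Service}
    import com.twitter.finagle.http.{Request, Response}
    import com.twitter.util.Await

    // Per-iteration lifecycle: start a fresh server, run the iteration body, then
    // close the server so its sockets and worker threads are released before the
    // next iteration starts.
    def withFreshServer[T](service: Service[Request, Response])(iteration: () => T): T = {
      val server: ListeningServer = Http.serve(":8080", service) // hypothetical port
      try iteration()
      finally Await.ready(server.close())
    }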

vkostyukov commented 5 years ago

Maybe it's because master isn't being properly closed here? https://github.com/renaissance-benchmarks/renaissance/blob/master/benchmarks/twitter-finagle/src/main/scala/org/renaissance/twitter/finagle/FinagleChirper.scala#L430-L433

I'm also not seeing where the other master instance is closed.

Clients in Finagle are designed to be long-lived and potentially shared across your application. They operate better and better the longer you keep them around (connections get materialized, past performance data about errors/latencies is recorded, etc.).

axel22 commented 5 years ago

Hi Vladimir! Thanks for the pointer! So, essentially, this could be fixed by calling close on the Service object, correct?

vkostyukov commented 5 years ago

That's right, @axel22. Just call .close() if you don't need the client anymore. Creating a single client instance and sharing it in your benchmark would be an even better workload, as that is essentially what traditional Finagle applications do.
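
For reference, a minimal sketch of the two options discussed here, with a placeholder destination string rather than the benchmark's real address:

    import com.twitter.finagle.{Http, Service}
    import com.twitter.finagle.http.{Request, Response}
    import com.twitter.util.{Await, Future}

    // Pattern 1: if a client really must be short-lived, release it explicitly
    // so its connections and worker resources are freed.
    def oneShot(request: Request): Response = {
      val client: Service[Request, Response] = Http.newService("localhost:8080")
      try Await.result(client(request))
      finally Await.ready(client.close())
    }

    // Pattern 2 (preferred): create one client up front and share it across
    // all iterations, as a long-running Finagle application would.
    val sharedClient: Service[Request, Response] = Http.newService("localhost:8080")
    def query(request: Request): Future[Response] = sharedClient(request)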

albertnetymk commented 5 years ago

Quoting from #168, "However, I hit #106 a lot locally on my MacBook. So this issue was apparently not fully resolved (cc @Fithos )."

So, the issue is still there, and should be reopened?