jetty / jetty.project

Eclipse Jetty® - Web Container & Clients - supports HTTP/2, HTTP/1.1, HTTP/1.0, websocket, servlets, and more
https://eclipse.dev/jetty
Other
3.87k stars 1.91k forks source link

Embedded Jetty stops responding and then hangs while stopping #6059

Closed bhawani1978 closed 2 years ago

bhawani1978 commented 3 years ago

Jetty version 9.4.33.v20201020

Java version JRE-8u271

OS type/version Windows Server

We are running embedded jetty in our java application, and perform a health check by calling a simple webservice, which just returns a success status. This health check done in main thread about every 1 minute and if we don't get response in time out interval (20 seconds), then we try to restart embedded jetty server by calling stop and later start method. We have noticed sometimes, we don’t see any trace for health check call in jetty log but we see other request are being served. Than after 20 sec we receive time out socketException, main thread thinks, service is not in good status and try to restart jetty, and then it hangs in stop.

From time to time we see this TimeoutException though: java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:171) at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:113) at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown Source) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)

attaching rest of the jetty longs. jetty-hang-log.zip

sbordet commented 3 years ago

Sorry, it's not clear what you're doing, nor where the problem is.

From your logs I see server-side logs when from your description you have a problem in the client, so it's not clear if you have a client problem contacting an external webservice, or you also run the webservice with Jetty? Is the problem in the client or in the server?

It's not clear what you mean by "restart Jetty" because you're speaking about the client, while "Jetty" typically indicates the server.

Also from your logs, the server is just idle and perfectly working (it just served a request few seconds before the end of the logs).

Also from your logs, you seem to use custom extensions of Jetty classes (WorkaroundManagedSelector) which may be buggy.

I doubt it's Jetty the problem here. I would double check your application.

joakime commented 3 years ago

Idle timeout rarely point at anything useful for troubleshooting this. It only means you have a client connection that hasn't been closed by either the client or server side.

When you experience a "hung jetty", do the following, before your restart.

Some example requests ...

# 400 request
$ curl -vvv http://localhost:8080/"hello world"
*   Trying 127.0.0.1:8080...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /hello world HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 400 Illegal character SPACE=' '
< Content-Type: text/html;charset=iso-8859-1
< Content-Length: 70
< Connection: close
< Server: Jetty(9.4.38.v20210224)
< 
* Closing connection 0
<h1>Bad Message 400</h1><pre>reason: Illegal character SPACE=' '</pre>

$ curl -vvv -H "Content-Length: abc" http://localhost:8080/
*   Trying 127.0.0.1:8080...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET / HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.68.0
> Accept: */*
> Content-Length: abc
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 400 Invalid Content-Length Value
< Content-Type: text/html;charset=iso-8859-1
< Content-Length: 71
< Connection: close
< Server: Jetty(9.4.38.v20210224)
< 
* Closing connection 0
<h1>Bad Message 400</h1><pre>reason: Invalid Content-Length Value</pre>

# 404 requests
$ curl -vvv http://localhost:8080/%2e;/
*   Trying 127.0.0.1:8080...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /%2e HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< Cache-Control: must-revalidate,no-cache,no-store
< Content-Type: text/html;charset=iso-8859-1
< Content-Length: 441
< Server: Jetty(9.4.38.v20210224)
< 
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404 Not Found</h2>
<table>
<tr><th>URI:</th><td>/%2e</td></tr>
<tr><th>STATUS:</th><td>404</td></tr>
<tr><th>MESSAGE:</th><td>Not Found</td></tr>
<tr><th>SERVLET:</th><td>-</td></tr>
</table>
<hr><a href="https://eclipse.org/jetty">Powered by Jetty:// 9.4.38.v20210224</a><hr/>

</body>
</html>
$ echo -e "GET ??? HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n" | nc localhost 8080
HTTP/1.1 400 Bad URI
Content-Type: text/html;charset=iso-8859-1
Content-Length: 50
Connection: close
Server: Jetty(9.4.38.v20210224)

<h1>Bad Message 400</h1><pre>reason: Bad URI</pre>

What we are trying to do is understand the state of your server. (the server dump helps here)

And we are trying to isolate WHERE in the server you are experiencing the problems. Is Jetty truly hung? (the only way we've experienced this is due to misconfigured thread pools) Is the application hung? (the requests for content that doesn't pass through your application tests this) Is something else impacting you? (eg: database pools)

bhawani1978 commented 3 years ago

Sorry, it's not clear what you're doing, nor where the problem is.

From your logs I see server-side logs when from your description you have a problem in the client, so it's not clear if you have a client problem contacting an external webservice, or you also run the webservice with Jetty? Is the problem in the client or in the server?

It's not clear what you mean by "restart Jetty" because you're speaking about the client, while "Jetty" typically indicates the server.

Also from your logs, the server is just idle and perfectly working (it just served a request few seconds before the end of the logs).

Also from your logs, you seem to use custom extensions of Jetty classes (WorkaroundManagedSelector) which may be buggy.

I doubt it's Jetty the problem here. I would double check your application.

I mean Restart jetty, by calling server.stop() and then server.start().

bhawani1978 commented 3 years ago

I see two jetty threads waiting at WindowsSelectorImpl$SubSelector.poll0, rest are at Unsafe.park.

"qtp2056418216-11752" Id=11752 TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@110f2e5d at sun.misc.Unsafe.park(Native Method)

"qtp2056418216-11747" Id=11747 TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@110f2e5d at sun.misc.Unsafe.park(Native Method)

"qtp2056418216-11729" Id=11729 RUNNABLE (in native) at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll0(Native Method) at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll(Unknown Source) at sun.nio.ch.WindowsSelectorImpl$SubSelector.access$400(Unknown Source) at sun.nio.ch.WindowsSelectorImpl.doSelect(Unknown Source) at sun.nio.ch.SelectorImpl.lockAndDoSelect(Unknown Source)

"qtp2056418216-11678" Id=11678 RUNNABLE (in native) at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll0(Native Method) at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll(Unknown Source) at sun.nio.ch.WindowsSelectorImpl$SubSelector.access$400(Unknown Source) at sun.nio.ch.WindowsSelectorImpl.doSelect(Unknown Source) at sun.nio.ch.SelectorImpl.lockAndDoSelect(Unknown Source)

"qtp2056418216-11657" Id=11657 TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@110f2e5d at sun.misc.Unsafe.park(Native Method)

To monitor service performance, we have a WS client within this program, with 20 second timeout, test service, simply returns a String. When this client fails due to time out, program tries to restart by stoping embeded server and starting it, where it seems it gets stuck. "main" Id=1 WAITING on java.util.concurrent.CountDownLatch$Sync@145dc24a at sun.misc.Unsafe.park(Native Method)

sbordet commented 3 years ago

I see two jetty threads waiting at WindowsSelectorImpl$SubSelector.poll0, rest are at Unsafe.park.

That's normal.

To monitor service performance, we have a WS client within this program, with 20 second timeout, test service, simply returns a String. When this client fails due to time out, program tries to restart by stoping embeded server and starting it, where it seems it gets stuck.

You should report what happens when this client fails, and in what status the server is when the server fails. Please call Server.dump() and report the resulting string in this issue.

So far, what you have reported is a normally, perfectly working server. We need to know exactly what happens at the moment the client fails.

joakime commented 3 years ago

@bhawani1978 what are the results of the information collection outlined in my prior comment?

bhawani1978 commented 3 years ago

As of now we are not taking server.dump before stopping. We will make changes to capture before trying to restart.

Also are we having time out of 20 seconds too low before trying to restart, which is causing the problem while stuck in stop.

joakime commented 3 years ago

Keep in mind that we are not asking for a memory dump, or a class dump, or any other kind of normal JVM dump.

We are specifically asking for a Jetty Server.dump(). You can access this from JMX as an operation on the server.

bhawani1978 commented 3 years ago

We are able to get the server dump when this issue is happening. jetty_dump.txt

Clients requests are getting time out. Seems old connections are not getting cleaned up, please let us your view.

sbordet commented 3 years ago

@bhawani1978 the dump you report shows that the selector threads are gone, and update tasks are accumulating into the selectors, but there is no thread to run them (nor to perform the select operation).

You appear to be running some custom code (WorkaroundManagedSelector), so unless we know exactly what is your code doing, we cannot know what's going on.

Unfortunately Windows's NIO implementation in OpenJDK is not as stable as Linux's so there is not much we can do. One option for you is to try and use Java 13 or later, can you try that? In Java 13 the whole socket implementation has been rewritten, so there may be a chance that you don't need any workaround for NIO on Windows (not a guarantee, but worth a try).

What is it that you are working around in WorkaroundManagedSelector? Are you sure you are doing the right thing? I ask because the selector threads are gone and I suspect that it is your workaround that makes them go, and that is not good.

bhawani1978 commented 3 years ago

Hi @sbordet, we used https://github.com/jetty-project/selector-hack as suggested in https://github.com/eclipse/jetty.project/issues/2205.

sbordet commented 3 years ago

@bhawani1978 have you actually hit the Windows selector spin problem so that you need the workaround?

The code you are using is highly experimental as written in the projects README, so there might be a chance that it does not work in your case.

I don't see how WorkaroundManagedSelector could kill the selector thread, so there must be something else going on that causes another issue that makes the selector thread go away.

You should have logging in your logs that the selector has been rebuilt, and if the selector thread is gone you should have some stack trace of that too in the logs.

bhawani1978 commented 3 years ago

On one of the environment, we see that issue starts intermittently and finally over a day or so, it completly stalls it. I was able to get current jetty logs in debug, thread dump and server dump, when its working but on and off.

jetty_logs.log jetty_server_dump.txt jetty_thread_dump.txt

sbordet commented 3 years ago

@bhawani1978 I can see the selecting threads, but the selectors are in a borked state.

Can you please call QueuedThreadPool.setDetailedDump(true) either in the code or via JMX, and dump the server again with this new setting and the system in borked state?

Also, can you please compare and tell us if your WorkaroundManagedSelector is exactly equal to what we have in the experimental project? I ask because the minimum code change in this component may render the server invalid.

I hate to say, but I have to, can you move to Linux? I mean, even if we get this one worked around properly (which is not guaranteed), how long before you hit another Windows NIO issue?

bhawani1978 commented 3 years ago

@sbordet thanks for the update.

Our WorkaroundManagedSelector and WorkaroundServerConnector is exactly same, line by line. I will try to enable QueuedThreadPool.setDetailedDump(true) get more details. Also we are thinking about moving to Linux but will take sometime.

Also I noticed that issue starts with intermittent timeout for some requests, where one fails one time passes next time. It takes hours before completely shutting this down for us.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has been a full year without activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 2 years ago

This issue has been closed due to it having no activity.