vaadin / flow

Vaadin Flow is a Java framework binding Vaadin web components to Java. This is part of Vaadin 10+.
Apache License 2.0

Vaadin 23.4.0: UI response performance degrades after some time #19429

Open moryakovdv opened 5 months ago

moryakovdv commented 5 months ago

Description of the bug

Hi. In our production environment the UI's responsiveness degrades significantly after some time. Stack: WildFly 23 + NGINX, Vaadin 23.4.0 + Spring Boot, automatic PUSH via WEBSOCKET. The default session duration of 30 minutes is set on WildFly.

The application performs rather fast (~400 ms) async UI updates (all of them called within ui.access()). After the initial start everything works just fine.
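
For illustration, a minimal sketch of the update pattern described above (hypothetical view and names; executor shutdown omitted for brevity): a scheduled background task mutates the UI every ~400 ms under ui.access(), and automatic push flushes each change over the push channel.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import com.vaadin.flow.component.UI;
    import com.vaadin.flow.component.html.Span;

    public class TickerView extends Span {
        private final ScheduledExecutorService executor =
                Executors.newSingleThreadScheduledExecutor();

        public TickerView() {
            UI ui = UI.getCurrent();
            // Mutate the component under the session lock; automatic push
            // sends the change to the browser after each access() completes.
            executor.scheduleAtFixedRate(
                    () -> ui.access(() -> setText("tick " + System.nanoTime())),
                    0, 400, TimeUnit.MILLISECONDS);
        }
    }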

But after some time (hours or even days) every UI request (opening a menu, pressing a button, etc.) starts to perform badly. The Vaadin loading bar starts to blink and eventually even gets stuck. Refreshing the UI or opening the page in another browser doesn't help, so even NEW UIs behave this way.

Several thread dumps show blocking behavior on the same lock in some threads in the Atmosphere engine (see screenshots). Probably there is a deadlock somewhere. Maybe this behavior occurs after session expiration or after switching from WebSocket to long polling.

Actually, I don't know how to investigate further. In our stage and development environments everything works as expected: timed-out sessions die, new sessions work correctly.

Any response will be appreciated!


Minimal reproducible example

Hard to reproduce; see the attached screenshots.


mcollovati commented 5 months ago

Hi, thanks for creating the issue. Could you provide a full thread dump, with the trace of all application threads?

moryakovdv commented 5 months ago

@mcollovati, Marco, thanks for the answer. Thread dump attached: aedump-1.zip

mcollovati commented 5 months ago

It looks like Atmosphere's UUIDBroadcasterCache gets stuck while processing a terminal operation (anyMatch()) on a parallel stream:

    private boolean hasMessage(String clientId, String messageId) {
        ConcurrentLinkedQueue<CacheMessage> clientQueue = messages.get(clientId);
        return clientQueue != null && clientQueue.parallelStream().anyMatch(m -> Objects.equals(m.getId(), messageId));
    }
"default task-183" - Thread t@3646
   java.lang.Thread.State: RUNNABLE
    at java.base@11.0.22/java.util.concurrent.ConcurrentLinkedQueue$CLQSpliterator.trySplit(ConcurrentLinkedQueue.java:880)
    at java.base@11.0.22/java.util.stream.AbstractShortCircuitTask.compute(AbstractShortCircuitTask.java:114)
    at java.base@11.0.22/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:746)
    at java.base@11.0.22/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
    at java.base@11.0.22/java.util.concurrent.ForkJoinPool$WorkQueue.helpCC(ForkJoinPool.java:1115)
    at java.base@11.0.22/java.util.concurrent.ForkJoinPool.externalHelpComplete(ForkJoinPool.java:1957)
    at java.base@11.0.22/java.util.concurrent.ForkJoinTask.tryExternalHelp(ForkJoinTask.java:378)
    at java.base@11.0.22/java.util.concurrent.ForkJoinTask.externalAwaitDone(ForkJoinTask.java:323)
    at java.base@11.0.22/java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:412)
    at java.base@11.0.22/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:736)
    at java.base@11.0.22/java.util.stream.MatchOps$MatchOp.evaluateParallel(MatchOps.java:242)
    at java.base@11.0.22/java.util.stream.MatchOps$MatchOp.evaluateParallel(MatchOps.java:196)
    at java.base@11.0.22/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
    at java.base@11.0.22/java.util.stream.ReferencePipeline.anyMatch(ReferencePipeline.java:528)
    at deployment.ROOT.war//org.atmosphere.cache.UUIDBroadcasterCache.hasMessage(UUIDBroadcasterCache.java:259)
    at deployment.ROOT.war//org.atmosphere.cache.UUIDBroadcasterCache.addMessageIfNotExists(UUIDBroadcasterCache.java:207)
    at deployment.ROOT.war//org.atmosphere.cache.UUIDBroadcasterCache.addToCache(UUIDBroadcasterCache.java:146)
    at deployment.ROOT.war//com.vaadin.flow.server.communication.LongPollingCacheFilter.filter(LongPollingCacheFilter.java:102)

The call happens during the execution of a BroadcasterFilter (LongPollingCacheFilter); every filter is executed in a synchronized block that locks on the filter instance. So all other requests are blocked by the broadcaster lock, which is not released because of the pending stream-processing operation.

That said, I can't say why the stream is stuck. It looks like the clientQueue is continuously getting new elements :thinking: Does your application perhaps perform a very high rate of push invocations?
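
As a side note, here is a toy sketch (not the application's code; all names are illustrative) of the suspected mechanism: ConcurrentLinkedQueue's spliterator is weakly consistent, so a parallel anyMatch() with a never-matching predicate may keep observing newly appended elements and never finish while a producer keeps adding to the queue.

    import java.util.Objects;
    import java.util.concurrent.ConcurrentLinkedQueue;

    public class GrowingQueueAnyMatch {
        public static void main(String[] args) {
            ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
            // Producer keeps appending, mimicking a high push rate.
            Thread producer = new Thread(() -> {
                long i = 0;
                while (true) {
                    queue.add("msg-" + i++);
                }
            });
            producer.setDaemon(true);
            producer.start();

            // Same shape as UUIDBroadcasterCache.hasMessage(): a predicate that
            // never matches forces a traversal of the whole (growing) queue.
            boolean found = queue.parallelStream()
                    .anyMatch(m -> Objects.equals(m, "no-such-id"));
            System.out.println("anyMatch returned: " + found); // may never be reached
        }
    }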

moryakovdv commented 5 months ago

Thanks for the investigation. Yes, the app has a high rate of async updates. I'm confused by the long polling: the @Push annotation is set to use WEBSOCKET. Does that mean that for some clients Atmosphere switches to long polling due to, say, network latency?

mcollovati commented 5 months ago

LongPollingCacheFilter is always executed, but it performs actions only if the transport is long polling. So, as you said, it seems that the transport is switched to long polling.
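
For readers unfamiliar with the filter, a paraphrased sketch of that behavior (not the verbatim Flow source; it assumes Atmosphere's PerRequestBroadcastFilter API): the per-request variant of the filter sees the target resource and can check its transport before caching anything.

    import org.atmosphere.cpr.AtmosphereResource;
    import org.atmosphere.cpr.BroadcastFilter.BroadcastAction;
    import org.atmosphere.cpr.PerRequestBroadcastFilter;

    public class TransportAwareCacheFilter implements PerRequestBroadcastFilter {

        @Override
        public BroadcastAction filter(String broadcasterId, AtmosphereResource resource,
                Object originalMessage, Object message) {
            if (resource.transport() == AtmosphereResource.TRANSPORT.LONG_POLLING) {
                // ... cache the message for later delivery to this client ...
            }
            // Let the broadcast continue unchanged for every other transport
            return new BroadcastAction(message);
        }

        @Override
        public BroadcastAction filter(String broadcasterId, Object originalMessage, Object message) {
            return new BroadcastAction(message);
        }
    }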

mcollovati commented 5 months ago

This issue seems similar: https://github.com/Atmosphere/atmosphere/issues/2262

moryakovdv commented 5 months ago

Yes, I saw that topic, but I have no idea how to get rid of it. It would be hard to reduce the number of push invocations. Does it make sense to switch to PushMode.MANUAL and delegate ui.push() to some queue?

mcollovati commented 5 months ago

You could maybe try to copy/paste the UUIDBroadcasterCache and rewrite the hasMessage method to perform the anyMatch on a copy of the list.
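
A minimal sketch of that idea (assuming the rest of UUIDBroadcasterCache is copied verbatim, so the messages map and CacheMessage type come from the copied class; needs java.util.ArrayList and java.util.Objects): scan a point-in-time snapshot of the queue, so the traversal is bounded even while producers keep appending to the live queue.

    private boolean hasMessage(String clientId, String messageId) {
        ConcurrentLinkedQueue<CacheMessage> clientQueue = messages.get(clientId);
        if (clientQueue == null) {
            return false;
        }
        // The snapshot has a fixed size, so this loop always terminates,
        // unlike a parallel stream over the continuously growing live queue.
        for (CacheMessage m : new ArrayList<>(clientQueue)) {
            if (Objects.equals(m.getId(), messageId)) {
                return true;
            }
        }
        return false;
    }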

moryakovdv commented 5 months ago

Well... Is it sufficient to use @BroadcasterCacheService public class MyBroadcasterCache implements BroadcasterCache {...} from the docs to make the framework use my implementation?

mcollovati commented 5 months ago

Does it make sense to switch to PushMode.MANUAL and delegate ui.push() to some queue?

I don't have an answer for this, sorry. It could help, but it could also just move the problem to a different layer. Anyway, it might be worth a try: queueing messages and performing fewer push calls could prevent the lock.

moryakovdv commented 5 months ago

Does it make sense to switch to PushMode.MANUAL and delegate ui.push() to some queue?

I don't have an answer for this, sorry. It could help, but it could also just move the problem to a different layer. Anyway, it might be worth a try: queueing messages and performing fewer push calls could prevent the lock.

I just thought this approach could switch parallel updates to serial ones.
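
For illustration, a minimal sketch of that queueing idea under @Push(PushMode.MANUAL) (a hypothetical helper, not a tested fix): background jobs mutate the UI via ui.access() as before and only mark a flag; a single scheduler flushes with one ui.push() per interval, serializing pushes regardless of the update rate.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicBoolean;

    import com.vaadin.flow.component.UI;

    public class BatchedPusher {
        private final UI ui;
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        private final AtomicBoolean dirty = new AtomicBoolean();

        public BatchedPusher(UI ui) {
            this.ui = ui;
            // Flush at most twice per second, no matter how often the state changes.
            scheduler.scheduleAtFixedRate(this::flush, 500, 500, TimeUnit.MILLISECONDS);
        }

        /** Background jobs call this after mutating UI state inside ui.access(). */
        public void markDirty() {
            dirty.set(true);
        }

        private void flush() {
            if (dirty.compareAndSet(true, false)) {
                ui.access(ui::push); // one manual push covers all pending changes
            }
        }

        public void stop() {
            scheduler.shutdownNow();
        }
    }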

mcollovati commented 5 months ago

Well... Is it sufficient to use @BroadcasterCacheService public class MyBroadcasterCache implements BroadcasterCache {...} from the docs to make the framework use my implementation?

IIRC you have to set it with the servlet init parameter; otherwise Flow will force UUIDBroadcasterCache.

moryakovdv commented 5 months ago

IIRC you have to set it with the servlet init parameter; otherwise Flow will force UUIDBroadcasterCache.

You are right: the following code in the PushRequestHandler class forces UUIDBroadcasterCache:

    static AtmosphereFramework initAtmosphere(final ServletConfig vaadinServletConfig) {
        AtmosphereFramework atmosphere = new AtmosphereFramework(false, false) {
            @Override
            protected void analytics() {
                // Overridden to disable version number check
            }

            @Override
            public AtmosphereFramework addInitParameter(String name,
                    String value) {
                if (vaadinServletConfig.getInitParameter(name) == null) {
                    super.addInitParameter(name, value);
                }
                return this;
            }
        };

        atmosphere.addAtmosphereHandler("/*", new PushAtmosphereHandler());
        atmosphere.addInitParameter(ApplicationConfig.BROADCASTER_CACHE,
                UUIDBroadcasterCache.class.getName());
        ...

But I cannot find the proper way to force it to use my CustomBroadcasterCache. I tried an init-param and a context-param in web.xml, the @BroadcasterCacheService annotation, and the following code:

    import javax.annotation.ManagedBean;
    import javax.servlet.ServletContext;

    import org.springframework.boot.web.servlet.ServletContextInitializer;

    @ManagedBean
    public class AtmosphereInitializer implements ServletContextInitializer {
        @Override
        public void onStartup(ServletContext servletContext) {
            // Intended to let Atmosphere read init parameters from the
            // ServletContext as well
            servletContext.setInitParameter("org.atmosphere.cpr.AtmosphereConfig.getInitParameter", "true");

            servletContext.setInitParameter("org.atmosphere.cpr.broadcaster.shareableThreadPool", "true");
            servletContext.setInitParameter("org.atmosphere.cpr.broadcaster.maxProcessingThreads", "8");
            // Attempt to replace the broadcaster cache (ignored, see below)
            servletContext.setInitParameter("org.atmosphere.cpr.broadcasterCacheClass",
                    "com.kmp.market.CustomBroadcasterCache");
        }
    }

My class com.kmp.market.CustomBroadcasterCache is completely ignored by PushRequestHandler. BTW, the other settings in the above method, e.g. servletContext.setInitParameter("org.atmosphere.cpr.broadcaster.maxProcessingThreads", "8"), work correctly.

Could you please show me the right way? Perhaps I've made a mess of the Vaadin + Spring Boot configuration.
Thanks in advance.

mcollovati commented 5 months ago

I think you need to set the parameter to the Vaadin servlet. Take a look at this comment for a similar use case: https://github.com/vaadin/flow/issues/16664#issuecomment-1552890336

Anyway, I would also investigate why the push connection is downgraded to long polling.

moryakovdv commented 5 months ago

Finally, I got it. With the code below, CustomBroadcasterCache is linked to the framework:

    @Bean
    BeanPostProcessor patchAtmosphereBroadcaster() {
        return new BeanPostProcessor() {
            @Override
            public Object postProcessBeforeInitialization(Object bean, String beanName) throws BeansException {
                if (bean instanceof ServletRegistrationBean<?>) {
                    ServletRegistrationBean<?> reg = (ServletRegistrationBean<?>) bean;
                    // Patch only the registration of Vaadin's SpringServlet, so the
                    // parameters end up as init parameters of the Vaadin servlet itself
                    if (reg.getServlet() instanceof SpringServlet) {
                        reg.addInitParameter("org.atmosphere.cpr.AtmosphereConfig.getInitParameter", "true");
                        reg.addInitParameter("org.atmosphere.cpr.maxSchedulerThread", String.valueOf(maxSchedulerThread));

                        reg.addInitParameter("org.atmosphere.cpr.broadcaster.shareableThreadPool", "true");
                        reg.addInitParameter("org.atmosphere.cpr.broadcaster.maxProcessingThreads", String.valueOf(maxProcessingThreads));
                        reg.addInitParameter("org.atmosphere.cpr.broadcaster.maxAsyncWriteThreads", String.valueOf(maxAsyncWriteThreads));

                        // Now this is picked up instead of the forced UUIDBroadcasterCache
                        reg.addInitParameter("org.atmosphere.cpr.broadcasterCacheClass", "com.kmp.market.CustomBroadcasterCache");
                    }
                }
                return bean;
            }
        };
    }

moryakovdv commented 5 months ago

Now, some investigation results:

  1. Without any further modifications, I added a size printout inside hasMessage():

         private boolean hasMessage(String clientId, String messageId) {
             ...
             int size = clientQueue.size();
             if (size > 0) {
                 System.out.println(size);
             }
             ...

  2. The browser requested the application with NGINX running; the WEBSOCKET transport was instantiated and worked as expected.
  3. The occasional output showed 1-2 messages in the queue.
  4. Stop NGINX; the browser shows "Connection lost".
  5. Start NGINX again; the browser reconnects.
  6. The output counter goes mad and shows a constantly increasing number of messages in the queue.

So we probably have long polling after the emulated NGINX restart.

moryakovdv commented 5 months ago

And one more thing: simply put a breakpoint into hasMessage, and any UI will get stuck with the Vaadin loader and no response, even a UI that was connected without the proxy.

mcollovati commented 5 months ago

And one more thing: simply put a breakpoint into hasMessage, and any UI will get stuck with the Vaadin loader and no response, even a UI that was connected without the proxy.

I think this is somehow expected. When pushing changes to the client, a VaadinSession lock is held, so if you block execution in hasMessage, any other access to VaadinSession will wait for the VaadinSession lock to be released.
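
To make the locking concrete, a minimal illustration (hypothetical helper, not from the thread): every background mutation goes through ui.access(), which runs the command under the VaadinSession lock, so a push operation blocked while holding that lock freezes all of them.

    import com.vaadin.flow.component.UI;
    import com.vaadin.flow.component.html.Span;

    public class BackgroundUpdater {
        static void update(UI ui, Span status, String text) {
            // Executes only once the VaadinSession lock is free; if push blocks
            // while holding the lock, this command waits and the UI appears frozen.
            ui.access(() -> status.setText(text));
        }
    }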

mcollovati commented 5 months ago

The output counter goes mad and shows a constantly increasing number of messages in the queue

This is probably because, while Atmosphere is trying to push cached messages to the client, the application is constantly adding new messages, so the queue is never empty.

So we probably have long polling after the emulated NGINX restart.

You can check it in the browser's network tab: if the WebSocket channel is closed, you might see HTTP push requests happening continuously.

mcollovati commented 5 months ago

Probably one of the questions here is: would it be possible to reconnect with the WebSocket transport after a network failure, instead of falling back to long polling?


mcollovati commented 5 months ago

If you are confident that WebSocket will ALWAYS work for the application's clients, you can set WebSocket as the fallback transport as well.

    @Bean
    VaadinServiceInitListener configureWebsocketFallbackForPush() {
        return serviceEvent -> serviceEvent.getSource().addUIInitListener(uiEvent -> {
            // Use WEBSOCKET as the fallback too, so the client never
            // downgrades to long polling after a reconnect
            uiEvent.getUI().getPushConfiguration().setFallbackTransport(Transport.WEBSOCKET);
        });
    }


moryakovdv commented 5 months ago

If you are confident that WebSocket will ALWAYS work for the application's clients, you can set WebSocket as the fallback transport as well.

Thanks for the suggestion. In what cases could falling back to WebSocket instead of long polling hurt the application? Connection issues? Old browsers? Some security settings?

mcollovati commented 5 months ago

I would say in cases where the client may not be able to establish a WebSocket connection at all. An old browser could be one case. Corporate firewalls and proxies may also block some requests in certain setups, or network connections may time out. Without the long-polling fallback, push will basically not work, and changes will be propagated to the client only as a result of a server interaction (e.g. a click on a button).

moryakovdv commented 2 months ago

Hello. We investigated further. We have now gotten rid of the complete application hangs by modifying the BroadcasterCache, but we still have performance issues with BroadcasterConfig. Please have a look at the points below and the attached screenshot.

  1. The connection is stable; the server's CPU consumption is calm.
  2. The client's network interface is switched off. The Network tab in the client's Firefox shows an error on the WebSocket connection, and the loading bar starts to blink.
  3. The client's network interface is switched on again. Firefox shows reconnection messages and starts long-polling requests. The server log shows the "famous" message 'UnsupportedOperationException: Unexpected message id from the client. Expected sync id: 137, got 138'. The server's CPU starts burning (probably we have a lot of messages to broadcast to the client that were lost in step 2).
  4. The user reloads the page (pressing F5) in Firefox. The session is closed and CPU consumption calms down.

While investigating thread dumps, I found that the default BroadcasterConfig is stuck in filter(), in a synchronized block.

Probably the old LongPollingCacheFilter object from the previous session was not removed or renewed after the client's reconnect.

I would appreciate your ideas on how to get out of this situation.

PS: we have NGINX in front on both stage and production. The above bug occurs in both environments, no matter what timings (proxy_send, proxy_connect, etc.) are set on the NGINX instances.