Closed karlosp closed 3 years ago
Would you be able to provide a packet capture of (at least) the TCP part of this exchange? Preferably with wireshark?
> epicsThreadGetCPUs() -> 7
Unrelated to the issue reported. What kind of system has an odd number of CPU cores/hyperthreads? Is this some kind of VM?
Yes, I am running this in VirtualBox, and I intentionally assign one core fewer than I have so that commands like make -j $(nproc) do not entirely "kill" my laptop.
I hope this Wireshark log will help.
I had CSS running with PV Formula: pva://topic1, and then I ran ./example/O.linux-x86_64/mailbox topic1
mailbox-topic1.zip
Maybe not relevant but this is an error from CSS
2021-01-12T09:19:40.304+01 SEVERE [Thread 1] org.csstudio.logging.PluginLogListener (logging) - Unhandled event loop exception
java.lang.NullPointerException
at org.diirt.support.pva.PVAChannelHandler.getProperties(PVAChannelHandler.java:314)
at org.csstudio.diag.pvmanager.probe.DetailsPanel.setChannelProperties(DetailsPanel.java:214)
at org.csstudio.diag.pvmanager.probe.DetailsPanel$1$1.run(DetailsPanel.java:194)
at org.eclipse.swt.widgets.RunnableLock.run(RunnableLock.java:40)
at org.eclipse.swt.widgets.Synchronizer.runAsyncMessages(Synchronizer.java:185)
at org.eclipse.swt.widgets.Display.runAsyncMessages(Display.java:5026)
at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:4582)
at org.eclipse.e4.ui.internal.workbench.swt.PartRenderingEngine$5.run(PartRenderingEngine.java:1173)
at org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:338)
at org.eclipse.e4.ui.internal.workbench.swt.PartRenderingEngine.run(PartRenderingEngine.java:1062)
at org.eclipse.e4.ui.internal.workbench.E4Workbench.createAndRunUI(E4Workbench.java:155)
at org.eclipse.ui.internal.Workbench.lambda$3(Workbench.java:644)
at org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:338)
at org.eclipse.ui.internal.Workbench.createAndRunWorkbench(Workbench.java:566)
at org.eclipse.ui.PlatformUI.createAndRunWorkbench(PlatformUI.java:150)
at org.csstudio.utility.product.Workbench.runWorkbench(Workbench.java:99)
at org.csstudio.startup.application.Application.startApplication(Application.java:265)
at org.csstudio.startup.application.Application.start(Application.java:119)
at org.csstudio.iter.css.product.ITERApplication.start(ITERApplication.java:120)
at org.eclipse.equinox.internal.app.EclipseAppHandle.run(EclipseAppHandle.java:203)
at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.runApplication(EclipseAppLauncher.java:137)
at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.start(EclipseAppLauncher.java:107)
at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:400)
at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:255)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:661)
at org.eclipse.equinox.launcher.Main.basicRun(Main.java:597)
at org.eclipse.equinox.launcher.Main.run(Main.java:1476)
at org.eclipse.equinox.launcher.Main.main(Main.java:1449)
> I hope this Wireshark log will help.
It looks like you captured only the UDP (search) traffic. The relevant part is the TCP traffic. I've added a section on packet capture to the documentation. Please let me know if this is helpful (and correct).
I may have an idea of what is going wrong. Can you re-test with the master branch (at e9ce80880d92eaf6c24309dbdcb4dcccb8750df5)? If this doesn't fix the issue, I've also added some more detail to the error message which will hopefully give some further clue.
I can confirm that the issue is fixed now.
I do not know if it is somehow related but I noticed one error in Log Messages in CSS while running the same example as described in my first post, which pops up exactly every 60s.
2021-01-13T12:17:29.262+01 WARNING [Thread 188] org.epics.pvaccess.impl.remote.codec.AbstractCodec (processHeader) - Invalid header received from client /10.0.2.15:59504, disconnecting...
I started capturing data a few seconds before the event and stopped about a second after the event.
Invalid header received from client.pcapng.gz
Maybe another issue should be opened for this?
> I can confirm that the issue is fixed now.
Good.
> Invalid header received from client /10.0.2.15:59504, disconnecting...
I think this error message is itself in error. It indicates a protocol framing error. Based on your last packet capture, and some local tests, I think the actual cause is that the server is timing out and closing the connection.
I can see an unacknowledged CMD_ECHO from the client, and ~200us later the server RSTs the connection. I guess this abnormal close somehow isn't handled properly in pvAccessJava, and maybe junk in the RX buffer is being processed?
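To make "protocol framing error" concrete, here is a minimal sketch of the kind of header check that could produce an "Invalid header" message. It assumes the 8-byte PVA header layout as I understand it from the pvAccess protocol specification (magic byte 0xCA, version, flags, command, 4-byte size) and that CMD_ECHO is command 0x02. The function name and returned keys are hypothetical, and this is not the pvAccessJava implementation.

```python
import struct

PVA_MAGIC = 0xCA   # assumed magic byte at the start of every PVA message
CMD_ECHO = 0x02    # assumed command code for CMD_ECHO


def parse_pva_header(buf: bytes) -> dict:
    """Parse the fixed 8-byte header at the start of a PVA message (sketch)."""
    if len(buf) < 8:
        raise ValueError("need at least 8 bytes for a PVA header")
    magic, version, flags, command = struct.unpack_from("4B", buf, 0)
    if magic != PVA_MAGIC:
        # Leftover bytes in a receive buffer would typically fail here.
        raise ValueError(f"invalid header: magic 0x{magic:02X}, expected 0xCA")
    big_endian = bool(flags & 0x80)  # assumed: bit 7 selects byte order
    (size,) = struct.unpack_from(">I" if big_endian else "<I", buf, 4)
    return {
        "version": version,
        "control": bool(flags & 0x01),      # assumed: bit 0 marks control messages
        "from_server": bool(flags & 0x40),  # assumed: bit 6 marks server-to-client
        "big_endian": big_endian,
        "command": command,
        "size": size,
    }
```

Under these assumptions, a client echo with an empty payload parses cleanly, while arbitrary leftover bytes are rejected at the magic-byte check, which would be consistent with the guess about junk in the RX buffer.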
If I set export PVXS_LOG=*=DEBUG (or WARN) for the mailbox server, I see e.g.
2021-01-13T09:58:59.610581953 WARN pvxs.tcp.io Client 192.168.210.1:55892 connection timeout
I don't see this every time though.
The long story of inactivity timeouts with pvAccessCPP is laid out in https://github.com/epics-base/pvAccessCPP/issues/139. The short story is that originally C++ clients were not sending CMD_ECHO, and C++ servers would never time out. I tried to address this with https://github.com/epics-base/pvAccessCPP/pull/144 .
I knew that pvAccessJava clients were sending CMD_ECHO, but it looks like I misinterpreted the meaning of the timeout configuration parameter. pvAccessJava clients send an echo every 30 seconds and time out after 60 seconds, while pvAccessCPP (and now PVXS) servers time out after 30 seconds.
So with a C++ server, and Java client, there is a tight race between the client sending CMD_ECHO, and the server timing out. On my laptop it seems that the client echo won often enough that I didn't notice this at the time. I do sometimes see the "Invalid header" message now though.
I guess the only reasonable course of action is to increase the timeout in pvAccessCPP and PVXS from 30 seconds to 60, while leaving the echo interval at 15 seconds?
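The race described above can be illustrated with a small, simplified model. It assumes the server resets its inactivity timer only when an echo arrives (no other traffic on the connection) and that each echo is delayed by some jitter relative to its nominal send time. The function name and parameters are hypothetical and are not taken from pvAccessCPP, pvAccessJava, or PVXS.

```python
def survives_idle(echo_interval, server_timeout, jitters):
    """Return True if every echo arrives before the server's inactivity timer fires.

    jitters holds one extra delay in seconds per echo, relative to the
    nominal send time n * echo_interval. Simplified model, not real code
    from any PVA implementation.
    """
    last_rx = 0.0  # time the server last received anything
    for n, jitter in enumerate(jitters, start=1):
        arrival = n * echo_interval + jitter
        if arrival - last_rx >= server_timeout:
            return False  # server timed out and closed the connection first
        last_rx = arrival
    return True


# Java client (echo every 30 s) against a 30 s server timeout:
# the outcome depends entirely on sub-second jitter.
print(survives_idle(30, 30, [0.0002]))   # echo arrives slightly late
print(survives_idle(30, 30, [-0.001]))   # echo arrives slightly early

# The same client against a 40 s server timeout has roughly 10 s of margin.
print(survives_idle(30, 40, [0.5, 1.0, 0.2]))
```

In this model the margin is simply the server timeout minus the echo interval, so a 30 second echo against a 30 second timeout leaves no slack at all.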
@kasemir fyi.
6861f03c60759af22064addeb8404fdde5af2983 increases the inactivity timeout to 40 seconds. A future change will make this configurable.
@mdavidsaver thanks for your quick response and detailed explanations.
With the latest commit, the Invalid header received warning does not show up any more.
Should we tag the latest commit with 0.1.1? Or at least the commit which fixed the original error.
> Should we tag the latest commit with 0.1.1?
Since you didn't find a third issue today, sure!
2021-01-14 18:34:59.061 SEVERE [Thread 1] org.csstudio.logging.PluginLogListener (logging) - Unhandled event loop exception
java.lang.NullPointerException
at org.epics.pvaccess.client.impl.remote.ChannelImpl.getRemoteAddress(ChannelImpl.java:558)
at org.diirt.support.pva.PVAChannelHandler.getProperties(PVAChannelHandler.java:313)
at org.csstudio.diag.pvmanager.probe.DetailsPanel.setChannelProperties(DetailsPanel.java:214)
...
Also, I was seeing, and continue to see, a log message similar to https://github.com/mdavidsaver/pvxs/issues/13#issuecomment-758495221. So I don't think it is related to the issue with processing of CMD_GET_FIELD requests (aka Introspect). Thinking about null is what led me to 0356eee74037a58e0e318ba6da5a1cc1ce2b4f82 though.
Describe the bug and steps to reproduce
Running ./example/O.linux-x86_64/mailbox topic1 and then running Control System Studio on CODAC 6.2.1. Entering pva://topic1 into PV Formula causes the following stack trace.
Expected behavior
No stack trace, get some values to CSS.
Information (please complete the following):