Client Error while processing cmd 0x11

karlosp commented 3 years ago

Describe the bug and steps to reproduce Running ./example/O.linux-x86_64/mailbox topic1 and then running Control System Studio on CODAC 6.2.1. Enter pva://topic1 into PV Formula causes the following stack trace

Effective config
EPICS_PVAS_INTF_ADDR_LIST="0.0.0.0"
EPICS_PVAS_BEACON_ADDR_LIST="10.0.2.255:5076 10.0.2.255 192.168.122.255"
EPICS_PVAS_AUTO_BEACON_ADDR_LIST=NO
EPICS_PVAS_SERVER_PORT=5075
EPICS_PVAS_BROADCAST_PORT=5076
Running
2021-01-11T12:53:03.549084029 CRIT pvxs.tcp.io Client Error while processing cmd 0x11: Error decoding Introspect
Dumping a stack trace of thread 'PVXTCP':
[    0x7f9e16da843b]: /opt/codac-6.2/epics-base/lib/linux-x86_64/libCom.so.3.18.1(epicsStackTrace+0x4b)
[    0x7f9e16ff6b8a]: /opt/codac-6.2/pvxs/lib/linux-x86_64/libpvxs.so.0.1(_ZN4pvxs6detail11_log_printfEjPKcz+0xaa)
[    0x7f9e1704a259]: /opt/codac-6.2/pvxs/lib/linux-x86_64/libpvxs.so.0.1(_ZN4pvxs4impl8ConnBase7bevReadEv+0x769)
[    0x7f9e1705c1a9]: /opt/codac-6.2/pvxs/lib/linux-x86_64/libpvxs.so.0.1(_ZN4pvxs4impl10ServerConn7bevReadEv+0x9)
[    0x7f9e1704a4c1]: /opt/codac-6.2/pvxs/lib/linux-x86_64/libpvxs.so.0.1(_ZN4pvxs4impl8ConnBase8bevReadSEP11buffereventPv+0x21)
[    0x7f9e1626119c]: /opt/codac-6.2/pvxs/bundle/usr/linux-x86_64/lib/libevent_core-2.2.so.1(bufferevent_run_deferred_callbacks_locked+0x7c)
[    0x7f9e1626a0b8]: /opt/codac-6.2/pvxs/bundle/usr/linux-x86_64/lib/libevent_core-2.2.so.1(event_process_active_single_queue.isra.33+0x268)
[    0x7f9e1626a8af]: /opt/codac-6.2/pvxs/bundle/usr/linux-x86_64/lib/libevent_core-2.2.so.1(event_base_loop+0x3ff)
[    0x7f9e1703b739]: /opt/codac-6.2/pvxs/lib/linux-x86_64/libpvxs.so.0.1(_ZN4pvxs4impl6evbase3Pvt3runEv+0x109)
[    0x7f9e16d9c639]: /opt/codac-6.2/epics-base/lib/linux-x86_64/libCom.so.3.18.1(epicsThreadCallEntryPoint+0x69)
[    0x7f9e16da263c]: /opt/codac-6.2/epics-base/lib/linux-x86_64/libCom.so.3.18.1(start_routine+0xdc)
[    0x7f9e15e3ae25]: /lib64/libpthread.so.0(start_thread+0xc5)
[    0x7f9e1657834d]: /lib64/libc.so.6(clone+0x6d)
2021-01-11T12:53:06.631608105 CRIT pvxs.tcp.io Client Error while processing cmd 0x11: Error decoding Introspect
Dumping a stack trace of thread 'PVXTCP':
[    0x7f9e16da843b]: /opt/codac-6.2/epics-base/lib/linux-x86_64/libCom.so.3.18.1(epicsStackTrace+0x4b)
[    0x7f9e16ff6b8a]: /opt/codac-6.2/pvxs/lib/linux-x86_64/libpvxs.so.0.1(_ZN4pvxs6detail11_log_printfEjPKcz+0xaa)
[    0x7f9e1704a259]: /opt/codac-6.2/pvxs/lib/linux-x86_64/libpvxs.so.0.1(_ZN4pvxs4impl8ConnBase7bevReadEv+0x769)
^C[    0x7f9e1705c1a9]: /opt/codac-6.2/pvxs/lib/linux-x86_64/libpvxs.so.0.1(_ZN4pvxs4impl10ServerConn7bevReadEv+0x9)
[    0x7f9e1704a4c1]: /opt/codac-6.2/pvxs/lib/linux-x86_64/libpvxs.so.0.1(_ZN4pvxs4impl8ConnBase8bevReadSEP11buffereventPv+0x21)
[    0x7f9e1626119c]: /opt/codac-6.2/pvxs/bundle/usr/linux-x86_64/lib/libevent_core-2.2.so.1(bufferevent_run_deferred_callbacks_locked+0x7c)
[    0x7f9e1626a0b8]: /opt/codac-6.2/pvxs/bundle/usr/linux-x86_64/lib/libevent_core-2.2.so.1(event_process_active_single_queue.isra.33+0x268)
[    0x7f9e1626a8af]: /opt/codac-6.2/pvxs/bundle/usr/linux-x86_64/lib/libevent_core-2.2.so.1(event_base_loop+0x3ff)
[    0x7f9e1703b739]: /opt/codac-6.2/pvxs/lib/linux-x86_64/libpvxs.so.0.1(_ZN4pvxs4impl6evbase3Pvt3runEv+0x109)
[    0x7f9e16d9c639]: /opt/codac-6.2/epics-base/lib/linux-x86_64/libCom.so.3.18.1(epicsThreadCallEntryPoint+0x69)

Expected behavior A clear and concise description of what you expected to happen. No stack trace, get some values to CSS.

Information (please complete the following):

╰─ ./pvxinfo -D
Host: linux-x86_64
Target: linux-x86_64 Linux gcc
Toolchain
    __cplusplus = 201103
    GCC 4.8.5
    GLIBC 2.17
    __GLIBCXX__ 20150623
Versions
    PVXS 0.1.1 (0.1.0-20-g26cf7f00e607b6f5383f-dirty)
    EPICS 7.0.4.1-DEV
    libevent 2.2.0-alpha-dev
Runtime
    uname() -> Linux karlosp.codac-dev 3.10.0-693.2.1.el7.x86_64 #1 SMP Fri Aug 11 04:58:43 EDT 2017 x86_64
    epicsThreadGetCPUs() -> 7
    osiLocalAddr() -> 10.0.2.15
    osiSockDiscoverBroadcastAddresses() ->
        10.0.2.255
        192.168.122.255
Effective Client config from environment
    EPICS_PVA_ADDR_LIST="10.0.2.255:5076 10.0.2.255 192.168.122.255"
    EPICS_PVA_AUTO_ADDR_LIST=NO
    EPICS_PVA_BROADCAST_PORT=5076
Effective Server config from environment
    EPICS_PVAS_INTF_ADDR_LIST="0.0.0.0"
    EPICS_PVAS_BEACON_ADDR_LIST="10.0.2.255:5076 10.0.2.255 192.168.122.255"
    EPICS_PVAS_AUTO_BEACON_ADDR_LIST=NO
    EPICS_PVAS_SERVER_PORT=5075
    EPICS_PVAS_BROADCAST_PORT=5076

mdavidsaver commented 3 years ago

Would you be able to provide a packet capture of (at least) the TCP part of this exchange? Preferably with wireshark?

mdavidsaver commented 3 years ago

    epicsThreadGetCPUs() -> 7

Unrelated to the issue reported. What kind of system has an odd number of CPU cores/hyperthreads? Is this some kind of VM?

karlosp commented 3 years ago

Yes I am running this in VirtualBox and I intentionally assign one core less than I have so that commands like make -j $(nproc) does not entirely "kill" my laptop.

I hope this Wireshark log will help.

I had a running CSS with PV Formula: pva://topic1 and then I run ./example/O.linux-x86_64/mailbox topic1 mailbox-topic1.zip

Maybe not relevant but this is an error from CSS

2021-01-12T09:19:40.304+01 SEVERE [Thread 1] org.csstudio.logging.PluginLogListener (logging) - Unhandled event loop exception
java.lang.NullPointerException
    at org.diirt.support.pva.PVAChannelHandler.getProperties(PVAChannelHandler.java:314)
    at org.csstudio.diag.pvmanager.probe.DetailsPanel.setChannelProperties(DetailsPanel.java:214)
    at org.csstudio.diag.pvmanager.probe.DetailsPanel$1$1.run(DetailsPanel.java:194)
    at org.eclipse.swt.widgets.RunnableLock.run(RunnableLock.java:40)
    at org.eclipse.swt.widgets.Synchronizer.runAsyncMessages(Synchronizer.java:185)
    at org.eclipse.swt.widgets.Display.runAsyncMessages(Display.java:5026)
    at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:4582)
    at org.eclipse.e4.ui.internal.workbench.swt.PartRenderingEngine$5.run(PartRenderingEngine.java:1173)
    at org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:338)
    at org.eclipse.e4.ui.internal.workbench.swt.PartRenderingEngine.run(PartRenderingEngine.java:1062)
    at org.eclipse.e4.ui.internal.workbench.E4Workbench.createAndRunUI(E4Workbench.java:155)
    at org.eclipse.ui.internal.Workbench.lambda$3(Workbench.java:644)
    at org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:338)
    at org.eclipse.ui.internal.Workbench.createAndRunWorkbench(Workbench.java:566)
    at org.eclipse.ui.PlatformUI.createAndRunWorkbench(PlatformUI.java:150)
    at org.csstudio.utility.product.Workbench.runWorkbench(Workbench.java:99)
    at org.csstudio.startup.application.Application.startApplication(Application.java:265)
    at org.csstudio.startup.application.Application.start(Application.java:119)
    at org.csstudio.iter.css.product.ITERApplication.start(ITERApplication.java:120)
    at org.eclipse.equinox.internal.app.EclipseAppHandle.run(EclipseAppHandle.java:203)
    at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.runApplication(EclipseAppLauncher.java:137)
    at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.start(EclipseAppLauncher.java:107)
    at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:400)
    at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:255)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:661)
    at org.eclipse.equinox.launcher.Main.basicRun(Main.java:597)
    at org.eclipse.equinox.launcher.Main.run(Main.java:1476)
    at org.eclipse.equinox.launcher.Main.main(Main.java:1449)

mdavidsaver commented 3 years ago

I hope this Wireshark log will help.

It looks like you captured only the UDP (search) traffic. The relevant part is the TCP traffic. I've added a section on packet capture to the documentation. Please let me know if this is helpful (and correct).

mdavidsaver commented 3 years ago

I may have an idea of what is going wrong. Can you re-test with the master branch (at e9ce80880d92eaf6c24309dbdcb4dcccb8750df5)? If this doesn't fix the issue, I've also added some more detail to the error message which will hopefully give some further clue.

karlosp commented 3 years ago

I can confirm that the issue is fixed now.

I do not know if it is somehow related but I noticed one error in Log Messages in CSS while running the same example as described in my first post, which pops up exactly every 60s.

2021-01-13T12:17:29.262+01 WARNING [Thread 188] org.epics.pvaccess.impl.remote.codec.AbstractCodec (processHeader) - Invalid header received from client /10.0.2.15:59504, disconnecting...

I started capturing data a few seconds before the event and stopped about a second after the event. Invalid header received from client.pcapng.gz

Maybe another issue should be opened for this?

mdavidsaver commented 3 years ago

I can confirm that the issue is fixed now.

Good.

Invalid header received from client /10.0.2.15:59504, disconnecting...

I think this error message is itself in error. It indicates a protocol framing error. Based on your last packet capture, and some local tests, I think the actual cause is that the server is timing out and closing the connection.

I can see an unacknowledged CMD_ECHO from the client, and a ~200us later the server RSTs the connection. I guess this abnormal close somehow isn't handled properly in pvAccessJava and maybe junk in the RX buffer is being processed?

If I set export PVXS_LOG=*=DEBUG (or WARN) for the mailbox server I see eg.

2021-01-13T09:58:59.610581953 WARN pvxs.tcp.io Client 192.168.210.1:55892 connection timeout

I don't see this every time though.

The long story of inactivity timeouts with pvAccessCPP is laid out in https://github.com/epics-base/pvAccessCPP/issues/139. The short story is that originally C++ clients were not sending CMD_ECHO, and C++ servers would never timeout. I tried to address this with https://github.com/epics-base/pvAccessCPP/pull/144 .

I knew that pvAccessJava clients were sending CMD_ECHO, but it looks like I misinterpreted the meaning of the timeout configuration parameter. pvAccessJava clients are sending a echo every 30 seconds and timeout out after 60 seconds, while pvAccessCPP (and now PVXS) servers timeout after 30 seconds.

So with a C++ server, and Java client, there is a tight race between the client sending CMD_ECHO, and the server timing out. On my laptop it seems that the client echo won often enough that I didn't notice this at the time. I do sometimes see the "Invalid header" message now though.

I guess the only reasonable course of action is to increase the timeout in pvAccessCPP and PVXS from 30 seconds to 60, while leaving the echo interval at 15 seconds?

@kasemir fyi.

mdavidsaver commented 3 years ago

6861f03c60759af22064addeb8404fdde5af2983 increases the inactivity timeout to 40 seconds. A future change will make this configurable.

karlosp commented 3 years ago

@mdavidsaver thanks for your quick response and detailed explanations.

With the latest commit, Invalid header received warning does not show up any more.

Should we tag the latest commit with 0.1.1? Or at least the commit which fixed the original error.

mdavidsaver commented 3 years ago

Should we tag the latest commit with 0.1.1?

Since you didn't find a third issue today, sure!

mdavidsaver commented 3 years ago

2021-01-14 18:34:59.061 SEVERE [Thread 1] org.csstudio.logging.PluginLogListener (logging) - Unhandled event loop exception
java.lang.NullPointerException
        at org.epics.pvaccess.client.impl.remote.ChannelImpl.getRemoteAddress(ChannelImpl.java:558)
        at org.diirt.support.pva.PVAChannelHandler.getProperties(PVAChannelHandler.java:313)
        at org.csstudio.diag.pvmanager.probe.DetailsPanel.setChannelProperties(DetailsPanel.java:214)
...

Also, I was seeing, and continue to see, a log message message similar to https://github.com/mdavidsaver/pvxs/issues/13#issuecomment-758495221. So I don't think it is related to the issue with processing of CMD_GET_FIELD requests (aka Introspect). Thinking about null is what led me to 0356eee74037a58e0e318ba6da5a1cc1ce2b4f82 though.

mdavidsaver commented 3 years ago

In https://github.com/mdavidsaver/pvxs/releases/tag/0.1.1

mdavidsaver / pvxs

Client Error while processing cmd 0x11 #13