paulscherrerinstitute / pcaspy

Portable Channel Access Server in Python
BSD 3-Clause "New" or "Revised" License
32 stars 24 forks source link

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 66: invalid start byte #59

Closed hinxx closed 5 years ago

hinxx commented 5 years ago

I'm getting quite a lot of messages like these on the local network. The PVs are behind a PV gateway.

I'm not sure how critical the messages are, any idea? Thanks!

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 66: invalid start byte
[New Thread 0x7fff820b3700 (LWP 20742)]
filename="../../../../../../src/ca/legacy/pcas/generic/st/casDGIntfOS.cc" line number=464
Internal failure unexpected problem with UDP input from "iscr-lebt-cagw-01-d2.tn.esss.lu.se:42894"
Exception in thread CAServerThread:
RuntimeError: SWIG director method error. Exception Calling Python Code

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hinkokocevar/webpvsoftioc/webpvioc/server.py", line 79, in target
    self.server.process(0.1)
  File "/usr/lib64/python3.5/site-packages/pcaspy/driver.py", line 825, in process
    cas.process(delay)
SystemError: <built-in function process> returned a result with an error set

[Thread 0x7fff8a7dc700 (LWP 19841) exited]
xiaoqiangwang commented 5 years ago

0x92 in latin1 is a symbol. Is it accidentally in your source code and the file is not encoded in utf-8? Can you share your database definition?

hinxx commented 5 years ago

I surely can, I have several of DBs riding on the single server instance, utilizing multiple drivers, each with its own port name and a set of PVs to handle.

I've copy/pasted each of the DB definitions below:

pv_db = {
    # proton time frame array sizes PV
    'PROTON:FRAME:NELM': {
        'scan':  0,
        'count': 1,
        'type': 'int',
        'value': pv_array_size
    },
    # proton count PVs with different time resolution
    'PROTON:TOTAL': {
        'scan':  1.0,
        'count': 1,
        'type': 'float',
    },
    'PROTON:SECOND': {
        'scan':  1.0,
        'count': 1,
        'type': 'float',
    },
    'PROTON:MINUTE': {
        'scan':  1.0,
        'count': 1,
        'type': 'float',
    },
    'PROTON:HOUR': {
        'scan':  1.0,
        'count': 1,
        'type': 'float',
    },
    'PROTON:DAY': {
        'scan':  1.0,
        'count': 1,
        'type': 'float',
    },
    # proton count array PVs with different time resolution
    'PROTON:SECONDS': {
        'scan':  1.0,
        'count': pv_array_size,
        'type': 'float',
    },
    'PROTON:MINUTES': {
        'scan':  1.0,
        'count': pv_array_size,
        'type': 'float',
    },
    'PROTON:HOURS': {
        'scan':  1.0,
        'count': pv_array_size,
        'type': 'float',
    },
    'PROTON:DAYS': {
        'scan':  1.0,
        'count': pv_array_size,
        'type': 'float',
    },
    # proton rate PVs with different time resolution
    'PROTON:SECOND:RATE': {
        'scan':  1.0,
        'count': 1,
        'type': 'float',
    },
    'PROTON:MINUTE:RATE': {
        'scan':  1.0,
        'count': 1,
        'type': 'float',
    },
    'PROTON:HOUR:RATE': {
        'scan':  1.0,
        'count': 1,
        'type': 'float',
    },
    'PROTON:DAY:RATE': {
        'scan':  1.0,
        'count': 1,
        'type': 'float',
    },
    # proton rate array PVs with different time resolution
    'PROTON:SECONDS:RATE': {
        'scan':  1.0,
        'count': pv_array_size,
        'type': 'float',
    },
    'PROTON:MINUTES:RATE': {
        'scan':  1.0,
        'count': pv_array_size,
        'type': 'float',
    },
    'PROTON:HOURS:RATE': {
        'scan':  1.0,
        'count': pv_array_size,
        'type': 'float',
    },
    'PROTON:DAYS:RATE': {
        'scan':  1.0,
        'count': pv_array_size,
        'type': 'float',
    },
}

pv_db = {
    'SRV:TIME': {
        'scan':  1.0,
        'count': 40,
        'type': 'char'
    }
}

pv_db = {
    # image
    'COUNTER': {
        'scan':  1.0,
        'count': 1,
        'type': 'int',
        'value': 0
    },
    'IMAGE': {
        'scan':  1.0,
        'count': 40000,
        'type': 'short',
    },
    'XSIZE': {
        'scan':  1.0,
        'count': 1,
        'type': 'int',
        'value': 0
    },
    'YSIZE': {
        'scan':  1.0,
        'count': 1,
        'type': 'int',
        'value': 0
    },
    # profile
    'PROFILE': {
        'scan':  1.0,
        'count': 500,
        'type': 'float',
    },
    'PROFILESZ': {
        'scan':  1.0,
        'count': 1,
        'type': 'int',
        'value': 0
    },
    # position and sigma
    'POSITION': {
        'scan':  1.0,
        'count': 1,
        'type': 'float',
        'value': 0
    },
    'SIGMA': {
        'scan':  1.0,
        'count': 1,
        'type': 'float',
        'value': 0
    },
}

pv_db = {
    # image
    'PULSE': {
        'scan':  1.0,
        'count': pv_array_size,
        'type': 'float'
    },
    'PULSESZ': {
        'scan':  0.0,
        'count': 1,
        'type': 'int',
        'value': pv_array_size
    },
}
xiaoqiangwang commented 5 years ago

Your database looks all fine to me. I thought it could be from some ill character in "units" field.

Looking closely at the error message, it rises from the UDP message handling. Could be that some CA client is searching for a channel name with a 0x92 char? Can you run the example/casnooper.py and check the log?

hinxx commented 5 years ago

Will try out with the example/casnooper.py.

Meanwhile, I got this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 65: invalid start byte
[New Thread 0x7fff845a5700 (LWP 21357)]
filename="../../../../../../src/ca/legacy/pcas/generic/st/casDGIntfOS.cc" line number=464
Internal failure unexpected problem with UDP input from "iscr-lebt-cagw-01-d2.tn.esss.lu.se:33514"
Exception in thread CAServerThread:
RuntimeError: SWIG director method error. Exception Calling Python Code

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hinkokocevar/webpvsoftioc/webpvioc/server.py", line 79, in target
    self.server.process(0.1)
  File "/usr/lib64/python3.5/site-packages/pcaspy/driver.py", line 825, in process
    cas.process(delay)
SystemError: <built-in function process> returned a result with an error set
xiaoqiangwang commented 5 years ago

I tried to simulate a client searching for PV with illigal chars in the name. The program does not result in an exception like this. Also in your case the message is at least 65 in length, which is bit too long for a PV name.

Can you use server.setDebugLevel(10), so that it prints out detailed input requests.

hinxx commented 5 years ago

OK, looking at my collected log from casnooper.py, running over night, I can immediately spot the PV name that is too long:

hinkokocevar@TPhinxx ~/webpv/webpvioc $ echo CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo- | wc
      1       1      65

Another funny thing is that there are a lot of lines in the output that have '' for the PV name. How do these come about? Should I be worried? Example:

2018-12-05 22:40:16,300: isrc-lebt-ipc.tn.esss.lu.se:46469      <none>
2018-12-05 22:40:16,300: isrc-lebt-ipc.tn.esss.lu.se:46469      <none>
2018-12-05 22:40:16,300: isrc-lebt-ipc.tn.esss.lu.se:46469      <none>
2018-12-05 22:40:16,301: isrc-lebt-ipc.tn.esss.lu.se:46469      <none>

Will enable the debug output as suggested and report back.

xiaoqiangwang commented 5 years ago

How is the log from casnooper.py look like?

The empty PV name may come form some java clients, as explained here https://epics.anl.gov/tech-talk/2017/msg00723.php

hinxx commented 5 years ago

Here is the culprit. Your pointer about the too long PV name seems to be the right direction:

"gpn-cagw-02-ca.tn.esss.lu.se:51485" is searching for "LEBT-010:PBI-NPM-001:HCAM-SyncOut1Level"
CAS Request: ? on gpn-cagw-02-ca.tn.esss.lu.se:51485: cmd=6 cid=17777 typ=5 cnt=13 psz=48 avail=4571
"gpn-cagw-02-ca.tn.esss.lu.se:51485" is searching for "LEBT-010:PBI-NPM-001:HCAM-SyncOut1Level_RBV"
CAS Request: ? on gpn-cagw-02-ca.tn.esss.lu.se:51485: cmd=6 cid=17778 typ=5 cnt=13 psz=40 avail=4572
"gpn-cagw-02-ca.tn.esss.lu.se:51485" is searching for "LEBT-010:PBI-NPM-001:HCAM-SyncOut1Mode"
CAS Request: ? on gpn-cagw-02-ca.tn.esss.lu.se:51485: cmd=6 cid=17779 typ=5 cnt=13 psz=48 avail=4573
"gpn-cagw-02-ca.tn.esss.lu.se:51485" is searching for "LEBT-010:PBI-NPM-001:HCAM-SyncOut1Mode_RBV"
CAS Incoming: 128 byte msg from gpn-cagw-02-ca.tn.esss.lu.se:51485
CAS Response: cmd=0 id=0 typ=0 cnt=13 psz=0 avail=0 outBuf ptr=0x2783b98
CAS Request: ? on iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754: cmd=0 cid=34536 typ=1 cnt=13 psz=0 avail=0
CAS Request: ? on iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754: cmd=6 cid=107703 typ=5 cnt=13 psz=72 avail=1a4b7
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-�"
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 64: unexpected end of data
CAS Response: cmd=11 id=ffffffff typ=0 cnt=0 psz=64 avail=8e outBuf ptr=0x2783ba8
filename="../../../../../../src/ca/legacy/pcas/generic/st/casDGIntfOS.cc" line number=464
Internal failure unexpected problem with UDP input from "iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754"
ERR NPM:174 target - PV POSITION data error: 'NoneType' object is not subscriptable
ERR BCM:141 target - PV PULSE data error: 'NoneType' object is not subscriptable
ERR NPM:174 target - PV POSITION data error: 'NoneType' object is not subscriptable
Exception in thread CAServerThread:
RuntimeError: SWIG director method error. Exception Calling Python Code

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hinkokocevar/webpvsoftioc/webpvioc/server.py", line 80, in target
    self.server.process(0.1)
  File "/usr/lib64/python3.5/site-packages/pcaspy/driver.py", line 825, in process
    cas.process(delay)
SystemError: <built-in function process> returned a result with an error set

ERR NPM:174 target - PV XSIZE data error: 'NoneType' object is not subscriptable
ERR BCM:141 target - PV PULSESZ data error: 'NoneType' object is not subscriptable
ERR NPM:174 target - PV XSIZE data error: 'NoneType' object is not subscriptable
ERR BCM:141 target - PV PULSE data error: 'NoneType' object is not subscriptable
ERR NPM:174 target - PV COUNTER data error: 'NoneType' object is not subscriptable
ERR NPM:174 target - PV COUNTER data error: 'NoneType' object is not subscriptable
ERR BCM:141 target - PV PULSE data error: 'NoneType' object is not subscriptable
ERR BCM:141 target - PV PULSE data error: 'NoneType' object is not subscriptable
ERR BCM:141 target - PV PULSESZ data error: 'NoneType' object is not subscriptable
ERR NPM:174 target - PV PROFILESZ data error: 'NoneType' object is not subscriptable
ERR NPM:174 target - PV PROFILESZ data error: 'NoneType' object is not subscriptable
ERR BCM:141 target - PV PULSE data error: 'NoneType' object is not subscriptable
ERR NPM:174 target - PV SIGMA data error: 'NoneType' object is not subscriptable
ERR NPM:174 target - PV SIGMA data error: 'NoneType' object is not subscriptable
ERR BCM:141 target - PV PULSE data error: 'NoneType' object is not subscriptable
ERR NPM:174 target - PV YSIZE data error: 'NoneType' object is not subscriptable
ERR NPM:174 target - PV YSIZE data error: 'NoneType' object is not subscriptable
ERR BCM:141 target - PV PULSE data error: 'NoneType' object is not subscriptable

Actually it is the same PV from the message earlier, whose name seems to be too long.

hinxx commented 5 years ago

In the casnopper.py log I see these lines that explain the numerous different UDP packet issues as posted before (last char of the PV name that is too long) can vary depending on what it is in the memory at that location:

"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-P"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-P"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-P"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-P"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-P"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-P"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-P"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-P"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-P"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-P"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-P"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-P"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-P"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-P"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-P"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-"
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-�"
xiaoqiangwang commented 5 years ago

This PV name seems to be concatenated from several names CrS-TICP:Cryo CrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsg Nbr CrS-TICP:Cryo-. You could now check the gateway log to see which client has requested this PV.

hinxx commented 5 years ago

Ah, you are right.. I have not noticed that (not my IOC..). Thanks for the extensive help @xiaoqiangwang, I'll have our integrators and network people look into it.

hinxx commented 5 years ago

Could pcaspy be more cautious and check for the PV name length before or similar in order not to suffer from erroneous PV requests flying by?

xiaoqiangwang commented 5 years ago

I created a branch "non-utf-pv" to use latin1 codec if pv names are not utf8 encoded. It still prints the error message but should not result in exceptions. Can you give it a try?

hinxx commented 5 years ago

Casnooper.py from the non-utf-pv branch is running..

hinxx commented 5 years ago

No errors / exception for the whole weekend. I still see the long PV name:

"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-}"
2018-12-10 11:01:59,211: iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754   CrS-TICP:CryoCrS-TICP:Cryo-Virt-MSG1:Cmd_AckMsgNbrCrS-TICP:Cryo-}
CAS Request: ? on iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754: cmd=6 cid=1777824 typ=5 cnt=13 psz=32 avail=1b20a0
hinxx commented 5 years ago

And a test with the locally generated PV with too long name, also passes without errors/exceptions:

CA CLIENT:

(epics) -bash-4.2$ camonitor LRR:this-is-a-pv-with-the-long-name-that-is-quite-illegal-on-the-epics-network

SNOOPER:

2018-12-10 11:04:19,868: iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754   LRR:this-is-a-pv-with-the-long-name-that-is-quite-illegal-on-the
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "LRR:this-is-a-pv-with-the-long-name-that-is-quite-illegal-on-the"
2018-12-10 11:04:19,918: iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754   LRR:this-is-a-pv-with-the-long-name-that-is-quite-illegal-on-the
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "LRR:this-is-a-pv-with-the-long-name-that-is-quite-illegal-on-the"
2018-12-10 11:04:19,999: iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754   LRR:this-is-a-pv-with-the-long-name-that-is-quite-illegal-on-the
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "LRR:this-is-a-pv-with-the-long-name-that-is-quite-illegal-on-the"
2018-12-10 11:04:20,147: iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754   LRR:this-is-a-pv-with-the-long-name-that-is-quite-illegal-on-the
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "LRR:this-is-a-pv-with-the-long-name-that-is-quite-illegal-on-the"
2018-12-10 11:04:20,552: iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754   LRR:this-is-a-pv-with-the-long-name-that-is-quite-illegal-on-the
"iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754" is searching for "LRR:this-is-a-pv-with-the-long-name-that-is-quite-illegal-on-the"
2018-12-10 11:04:28,793: iscr-lebt-cagw-01-d2.tn.esss.lu.se:45754   LRR:this-is-a-pv-with-the-long-name-that-is-quite-illegal-on-the
xiaoqiangwang commented 5 years ago

A PV name no matter how long, if utf8 encoded, is not a problem. There is no limit imposed on the length of a PV name. I tried with PV name with 109 characters.

It is a problem with this specific client. It seems to limit PV name to 65 characters, but somehow the last character is filled with garbage.

In any case it is a good idea to guard against PV searches with illegal characters. I have merged the "non-utf-pv" branch to master.

xiaoqiangwang commented 5 years ago

Fixed in release 0.7.2.