paulscherrerinstitute / StreamDevice

EPICS Driver for message based I/O
GNU General Public License v3.0
28 stars 42 forks source link

IOC lockRequest messages and inconsistent record scanning #71

Open LeeHudsonDLS opened 3 years ago

LeeHudsonDLS commented 3 years ago

Hi, I'm in the early stages of characterising an issue I've been seeing but in case it's something obvious I thought I would raise it now. Our version of streamDevice is 2.8.18.

We have rolled this version of streamDevice out on a few IOCs and haven't had any issues except for on one IOC. The IOC in question monitors how long it takes to get a response from a device, it reports that some readback PVs aren't getting updated as fast as they were on an older version (2.5). The record in question and an "ai" record and has a scan rate of 0.2 seconds. By setting some asyn trace masks I was able to see exactly when the record was querying the hardware, below is the time in seconds between these queries:

2.8.18: avg 0.458166666666666 max 1.569 stdev 0.385062690171769

2.5: avg 0.342230769230769 max 0.777 stdev 0.220675811463744

As you can see there doesn't appear to be a big difference in the average but the max is double. It's also less consistent as shown in the stdev.

This particular IOC does also give the occasional error, on boot and during operation:

2021/03/24 14:26:57.778463 robot_serial BL04I-MO-ROBOT-01:CNTL_ERR_CODE lockRequest: status returned to normal 2021/03/24 14:26:56.867206 robot_serial BL04I-MO-ROBOT-01:DRYTMR lockRequest: status returned to normal 2021/03/24 14:26:56.867218 robot_serial BL04I-MO-ROBOT-01:DRY_SPEED lockRequest: status returned to normal 2021/03/24 14:26:56.867229 robot_serial BL04I-MO-ROBOT-01:DRY_TIME lockRequest: status returned to normal

I don't know if this is related.

It should be pointed out that i've seen this behaviour by only changing the streamDevice dependency.

I appreciate this isn't much to go on but I will investigate further, I just thought i'd raise the issue now in case you can see anything obvious.

Thanks

dirk-zimoch commented 3 years ago

I never had real-time in mind when writing this driver and did not pay much atttention on the performance values. To understand yout values better, please explain:

I wonder a bit that you see only "return to normal" messages but no actual error message. Need to check that.

Please provide full database, protocol files and sample data. Best send per mail directly to me.

LeeHudsonDLS commented 3 years ago

This IOC is using Serial

We have 146 records that can access this port but only 62 that have a regular scan rate.

That's a good point, the numbers aren't the time between record processes but the time between the communications triggered by the record processing. I got these values by setting asynSetTraceIOMask to report each command sent. As this is the time between messages sent from the IOC I don't think the device is the slow part, I see the subsequent response from the device in around 1ms (again using asynSetTraceIOMask).

I will send the IOC to you via email.

Sorry I missed some previous error messages:

`[Wed Mar 24 12:14:36 2021]2021/03/24 12:14:36.753586 main BL04I-MO-ROBOT-01:BCODE lockRequest: port robot_serial not connected

[Wed Mar 24 12:14:36 2021]2021/03/24 12:14:36.753613 main BL04I-MO-ROBOT-01:BCODE: Can't start @init handler

[Wed Mar 24 12:14:36 2021]2021/03/24 12:14:36.753622 main BL04I-MO-ROBOT-01:BCODE: Record initialization failed

[Wed Mar 24 12:14:36 2021]Bad init_rec return value PV: BL04I-MO-ROBOT-01:BCODE ao: init_record `

iocLog.txt

dirk-zimoch commented 3 years ago

The message "lockRequest: port robot_serial not connected" shows that pasynManager->queueRequest() fails when StreamDevice is starting the protocol. The error string "not connected" comes directly from the asyn driver. I see you are using drvAsynIPPortConfigure, so it is a TCP device, not serial. The message means that the TCP socket is not (yet) connected when communication is supposed to start. For that reason, all @init handlers have failed during iocInit as well as the first few processings. Only 3 seconds later, the asyn port connects and StreamDevice reports "status returned to normal". From that time on, everything works normally, from my point of view. I will have a look how to force (and/or wait for) connect in iocInit. But this will introduce timeouts for any device that is not online when the ioc starts. For the time being, you may try to add connect <timeout_ms>; to the beginning of your @init handlers.

LeeHudsonDLS commented 3 years ago

Thanks Dirk, thanks for looking into this and explaining the issue. We are using a terminal server over TCP to talk to a serial device. What is odd is this method of communication is extremely common at diamond and using the older version of stream device (2.5) we never see this error.

dirk-zimoch commented 3 years ago

I changed something in the connection handling, using pasynManager->exceptionCallbackAdd() now to get asynchronous connect/disconnect events (and call @init again when the device re-connects). I need to check if that degrades the behavior in iocInit when connecting is slow...

Anyway, this has nothing to do with the processing time changes.

dirk-zimoch commented 3 years ago

Comparing with my older code, I see that before relying on auto-connect, I used asynQueuePriorityConnect in the pasynManager->queueRequest() call when the device was not connected. This forced an connection attempt with a potential timeout in every communication attempt (I think). The idea behind auto-connect was to simply skip the device while it is disconnected and leave it to asyn to re-connect. But I see that this leads to problems at initialization. On the other hand, I want to avoid to time out as many times as there are @init handlers for the same device. Unfortunately the records do not really know which other records belong to the same device/asyn port. So I cannot say "wait only once per device".

LeeHudsonDLS commented 3 years ago

Hi Dirk, I don't know this module intimately but are you saying that using the asynQueuePriorityConnect would mask any slow to respond hardware by waiting for the connection rather than continuing without a connection? The error messages I get look worse than they are considering the device does connect shortly after.

dirk-zimoch commented 3 years ago

Using asynQueuePriorityConnect makes the I/O call wait for connect. That slows down the IOC startup until the device is connected instead of starting without a connection. But what if the device is offline and not just slow? That will slow down IOC startup a lot: TIMEOUT*recordswith@init.

And it slows down the port thread later when the device drops out, but you may not see that as the rest of the IOC will continue to run, only all records connected to the same port will get LockTimeout. However you may see that as the asyn thread hanging when the IOC is terminated.

As the @init should run once the device connects, everything should be fine. But I will see if I can do someting about it anyway. That may take some time though....

LeeHudsonDLS commented 3 years ago

OK thanks Dirk, now that I understand the errors I think I can run with them. The other issue is the performance but I guess that's potentially a rabbit hole...

dirk-zimoch commented 3 years ago

For that I would probably need access to such a device. Simply you telling me "something is slower" does not really allow me to track things down. What you can do: Build intermediate versions of StreamDevice and hunt down which commit caused the performance loss. Then I may understand what is going on. Maybe start with the latest 2.7 version and compare it to the first 2.8 version.

LeeHudsonDLS commented 3 years ago

OK thanks, i'll do that. Because i'm comparing to a very old version it may be the case that the performance has gotten worse over many commits but I will verify that. Thanks

LeeHudsonDLS commented 2 years ago

Apologies for the slow progress on this. I have narrowed the issue down to commit d7770846aa45c2d746450cc52a757ab03bf0fe99

The issue if present after this commit but not before it.