slaclab / pysmurf

Other
2 stars 9 forks source link

fix(client.command.cryo_card): Catch EPICS timeout on read and retry. #794

Open tristpinsm opened 2 months ago

tristpinsm commented 2 months ago

Description

Addresses issue #792

Jira Issue

N/A

Tests done on this branch

It's a simple enough fix that that testing didn't seem necessary.

Function interfaces that changed

N/A

What was the interface before the change

N/A

What will be the new interface after the change

N/A

jlashner commented 2 months ago

I think we may want to test this... PVs in the smurf_command module use a 10 second timeout, so a retry happens roughly every 10 sec. We may want to implement the same thing here; as things are now I believe retries will happen immediately which I don't think we want.

tristpinsm commented 2 months ago

OK, I've added a timeout between restarts and I can try it out on a test system at SLAC this afternoon. Just to confirm that it is retrying as expected when it can't reach the cryo card. However I'm not sure how to determine the appropriate number of retries or time in between. Currently it's set to retry 5 times, at 5s intervals.

tristpinsm commented 2 months ago

Alright so I've tested CryoCard.read_temperature with no cryo card connected and it's retrying as it should:

[ 2024-05-17 19:25:19 ]  CryoCard.do_read waiting 5s before retry 2 / 5.
[ 2024-05-17 19:25:24 ]  CryoCard.do_read waiting 5s before retry 3 / 5.
[ 2024-05-17 19:25:29 ]  CryoCard.do_read waiting 5s before retry 4 / 5.
[ 2024-05-17 19:25:34 ]  CryoCard.do_read waiting 5s before retry 5 / 5.

However, after those attempts it returns a nonsense value, which is just how the code is designed.

jlashner commented 2 months ago

I'm confused how this is working, shouldn't the continue bypass the sleep and log? Maybe this is an instance where it does not timeout but the returned address does not match.

I think instead of manually sleeping, it makes more sense to set the timeout in the epics get fn, like here: https://github.com/slaclab/pysmurf/blob/1ff66325bf174f975c241b5cf6922f378dd58b20/python/pysmurf/client/command/smurf_command.py#L208-L211 5 sec timeout with 5 retries sounds good to me as a default!

tristpinsm commented 2 months ago

I'm confused how this is working, shouldn't the continue bypass the sleep and log? Maybe this is an instance where it does not timeout but the returned address does not match.

yeah, that's a mistake...

I think instead of manually sleeping, it makes more sense to set the timeout in the epics get fn, like here:

https://github.com/slaclab/pysmurf/blob/1ff66325bf174f975c241b5cf6922f378dd58b20/python/pysmurf/client/command/smurf_command.py#L208-L211

5 sec timeout with 5 retries sounds good to me as a default!

ok

jlashner commented 1 month ago

This is looking good to me... one more thing though, I think we should raise an exception on failure instead of just returning 0. Can we add that here?

tristpinsm commented 1 month ago

This is looking good to me... one more thing though, I think we should raise an exception on failure instead of just returning 0. Can we add that here?

Yeah, I agree that makes a lot more sense. I wonder if that change may break other code that is relying on the current behaviour though. I guess I can have look and see.