slaclab / pysmurf

Retry on epics failure for cryocard commands #792

Open jlashner opened 4 months ago

jlashner commented 4 months ago

Describe the problem

Commands to the cryocard sometimes return None, signaling that an EPICS timeout occurred. Unlike the EPICS cagets in the smurf_command module, the cryocard do_read function has no retry_on_fail option, so if there is an EPICS failure, due to either the hardware connection or server load, it just returns None instead of trying again.

Describe the solution you'd like

It would be nice if we could have the option to retry_on_fail for cryocard commands.

tristpinsm commented 4 months ago

Hi Jack, I've been looking at CryoCard.do_read and it looks like as it is now it will retry, up to 5 times by default, when trying to read from a given address. So I'm wondering if there is somewhere else where this issue may be coming from? Do you have an example of a command that times out?

Also, looking at that code I'm not sure how it behaves if an epics timeout does occur...

#need double write to make sure buffer is updated
self.writepv.put(cmd_make(1, address, 0))
for self.retry in range(0, self.max_retries):
    self.writepv.put(cmd_make(1, address, 0))
    data = self.readpv.get(use_monitor=use_monitor)
    addrrb = cmd_address(data)
    if (addrrb == address):
        return (data)
return (0)

return (self.readpv.get(use_monitor=use_monitor))

My understanding is that a timeout would result in PV.get returning None, which should then raise an exception when cmd_address tries to interpret it as an int. (Also note the unreachable return statement at the end.)
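For illustration, here is a rough sketch of how the loop could skip a None result and keep retrying. It is not the actual pysmurf code: the bit layout in cmd_make and cmd_address is a made-up stand-in, FakePV is a hypothetical test double that mimics a PV whose get() returns None on timeout, and returning None after the retries run out (rather than 0 as above) is just one possible choice.

```python
# Hypothetical stand-ins for the cryocard helpers. The real bit layout in
# pysmurf may differ; this packing is an assumption for illustration only.
def cmd_make(read, address, data):
    return ((read & 0x1) << 31) | ((address & 0x7FF) << 20) | (data & 0xFFFFF)


def cmd_address(word):
    # Raises TypeError if word is None, which is the failure discussed above.
    return (word >> 20) & 0x7FF


class FakePV:
    """Test double: get() returns None (a timeout) for the first n_timeouts calls."""

    def __init__(self, n_timeouts):
        self.n_timeouts = n_timeouts
        self.last = None

    def put(self, value):
        self.last = value

    def get(self, use_monitor=False):
        if self.n_timeouts > 0:
            self.n_timeouts -= 1
            return None
        return self.last


def do_read(writepv, readpv, address, max_retries=5, use_monitor=False):
    """Read one address, treating a None from get() as a retryable failure."""
    # Double write, as in the existing code, to make sure the buffer is updated.
    writepv.put(cmd_make(1, address, 0))
    for _ in range(max_retries):
        writepv.put(cmd_make(1, address, 0))
        data = readpv.get(use_monitor=use_monitor)
        if data is None:
            # Timeout: skip the address check and try again.
            continue
        if cmd_address(data) == address:
            return data
    # All retries failed. Returning None here is an assumption; the existing
    # code returns 0, and raising an exception would be another option.
    return None


# Loopback usage: the same fake PV serves as both write and read channel.
pv = FakePV(n_timeouts=2)
result = do_read(pv, pv, address=0x3)
```

With two simulated timeouts the third attempt succeeds, so the retry budget is only used up when every get() fails.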

jlashner commented 4 months ago

Ya I think that's what I determined as well looking at this closer... the retry is failing because cmd_address cannot handle None inputs.

jlashner commented 4 months ago

One such failure is documented here: https://github.com/simonsobs/daq-discussions/discussions/91