Closed 17-sugiyama closed 9 months ago
@BrianJKoopman a bit confused about your comment (I think it was me who suggested non-zero timeouts). Why would a timeout of zero help here?
I think the problem is the long acq
loop time: https://github.com/simonsobs/socs/blob/aaf066e974c0eaea3aff420ae9eb077977174213/socs/agents/hwp_pcu/agent.py#L193
While this is sleeping, the lock cannot be acquired by other processes. I think to fix we should make this a fast loop (.1 sec or so) and only take and publish data on iterations where 5 sec have elapsed.
Or we could also restructure the agent to avoid TimeoutLocks.
I think the problem is the long
acq
loop time:While this is sleeping, the lock cannot be acquired by other processes. I think to fix we should make this a fast loop (.1 sec or so) and only take and publish data on iterations where 5 sec have elapsed.
Ah, yup, agreed this 5 second sleep is the issue, the task timeouts would need to be longer than that. Sorry for the noise. I've just seen non-zero acq timeouts cause issues in the past, and the example in the docs has a 0 timeout in long running processes.
I realize it might not be clear what I'm talking about when I say lockless agent restructure, so I threw together what I have in mind in this draft PR for reference which should fix the locking issues. Feel free to use that fix if you would like, but may need some testing:
Thank you so much for your comments and suggestions. I prefer to use Jack's lockless agent. I tested Jack's agent with the PCU instrument and it worked. I left some comments to #600. I'd like to close this pull request.
I tested the recently added hwp_pcu agent, and found that 3 seconds of timeout is sometimes too short to communicate with the HWP PCU module. This is a request to extend timeout for sending commands/receiving messages.
Description
I extended the timeout for
send_command
from 3 sec to 10 sec; and forget_status
to 10 sec.Motivation and Context
Improving the operation stability
How Has This Been Tested?
I tested this agent in the lab and it worked. The recently added agent worked fine except for the timeout issue.
Types of changes
Checklist: