simonsobs / socs

Simons Observatory specific OCS agents.
BSD 2-Clause "Simplified" License
12 stars 12 forks source link

TimeoutLockless PCU agent #600

Closed jlashner closed 6 months ago

jlashner commented 6 months ago

I realize it may not have been clear what I was saying about restructuring the PCU agent to avoid timeout locks since we still don't have any examples, so thought I'd throw it into a branch here for people to see and use if they'd like.

Description

Restructure HWP PCU to use new lockless agent structure. I think this is a easier way to structure the agent, and should solve all of the issues that have to do with lock-acquisition, and generally make the agent more threadsafe.

Motivation and Context

Solves locking issues, and serves as an example for timeout lockless agents

How Has This Been Tested?

Not tested

Types of changes

Checklist:

17-sugiyama commented 6 months ago

I added the 'hold' option to the process_action. The combination of on/off has been changed -- Kyohei found more efficient operation to stop the HWP rotation. https://github.com/simonsobs/socs/blob/61d971fda05564bd84a90787c813efd79ebe9fe2/socs/agents/hwp_pcu/agent.py#L41-L43

17-sugiyama commented 6 months ago

Many thanks for updating the pcu agent! I tested the send_command function with the pcu instrument and it worked as expected. However, _get_and_publish_data does not seem to work. Nothing is published, but there is no error message.

The pcu uses a relay module and I have defined get_status according to the sample code: https://github.com/numato/samplecode/blob/master/RelayAndGPIOModules/USBRelayAndGPIOModules/python/usbrelay1_2_4_8/relayread.py This command does not seem to be very stable and often takes a long time. The problem with _get_and_publish_data may be due to get_status being unstable. So I recommend to publish the status as the last sent command. I would like to have your opinion.

jlashner commented 6 months ago

Hi Junna, no problem!

I do think reading the status from the relays instead of using the last sent command is very important. It will make the data from the agent a lot more trustworthy, especially in cases where:

I suggest we invest some time to understand why the get_status function is not behaving properly.

17-sugiyama commented 6 months ago

I got the importance of reading the status. I made _get_and_publish_data work by increasing the sampling rate to every 30 sec. https://github.com/simonsobs/socs/blob/61d971fda05564bd84a90787c813efd79ebe9fe2/socs/agents/hwp_pcu/agent.py#L130C1-L130C35 This is the published status when I set the operation mode as hold --> off --> on_1 --> hold.

  | 2023-12-21 14:25:00 | hold
  | 2023-12-21 14:24:00 | hold
  | 2023-12-21 14:23:00 | hold
  | 2023-12-21 14:22:00 | failed
  | 2023-12-21 14:21:00 | failed
  | 2023-12-21 14:20:00 | failed
  | 2023-12-21 14:19:00 | on_1
  | 2023-12-21 14:18:00 | on_1
  | 2023-12-21 14:17:00 | failed
  | 2023-12-21 14:16:00 | failed
  | 2023-12-21 14:15:00 | off
  | 2023-12-21 14:14:00 | hold

Although it takes a few minutes to obtain the correct status immediately after changing the operating mode, the response becomes stable thereafter. If you are comfortable with this sample rate, this may be the solution.

jlashner commented 6 months ago

Thanks Junna, this is much better! Do you think we can log the "response" in the cases where it's failing so we can get a better idea why? Other than that I think this is fine to merge. It may be worth coming back later and figure out how to make get_status more reliable though.

ykyohei commented 6 months ago

Actually, I was testting reliable get_status, and I could do it by the following script. It also takes more time to send command but acceptable, I think. For some reason, I needed to switch all 8 relays to get_status correctly.

I'm happy to implement more reliable get_status later. But I would like to use this agent for satp3 observations ASAP.

import serial
import argparse
import time
import numpy as np

patterns = {
    'off':  [0, 0, 0, 0, 0, 0, 0, 0],
    'on_1': [1, 1, 1, 0, 0, 1, 1, 1],
    'on_2': [1, 1, 1, 0, 0, 1, 0, 0],
    'stop': [0, 1, 1, 0, 0, 1, 0, 0],
    'hold': [1, 1, 1, 0, 0, 1, 0, 0],
}

class pcu:
    def __init__(self, port="/dev/ttyACM0"):
        self.PCU = serial.Serial(
            port=port,
            baudrate=19200,
            timeout=1,
            bytesize=serial.EIGHTBITS,
            stopbits=serial.STOPBITS_ONE
        )

    def close(self):
        self.PCU.close()

    def wait(self):
        time.sleep(.1)

    def read(self):
        msg = self.PCU.read_until(b'\n\r').strip().decode()
        out = self.PCU.read_until(b'\n\r').strip().decode()
        return msg, out

    def get_status(self):
        self.PCU.write(str.encode('relay readall\n\r'))
        self.wait()
        out = int(self.read()[1], 16)
        print('pattern', [(out >> i & 1) for i in range(8)])
        for k, v in patterns.items():
            v = np.sum([vv*(2**i) for i, vv in enumerate(v)])
            if out == v:
                print('current state is '+k)
                return
        print('current state is undefined')

    def test(self, command):
        pattern = patterns[command]
        for j, i in enumerate(pattern):
            self.PCU.write(str.encode("relay "+{0:'off', 1:'on'}[i]+' '+str(j)+"\n\r"))
            self.wait()
            print(self.read())
        self.wait()
        print(command + ' is done')

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('command', type=str,)
    args = parser.parse_args()
    pcu = pcu()
    pcu.test(args.command)
    pcu.get_status()
    pcu.close()
ykyohei commented 6 months ago

I guess the reason why the current get_status sometimes fails because there is exception for the

response = self.port.read(25)
response = response.decode('utf-8')

I'm not sure if this is the best way, but following method in my script worked as long as I tested.

def read(self)
     msg = self.PCU.read_until(b'\n\r').strip().decode()
     out = self.PCU.read_until(b'\n\r').strip().decode()

self.PCU.write(str.encode('relay readall\n\r'))
out = int(self.read()[1], 16)
ykyohei commented 6 months ago

@jlashner Can we merge this branch now? I do understand the importacne of get_status, but I would like to start using this agent in the satp3 observation over the holidays. Start using this agent is really important step for the observation using HWP now. I will be carefully watching the satp3 observation until we implement the get_status.

jlashner commented 6 months ago

Hi kyohei, yep! Totally fine with merging this now and then figuring out remaining issues after vacation. Good luck!