Open philipstarkey opened 7 years ago
Original comment by Ian B. Spielman (Bitbucket: Ian Spielman).
This is to allow worker processes to reside on an arbitrary computer, so that the camera, the NI cards, or what have you can be on different physical machines, while the BLACS front end still lives on a master computer (which might have no actual hardware attached at all).
Original comment by Philip Starkey (Bitbucket: pstarkey, GitHub: pstarkey).
Sounds good. I'd like to request that the kwarg be called something other than `host`, so that it can be easily differentiated from whatever we standardise for secondary instances of BLACS. Something like `worker_host` or similar?
Original comment by Philip Starkey (Bitbucket: pstarkey, GitHub: pstarkey).
Related proposal: https://labscript-suite-temp.github.io/labscript-suite-bitbucket-archive/#!/labscript_suite/blacs/issues/13/launch-blacs-as-a-secondary-control-system (labscript-suite-temp/blacs#13)
Original comment by Ian B. Spielman (Bitbucket: Ian Spielman).
I looked at this second proposal, and while related, it attempts to solve a different problem where one wants a GUI on a different machine. The proposal here is different: the worker processes of one BLACS instance should be spawned on a different computer, and the interprocess communication will take place using the current model (ZMQ + h5 files). So really the only "new" feature is launching the processes on a different network-accessible computer and initializing the communication.
Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).
Ok so I gave this a try and here is what I came up with:
The code is tested and works, but there are still a few problems that I could use help with before creating a pull request:

- I was not able to extract the ports from zprocess without modifying it, so I relay all messages over the WorkerServer. This might slow things down. Does anyone know how to solve this in a better way?
- BLACS takes a long time to load if the WorkerServer is not running, as it waits out the timeout for each device. If you choose the timeout too low, however, BLACS will not wait until the worker and relay threads are created.
- Transition to buffered passes the local file path relative to the BLACS system, while the remote worker expects a local path relative to its own system. This is a problem if the shared drive has a different path on the two systems.
- Is there a better place than the connection table properties to store `worker_host`?
- The port is currently hardcoded; should this move to the connection table (editable for each device) or LabConfig (one port for all devices and servers)?
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).
Thanks for the work! Yep, I will take a look soon.
Original comment by Fan Yang (Bitbucket: fanyangphy).
Hello,
Not sure if I'm reinventing the wheel, but here's how I implemented remote device worker launch / control:
In zprocess there are already half-implemented `RemoteProcessClient` and `RemoteProcessServer` classes. What seems to be missing is that in `Process.subprocess()`, if a remote process client is specified, some sort of proxy tunnel between the client and the server should be established, and the parent / heartbeat / broker host and port info passed to the worker should be replaced by the proxy port established by the remote process server. I took the liberty of implementing the proxy server / client as a pair of zeromq dealer-router sockets. I also have the proxy server and client listen for heartbeats from the workers, so unused proxy tunnels will be closed if the worker terminates. See zprocess:proxy-support. The server can be started by executing `python __main__.py <tui>` under `zprocess/remote/`, where the option `tui` will give you a curses-based UI for monitoring connected remote process clients, associated processes and proxy tunnels.
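For illustration, the forwarding part of such a dealer-router proxy can be as small as the sketch below; this is not the code on the zprocess:proxy-support branch, and the ports are placeholders.

#!python
# Minimal ROUTER/DEALER forwarding proxy (illustration only, not zprocess code).
# Remote clients connect to the ROUTER "proxy port"; messages are relayed to
# whatever connects to the internal DEALER endpoint, and replies flow back.
import zmq

ctx = zmq.Context.instance()

frontend = ctx.socket(zmq.ROUTER)
frontend.bind('tcp://*:7440')            # placeholder proxy port

backend = ctx.socket(zmq.DEALER)
backend.bind('tcp://127.0.0.1:7441')     # placeholder internal endpoint

zmq.proxy(frontend, backend)             # blocks, shuttling messages both ways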
While the remote process server is running on the remote computer where devices are physically present, I made the following changes in `BLACS/__main__.py` and `labscript.py`:

- In `Device.__init__()`, a kwarg `remote_device` is expected to be a dictionary containing host and port info of the `RemoteProcessServer`, and is saved to the connection table.
- `Tab.__init__()` will take this information to instantiate a `RemoteProcessClient`.
- `DeviceTab.create_worker()` instantiates the worker class with the remote process client passed, which basically calls `Process.subprocess()` to start the connection with the remote process server.

To set up a remote device, supply a kwarg of the form `remote_device={'host': <host>, 'port': <port>, 'proxy_port': <port>}` to the device entry in `connectiontable.py`, start the remote process server on the remote computer, open the ports (7340 and 7440 are the defaults) in the firewall, and it should be ready to go.
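For concreteness, a hypothetical connection table entry might look like the sketch below. The `DummyRemoteCamera` class, its import path, and the host / port values are all placeholders; only the shape of the `remote_device` dictionary comes from the description above.

#!python
# Hypothetical connectiontable.py fragment (device class, host and ports are
# placeholders; only the remote_device dictionary shape is described above).
from labscript import start, stop
from user_devices.DummyRemoteCamera import DummyRemoteCamera  # placeholder device

DummyRemoteCamera(
    name='remote_camera',
    remote_device={'host': '192.168.1.50', 'port': 7340, 'proxy_port': 7440},
)

if __name__ == '__main__':
    start()
    stop(1)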
The rationale behind having a zeromq proxy server is that it simplifies the network setup: only two ports need to be configured in the firewall of the remote process server: the "control port" which is the good old ZMQServer port, and the "proxy port" that is the port of the zeromq router-dealer pair. It seems tricky to get the ip address of the client connecting to a zeromq socket as they don't expose that information in the API, and even if we could it won't be of much use if the client lives behind NAT. So a zeromq proxy seems like the best bet.
One problem that I haven't fixed is that the worker would need to have read (and write, if it's an acquisition device) access to the h5 file during transition to buffered and transition to manual. My idea would be to replace the filename with a python file-like object. This file-like object can be a custom class made to access the file through the zeromq proxy tunnel.
Basically the client end would look like:
#!python
class RemoteBytesIO(object):
    def __init__(self, socket):
        self.socket = socket

    def __getattr__(self, name):
        # Only forward a whitelisted set of file methods over the socket
        if name not in ['read', 'write', 'seek', 'tell', 'truncate', 'flush']:
            raise AttributeError(name)

        def func(*args, **kwargs):
            if name == 'write':
                args = (bytearray(args[0]),)
            self.socket.send_pyobj([name, args, kwargs])
            if self.socket.poll(timeout=1000) == 0:
                raise TimeoutError
            retval = self.socket.recv_pyobj()
            return retval

        return func
and the server end would look like:
#!python
import io

# `socket` is a zeromq socket (e.g. REP) set up elsewhere, and `fname` is the
# path of the h5 file on the machine running this loop.
with io.open(fname, 'r+b') as f:
    while True:
        try:
            cmd, args, kwargs = socket.recv_pyobj()
            print(cmd, args, kwargs)
            result = getattr(f, cmd)(*args, **kwargs)
            print(result)
            socket.send_pyobj(result)
        except KeyboardInterrupt:
            break
This approach does not require copying the whole file back and forth, which can be a problem if there are more than a couple remote devices.
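As a hedged illustration of how the two ends might be wired together (not code from the branch above): the client side could hand a REQ socket to RemoteBytesIO, while the server loop above sits behind a matching REP socket. The endpoint and port here are assumptions.

#!python
# Hypothetical glue code for the client side; the endpoint and port are
# placeholders, and RemoteBytesIO is the class sketched above.
import zmq

context = zmq.Context.instance()
request_socket = context.socket(zmq.REQ)
request_socket.connect('tcp://remote-host:7441')   # placeholder endpoint

remote_file = RemoteBytesIO(request_socket)
remote_file.seek(0)
signature = remote_file.read(8)
print(signature)   # b'\x89HDF\r\n\x1a\n' for a valid HDF5 file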
The code has been tested but may need to be cleaned up a bit.
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).
Hi Fan,
Thanks for all the work! I'll have to look over how what you've done works, or at least the zprocess side of things. You've clearly understood what I was going for there, so it looks like this ought to be compatible with our expectations/plans.
The use of proxy sockets is interesting - I had thought long and hard about how to route all the traffic for a remote process through a single zmq socket and decided it was basically impossible to do what we wanted in that way, even though it would be preferable for e.g. NAT usage. So I will read over your implementation and see if I was wrong about that, or if there is some crucial functionality I was wanting that can't be done if using a proxy. I forget my reasoning, or why I thought it was impossible, but I'm sure reading your code will remind me.
The curses interface is a nice touch! I should do that for the other servers like the locking server...
Regarding file I/O, the labscript programs already can run on separate computers, and only exchange filepaths rather than files. The files are expected to be on a network drive, and the filepaths are stripped of the network drive prefix (i.e. drive letter) before sending, with the prefix added by the recipient upon receiving (see labscript_utils.shared_drive). We ensure no two clients have the file open in write mode simultaneously using a network locking server (see labscript_utils.h5_lock and zprocess.locking).
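For illustration only, the prefix-swapping idea looks roughly like the sketch below; this is not the labscript_utils.shared_drive API, and the prefixes are made up.

#!python
# Sketch of the prefix-stripping idea described above, with made-up prefixes.
SENDER_PREFIX = 'Z:/Experiments'    # shared drive path on the sending machine
RECEIVER_PREFIX = '/mnt/shared'     # same drive as mounted on the receiving machine

def strip_prefix(path, prefix=SENDER_PREFIX):
    """Convert a local path to a drive-agnostic path before sending it."""
    assert path.startswith(prefix)
    return path[len(prefix):]

def add_prefix(agnostic_path, prefix=RECEIVER_PREFIX):
    """Convert a received drive-agnostic path back into a local path."""
    return prefix + agnostic_path

local_path = 'Z:/Experiments/2019/01/shot_0001.h5'
print(add_prefix(strip_prefix(local_path)))   # /mnt/shared/2019/01/shot_0001.h5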
If configuring all computers to use a shared drive (which may be located on one of them) has any major drawbacks, I might be in favour of including network file I/O in the labscript suite, but so far the shared-drive approach has worked well for us, so I think I don't see a great need for this at the moment.
I can't promise when I'll get a chance to look over this, though I might in the next few days (especially since the US government shutdown means I can't go to the lab). Feel free to make a pull request for the zprocess changes, and we can discuss it further there. We should get the zprocess implementation solid first since it is the underlying layer for the other bits.
Thanks again for contributing!
Original comment by Fan Yang (Bitbucket: fanyangphy).
Hi Chris,
I am glad to contribute! We (Lev lab at Stanford) have recently deployed labscript on one of our machines and we are quite happy about it. So I thought we should also contribute.
Regarding the proxy socket, the zeromq router socket is magical! Basically if you have client sockets connect to it and declare their identities (which is just a bytestring you get to pick), then on the proxy server side, if you send a multipart message to the router socket with the first part being a bytestring matching the identity of one of the connected clients, this message will automatically get routed to that client. Similarly, if you receive a message from the router socket (coming from one of the clients), the router socket will prepend it with the client identity and deliver it to you. You can then look up which local sockets correspond to this client and relay the message. This is how you keep track of multiple clients on the remote process server.
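A minimal sketch of the identity-routing behaviour described here (illustration only, not zprocess code; the identity, endpoint and port are placeholders):

#!python
# ROUTER/DEALER identity routing: the DEALER declares an identity, and the
# ROUTER addresses it by that identity when sending.
import zmq

ctx = zmq.Context.instance()

router = ctx.socket(zmq.ROUTER)
router.bind('tcp://127.0.0.1:5555')

client = ctx.socket(zmq.DEALER)
client.setsockopt(zmq.IDENTITY, b'worker-1')     # the identity we get to pick
client.connect('tcp://127.0.0.1:5555')

client.send(b'hello')
identity, msg = router.recv_multipart()          # ROUTER prepends the identity
print(identity, msg)                             # b'worker-1' b'hello'

router.send_multipart([b'worker-1', b'routed reply'])  # addressed by identity
print(client.recv())                             # b'routed reply'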
Although I claimed that this should work with NAT, I have not actually tested the use case. I guess I'll test it and report back.
Regarding file IO, I did not know about the drive prefix bit. Seems like a decent solution, given that h5_lock is in place to serialize reads / writes (which I did not know about either).
I'll submit a pull request soon.
Original report (archived issue) by Ian B. Spielman (Bitbucket: Ian Spielman).
All BLACS devices should accept a "host" kwarg allowing that device to be launched on a suitably configured remote computer. Currently desired for Python-based camera devices, but nice in many cases.