nengo / nengo-extras

Extra utilities and add-ons for Nengo
https://www.nengo.ai/nengo-extras

Implemented UDP socket utility function for Nodes #53

Open studywolf opened 6 years ago

studywolf commented 6 years ago

Motivation and context:

Communication between Nengo nodes locally or over a network using UDP sockets. Originally and almost entirely written by @xchoo; I added documentation and a few fixes for bugs that crop up when testing non-local communication.

How long should this take to review?

Types of changes:

Added nengo_utils/sockets.py and nengo_utils/tests/test_sockets.py

Checklist:

tbekolay commented 6 years ago

Pushed a commit with some refactoring. The tests pass for me locally -- except for one run where the time_sync test failed. Likely thread-related and hard to reproduce, but it would be annoying to have to restart TravisCI jobs when the test randomly fails... perhaps the time sync could be made more robust? How else can we test this, @xchoo?

xchoo commented 6 years ago

Hmmm. I'm not sure. The problem is that two independent nengo models are required to test the time_sync functionality. We might be able to test it in one model, but that might also just cause the entire model to deadlock. I suppose you can drop that test for now until we figure out a more robust solution?

xchoo commented 6 years ago

Does this support IPv6? Do we want to support IPv6?

No, I don't think it supports IPv6. But since it uses the python socket class, if that supports IPv6, then so should this class (with no changes necessary)
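For reference, the address family is fixed when the Python socket object is constructed, so IPv6 support would mostly be a matter of passing `AF_INET6` (untested sketch):

```python
import socket

# IPv4 UDP socket (what the current code effectively creates)
sock4 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# IPv6 UDP socket; the sendto/recvfrom API is identical, but addresses
# become (host, port, flowinfo, scope_id) 4-tuples
sock6 = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)

sock4.close()
sock6.close()
```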

When a class has a close method, it is an indicator to me that it maybe deserves to be a context manager (i.e. be usable with with). Is there a reason to not make UDPSocket a context manager?

Perhaps? I will leave it to someone with more expertise with context managers to decide.
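For reference, adding context-manager support would only take two methods; a minimal sketch using a hypothetical stand-in class (not the PR's actual UDPSocket):

```python
import socket

class UDPSocketSketch:
    """Hypothetical stand-in for the PR's UDPSocket class."""

    def __init__(self, host="127.0.0.1", port=0):
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self._sock.bind((host, port))  # port=0 lets the OS pick one

    def close(self):
        self._sock.close()

    # Context manager protocol: `with UDPSocketSketch() as s:` guarantees
    # close() runs even if the body raises.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not suppress exceptions
```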

To me it seems more logical to send the actual timestamp instead of t + dt / 2 and leave it to the receiver to verify it.

The timestamp send code was written to work unambiguously with the Nengo receive code, and the Arduino receive code was written with the same logic (i.e. t >= T and t < T + dt). The change was made to t + dt/2 because the original implementation using t was giving problems with rounding. If someone would like to retest the t implementation with different hardware, by all means! 😄
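For illustration, the windowing logic described here: the receiver accepts a stamp for the step starting at T if it falls in [T, T + dt), and stamping with t + dt/2 keeps the value in the middle of that window, away from the edges where float rounding can flip the comparison:

```python
dt = 0.001

def accepts(T, stamp):
    # Receiver-side check from the discussion: the packet belongs to
    # the step starting at time T if its stamp falls in [T, T + dt).
    return T <= stamp < T + dt

# Stamping with exactly t can land on a window edge once float error
# accumulates; stamping with t + dt/2 sits mid-window, far from both edges.
t = 0.003
assert accepts(t, t + dt / 2)          # accepted by the matching step
assert not accepts(t + dt, t + dt / 2) # rejected by the next step
```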

What is the timeout logic supposed to do? And what is it supposed to be? At the moment the logic seems to be distributed across multiple places which makes it hard to grasp.

It's complicated. There are multiple levels of timeouts happening.

  1. The initial socket creation could time out
  2. Waiting for a packet could time out
  3. Receiving packets with non-current data, and deciding when to terminate the socket because consistently stale data is being received.

And I'm sure there are other things in there as well.

tbekolay commented 6 years ago

Pushed a commit refactoring this into three separate processes. I'll respond to the inline questions given the new code, but it's a big change so it's worth people looking at this again. In particular, I removed the ability to connect to multiple ports in the same connection (make more processes if that's desirable), and used numpy instead of struct to do byte conversions. Also, we previously did not bind the send socket, but in this version I do. The unit tests pass consistently, but without other examples it's hard to know if I've changed anything critical.

One other comment: do we actually want the ignore_timestamps argument? It seems to add little yet complicate the code a fair bit. I'd favor dropping it if there isn't a strong argument for having it.

tbekolay commented 6 years ago

Oh also, if anyone's curious why I added a shutdown before close, see this blog post: https://engineering.imvu.com/2014/03/06/charming-python-how-to-actually-close-a-socket-by-calling-shutdown-before-calling-close/
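For reference, the pattern from that post looks roughly like this (the OSError guard is needed because shutdown on an unconnected UDP socket raises):

```python
import socket

def close_socket(sock):
    """Shut down both directions before closing, per the linked post.

    shutdown() tells the kernel (and any peer) that we are done sending
    and receiving; close() then merely releases the file descriptor.
    """
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        # The socket may already be disconnected; close it regardless.
        pass
    sock.close()

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
close_socket(sock)
```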

jgosmann commented 6 years ago

Oh also, if anyone's curious why I added a shutdown before close

👍 I remember coming across this when I refactored the Nengo GUI server

tbekolay commented 6 years ago

The timeout stuff still confuses me.

I'm a bit confused about it as well... but presumably @xchoo put it in for a reason? My current thinking on why it's necessary is that, without the timeout thread, the close methods don't get called until garbage collection. Doing it manually in a model is not easy either, as the step function returned by the process is not easy to access once the model is built. I started working on a PR for Nengo to expose that more easily, but I abandoned it because it was too complex and did not result in nice model code (which is also why doing this as a context manager does not work well, fyi). The timeout thread will at least ensure that the sockets will be shutdown and closed when the model has not been run for a while. Of course, running this in a GUI setting in which you want to pause and play a few times won't work well. But resetting the model and running again should work.

I would expect the timestamps to match the actual timestamp the data was recorded at

I don't really care that much about the timestamp, as long as it works. Feel free to change it and see if it works.

jgosmann commented 6 years ago

Spend some time looking at the timeout logic. Let's see whether I understand this correctly:

  1. The socket needs to be closed after usage, but the nengo.Process has no way of knowing when a simulation ended. Thus, the SocketCloseThread is created to shutdown the socket when there haven't been any calls to recv or send (which is equivalent to calls to the step function created by the Process) for some time.
  2. The SocketCloseThread might time out early and close the socket prematurely (either because a simulation time step took very long, or because the simulation was “interrupted” and continued with another call to sim.run(t)). Thus, we reopen the socket in this case. But why is this only done for a receiving and not for a sending socket?
  3. If reopening a socket, it will be retried repeatedly with increasing wait times (“backoff”) until it succeeds. It seems to me that with a proper shutdown (socket.shutdown before socket.close) this should be much less of a problem. Maybe that code isn't required anymore?
  4. A timeout is set on the blocking operations of the socket. This timeout is not handled when sending data (why?). When receiving data, the receive is simply retried. So it seems that the timeout for the socket could be made infinite, except that the SocketCloseThread timeout needs to be reset to prevent the socket from being closed. However, the socket timeout is set to the max timeout of SocketCloseThread, so it is likely that the socket gets closed anyway and needs to be reopened.

Assuming my understanding is correct, it feels to me that there should be (might be?) a better solution. But as a first step, it would probably be best to agree on how long the socket should be alive in an ideal world (and then see if that can be implemented):

  1. The current implementation tries to have the socket alive for the duration the sim.run method is executed.
  2. But maybe the life of the socket should be tied to the simulator (opened when building the model, closing with sim.close)? Of course currently there is no way for the nengo.Process to know about the close and implementing this might be complicated as @tbekolay says. Though, something similar (in a more general fashion) has been requested before (nengo/nengo#1118).
  3. Maybe the user themselves should manage the lifecycle of the socket? This seems slightly problematic, as there is no way to properly close the socket in the Nengo GUI, and even in a Python script it could mean wrapping most of it in a with statement, which doesn't seem nice either.

tbekolay commented 6 years ago

But why is this only done for a receiving and not for a sending socket?

That's how it was originally implemented, so I maintained that same logic. Perhaps it should happen for both? Originally, also, the sending socket was not bound to a particular address and port, in part because it used to be possible to send to multiple ports (which didn't really make much sense). I removed that, which meant that I bound the socket once opened, which perhaps also necessitates doing the reopening? On the other hand, sending to a socket should never time out because in the worst case you're filling up some remote buffer, whereas receiving can easily time out if the process you're receiving from is slower than you, or blocks for some other reason.

Maybe that [backoff] code isn't required anymore?

Retrying with increasing backoff is a pretty standard technique in networking, so I would be hesitant to remove it (see, e.g., http://docs.aws.amazon.com/general/latest/gr/api-retries.html).
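For reference, the standard exponential-backoff pattern the AWS page describes (the retry limit and base delay here are made-up numbers):

```python
import time

def retry_with_backoff(operation, max_retries=5, base_delay=0.1):
    """Retry `operation` with exponentially increasing waits.

    Sleeps base_delay, 2*base_delay, 4*base_delay, ... between attempts,
    re-raising the last error once max_retries is exhausted.
    """
    for attempt in range(max_retries):
        try:
            return operation()
        except OSError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```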

However, the socket timeout is set to the max timeout of SocketCloseThread, so it is likely that the socket gets closed anyways and needs to be reopened.

Again, I was attempting to keep the same logic as was there before. @xchoo or @studywolf will have to speak to why this happens.

how long the socket should be alive in an ideal world

I don't think there's a right answer here, different use cases will have different ideals. Leaving things in the hands of the user might work, but has similar technical difficulties as registering onclose events -- namely, the step function returned by make_step is not stored by the builder, and it's non-trivial to have it be stored and accessible.

tbekolay commented 6 years ago

The nengo_extras repo is designed for experimental (not necessarily fully baked) features. This PR has tests that are passing, which is better than 90% of software out there, so I'm not sure that we should block this, given that there are people (namely @studywolf) who want to use this in their models right now. We can always make a PR to modify this later. There are many things in this repository that haven't gotten the scrutiny that this has, so I'm not sure why this is being blocked when those have been merged without discussion?

xchoo commented 6 years ago

Just replying to some of @jgosmann comments.

The socket needs to be closed after usage, but the nengo.Process has no way of knowing when a simulation ended. Thus, the SocketCloseThread is created to shutdown the socket when there haven't been any calls to recv or send (which is equivalent to calls to the step function created by the Process) for some time.

Yes. Almost. The sending sockets only need to know when the local simulator has terminated. However, the receiving sockets need to know when the remote simulator has terminated. More on this later.

The SocketCloseThread might time out early and close the socket prematurely (either because a simulation time step took very long or the simulation was “interrupted” and continued with another call to sim.run(t). Thus, we reopen the socket in this case. But why is this only done for a receiving and not for a sending socket?

The reason why sending sockets don't need to be reopened is because they know when the data stream has ended (because the local simulator generates the data). However, receive sockets are blocking (i.e., they "hang" the socket thread until a packet is received). This being the case, some kind of logic has to be implemented to "figure out" when the remote side has stopped sending data. It is possible to have specialized "TERM" data packets, but, being UDP, there is no guarantee that any packets get to the destination. Then we'd have to do some kind of ACK/SYN logic which is a whole other can of worms (it'll be like implementing TCP in UDP)

If reopening a socket, it will be tried repeatedly until it succeeds with increasing wait times (“backoff”). It seems to me that with the proper shutdown using socket.shutdown before socket.close should be much less of a problem. Maybe that code isn't required anymore?

Iirc, the backoff logic was a "try X number of times" to make sure that the remote end is really terminated. The code is still required because the receiving sockets have no information about what the remote side is doing.

A timeout is set on the blocking operations of the socket. This timeout is not handled when sending data (why?). When receiving data, it will be tried to receive data again. So it seems that the timeout for the socket could be made infinite instead, except that the SocketCloseThread timeout needs to be reset to prevent the socket from being closed. However, the socket timeout is set to the max timeout of SocketCloseThread, so it is likely that the socket gets closed anyways and needs to be reopened.

The timeout is necessary on receiving sockets because they are blocking sockets. If no timeout is implemented, they block indefinitely (which is not ideal). Send sockets don't have timeouts because their primary function is to fire off a packet when given data to send.
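To illustrate the asymmetry: a receive with a timeout set raises socket.timeout when nothing arrives, while a UDP send just hands the datagram to the kernel and returns immediately. A minimal sketch:

```python
import socket

recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv_sock.bind(("127.0.0.1", 0))  # OS picks a free port
recv_sock.settimeout(0.05)        # recvfrom now blocks at most 50 ms

send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

try:
    data, addr = recv_sock.recvfrom(1024)  # nothing has been sent yet...
except socket.timeout:
    data = None                            # ...so this path is taken

# Sending never waits on a peer: it just queues the datagram and returns.
send_sock.sendto(b"hello", recv_sock.getsockname())
recv_sock.settimeout(1.0)  # generous timeout for the localhost round trip
reply, addr = recv_sock.recvfrom(1024)

send_sock.close()
recv_sock.close()
```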

xchoo commented 6 years ago

Imho, the PR seems to be a good rework of the socket code, but as with all code that interacts with hardware, I'd like to see it tested in a "live" environment before giving it the thumbs up. 😄 Tagging @studywolf because he's currently using the code in such an environment (and I'm currently too busy with other stuff to test it out for myself).

jgosmann commented 6 years ago

We can always make a PR to modify this later.

And it might never happen, given the effort required to understand the intent of the code in the first place, and even then it isn't entirely clear without input from the original authors. Considering the amount of time I am spending on this now, it is much more efficient to make those things clear in a timely manner than sometime later, when all of us have forgotten again how the parts of the code interact.

I am probably fine with keeping the timeout thread (once intended logic is made clear!) as there seems to be no other way in current Nengo releases to close the socket (and it might take some discussion to implement such a way).

xchoo commented 6 years ago

I am probably fine with keeping the timeout thread (once intended logic is made clear!) as there seems to be no other way in current Nengo releases to close the socket (and it might take some discussion to implement such a way).

Even if there was a way for nengo to close sockets, some kind of timeout logic is required because receiving sockets have no information about the state of the remote side. And it needs this information in order to tell the local simulator when or when not to shut itself down.

jgosmann commented 6 years ago

@xchoo first of all thank you for taking the time trying to answer my questions.

I now realized that there isn't a timeout on sending sockets (this is quite easy to miss; not sure if that can be made more obvious). But that leaves me with the question: how is the sending socket ever closed? This seems to only happen when the SocketStep gets garbage collected (due to Python's reference counting, this should happen when the Simulator object is deleted). That means the sending socket is probably kept around longer than needed, but it doesn't seem like a huge problem to me.

Iirc, the backoff logic was a "try X number of times" to make sure that the remote end is really terminated. The code is still required because the receiving sockets have no information about what the remote side is doing.

I am not sure how that would tell you anything about the remote side. Whether the receiving socket can be opened should only depend on the local system, shouldn't it?

The timeout is necessary on receiving sockets because they are blocking sockets. If no timeout is implemented, they block indefinitely (which is not ideal).

Would the recv call return if the socket is closed from another thread?

xchoo commented 6 years ago

how is the sending socket ever closed?

In the original code, the sending sockets are closed when the simulator is terminated (garbage collected i think), or when the close method is manually called. There are instances where this fails, which is why in the original code, when the open method is called, it attempts to close the original socket.

I am not sure how that would tell you anything about the remote side. Whether the receiving socket can be opened should only depend on the local system, shouldn't it?

Yes. Whether it can be opened should be dependent on the local system. Whether it should be reopened is dependent on the remote system.

Would the recv call return if the socket is closed from another thread?

I think so, but it doesn't really matter if it returns or not. If the terminate call is made, the socket will terminate.

tbekolay commented 6 years ago

@xchoo first of all thank you for taking the time trying to answer my questions.

Cool no problem dude. Seriously the refactoring took a lot of time and effort and I think it's a lot more understandable than before.

xchoo commented 6 years ago

Thanks @tbekolay! 😄

jgosmann commented 6 years ago

Let's also try again whether my adjusted mental model captures the intended logic:

  1. We don't really care about closing sockets as soon as they are not needed anymore (because Nengo currently has no good support of implementing something like that). They get closed (at latest) when the Simulator object gets deleted.
  2. The recv call is blocking, but the remote end might not provide any further data. Thus, the recv call must time out. If it times out, we assume the packet to be lost, keep the last received value, and continue listening in the next timestep.

This seems like much simpler logic than what is actually implemented. So what am I missing? Especially, this does not seem to require the whole SocketCloseThread.
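That simpler model would reduce the step function to something like this (a sketch; the comma-separated wire format and names are hypothetical, not the PR's actual code):

```python
import socket

def make_recv_step(sock, dims):
    """Step function per the simplified model above: try to receive,
    keep the last value on timeout. `sock` must already have a timeout set."""
    last_value = [0.0] * dims

    def step(t):
        try:
            data, _ = sock.recvfrom(4096)
            # Hypothetical wire format: comma-separated floats.
            last_value[:] = [float(x) for x in data.decode().split(",")]
        except socket.timeout:
            pass  # packet assumed lost; reuse last_value
        return list(last_value)

    return step
```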

xchoo commented 6 years ago

We don't really care about closing sockets as soon as they are not needed anymore (because Nengo currently has no good support of implementing something like that). They get closed (at latest) when the Simulator object gets deleted.

The socket should be opened when the simulation starts, and terminated when the simulation ends.

The recv call is blocking, but the remote end might not provide any further data. Thus, the recv call must time out. If it times out, we assume the package to be lost, keep the last received value, and continue listening in the next timestep.

Yes. But you'll have to deal with possibly multiple lost packets (once again, UDP makes no guarantee about packets actually getting to the destination), and possibly out-of-order packets (also not guaranteed by UDP). Also, when the recv socket stops receiving packets, you can take the "time out and wait for the next packet" approach, but this really slows down the local simulation: if the receive socket requires a timeout for the simulation to progress to the next timestep, then the simulation time for one dt becomes the socket timeout time (which could be as high as multiple seconds). That's why the timeout code is there: if the recv socket believes that the sending side has terminated, it closes the local socket and just uses the last known value to finish the rest of the simulation.

tbekolay commented 6 years ago

That's why the timeout code is there, if the recv socket believes that the sending side has terminated, it closes the local socket, and just uses the last known value to finish the rest of the simulation.

I'm not sure if that's what happens (or has happened in any version of the code) because we always keep trying; there's no code path in there for giving up on the recv_socket, though there of course could be.

xchoo commented 6 years ago

I'm not sure if that's what happens (or has happened in any version of the code) because we always keep trying; there's no code path in there for giving up on the recv_socket, though there of course could be.

Ohh. Is there not? There should be. I might have been thinking about local changes that I didn't push to the main repo. Whoopsy! Networking code is hard. 😆

jgosmann commented 6 years ago

The socket should be opened when the simulation starts, and terminated when the simulation ends.

As far as I understand the current implementation, this is only happening for the receiving socket (with the SocketCloseThread) but not the sending socket (no timeout, so no SocketCloseThread).

possible multiple lost packets

Yes, so we would time out in each time step, wouldn't we? Isn't that what the code is doing right now anyways?

possible out of order packets

I don't think we are really dealing with that; we just take the next packet with a timestamp equal to or larger than the current local timestep. (Or I guess it is a very simple way of dealing with it, and probably sufficient for now.)

but this really slows down the local simulation (if the receive socket requires a timeout for the simulation to progress to the next timestep, then the simulation time for 1 dt becomes the socket timeout time (which could be as high as multiple seconds)

As far as I can tell that is exactly the current implementation.

That's why the timeout code is there, if the recv socket believes that the sending side has terminated, it closes the local socket, and just uses the last known value to finish the rest of the simulation.

Nope, the socket gets reopened.

Ohh. Is there not? There should be. I might have been thinking about local changes that I didn't push to the main repo. Whoopsy! Networking code is hard. 😆

That's why I am being so annoying about this and really want to understand what the code is supposed to do and what it is actually doing. 😸

jgosmann commented 6 years ago

That's why the timeout code is there, if the recv socket believes that the sending side has terminated, it closes the local socket, and just uses the last known value to finish the rest of the simulation.

If that is the intent, it seems much easier to keep track of how many packages were lost (i.e. how many times recv timed out) and then shut down the socket within the SocketStep.recv method.
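i.e., roughly along these lines (a sketch with hypothetical names, not the PR's actual classes):

```python
import socket

class RecvWithGiveUp:
    """Track consecutive recv timeouts; shut the socket down after a limit."""

    def __init__(self, sock, max_timeouts=10):
        self.sock = sock
        self.max_timeouts = max_timeouts
        self.consecutive_timeouts = 0
        self.last_data = b""

    def recv(self):
        if self.sock is None:
            return self.last_data  # already gave up; reuse last value
        try:
            self.last_data, _ = self.sock.recvfrom(4096)
            self.consecutive_timeouts = 0  # got data; reset the counter
        except socket.timeout:
            self.consecutive_timeouts += 1
            if self.consecutive_timeouts >= self.max_timeouts:
                self.sock.close()  # remote assumed gone; stop waiting
                self.sock = None
        return self.last_data
```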

xchoo commented 6 years ago

As far as I understand the current implementation, this is only happening for the receiving socket (with the SocketCloseThread) but not the sending socket (no timeout, so no SocketCloseThread).

Both the recv and send sockets implement this, each in a different way. The SocketCloseThread is needed for the recv sockets because they are blocking. The send sockets do so implicitly, through the simulator closing or a manual close call.

Yes, so we would time out in each time step, wouldn't we? Isn't that what the code is doing right now anyways?

Yes. That's why the code is there.

I don't think we are really dealing with that, we just take the next package with a timestep equal or larger to the current local timestep. (Or I guess, it is a very simple way of dealing with it, and probably sufficient for now.)

Yes we do. That's what the receive buffer is for.

Nope, the socket gets reopened.

Yeah. That might be a bug in the implementation. I can't remember why I put in the code to attempt reopening the socket. I think it was one of the last things I put in.

xchoo commented 6 years ago

Yes we do. That's what the receive buffer is for.

@tbekolay I think this might have been taken out in the refactoring? It probably needs to be put back in. It shouldn't happen too often (out of order packets usually happen when there are multiple paths from the source to the destination), so it's not too urgent.

jgosmann commented 6 years ago

The SocketCloseThread is needed for the recv sockets because it is blocking.

It is only blocking for a limited time in each timestep due to the timeout. After the simulation ends, the recv method would not be called anymore, and thus the recv socket could be closed in the same way as the send socket.

I guess the only problem is that the sim.run can take a long time if we were to wait for the timeout in every time step, but as said before this could probably be better solved by keeping track of consecutive timeouts and closing the socket once a limit is exceeded instead of using an extra thread for that.

That's what the receive buffer is for.

That got removed by @tbekolay in 179559e9f73506f29cb31c870f687e3a7c161279. I think he made a comment about the buffer only ever containing a single element? But I cannot find it right now and don't know what it is based on.

xchoo commented 6 years ago

I guess the only problem is that the sim.run can take a long time if we were to wait for the timeout in every time step, but as said before this could probably be better solved by keeping track of consecutive timeouts and closing the socket once a limit is exceeded instead of using an extra thread for that.

One of the previous iterations of the socket code did exactly this. But it was replaced by the current implementation for reasons I can no longer remember. I'll have to sit down and reanalyze the code to figure it out. Like I said before, networking code is difficult, especially when you have to deal with real-world scenarios.

jgosmann commented 6 years ago

Note: If we reintroduce the buffer, we should consider heapq instead of queue.PriorityQueue.
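For reference, heapq keeps the earliest timestamp at the front without the per-operation locking that queue.PriorityQueue pays for (unnecessary in a single-threaded step function); a sketch of a timestamp-ordered buffer:

```python
import heapq

# Buffer ordered by timestamp; heapq keeps the smallest stamp at index 0.
buffer = []
heapq.heappush(buffer, (0.003, b"packet for t=0.003"))
heapq.heappush(buffer, (0.001, b"packet for t=0.001"))
heapq.heappush(buffer, (0.002, b"packet for t=0.002"))

stamp, payload = heapq.heappop(buffer)  # always pops the earliest packet
```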

tbekolay commented 6 years ago

I think this might have been taken out in the refactoring? It probably needs to be put back in. It shouldn't happen too often (out of order packets usually happen when there are multiple paths from the source to the destination), so it's not too urgent.

If you look at the old receiving block (which is where I started) buffer.put can only be called once in the step function, as the found_item will be set to True once something is buffered, meaning that the step function is done at that point. On the next step function invocation, the only way to buffer a second item is if the previously buffered item is from the past. But the buffered item is only ever buffered if it was future data anyway (due to this elif), so there was no possible way for the buffer to contain multiple items.

In the current version, I keep calling recv if we receive past data by virtue of only breaking out of the while True if the received packet is from the present or future. That value is only actually used if it's in the present, not the future. I'm not quite sure why we would reintroduce the buffer because the current implementation:

  1. will skip past past data if we're only getting past data (i.e., we're much faster than the remote)
  2. will queue up the next timestep's data if we're getting future data (i.e., we're much slower than the remote)

We're not currently doing anything to flush the socket's buffer in the case that we're much slower than the remote, but we weren't before either. We can add something for this if we think that's a big concern (but it's going to be hard to have two models stay synced if they execute at wildly different speeds).

Getting things out of order is hard to explicitly factor in. I believe that what we're currently doing is sufficient because if we get a future packet before a present one, we will end up dropping the present packet, which is in the spirit of UDP's non-guaranteed delivery. If we get a past packet before a present one, we'll ignore it and move on, so I don't think that the buffer gives us anything that we're not presently doing.
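The loop described above can be sketched as follows (hypothetical (stamp, value) tuples stand in for the actual packet format):

```python
def next_value(packets, t, dt, last_value):
    """Drain `packets` (an iterator of (stamp, value) tuples) following the
    logic above: skip past data, use present data, hold on to future data."""
    for stamp, value in packets:
        if stamp < t:
            continue            # past data: we're faster than the remote
        if stamp < t + dt:
            return value, None  # present data: use it this step
        return last_value, (stamp, value)  # future data: hold for later
    return last_value, None     # nothing usable: keep the last value
```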

tbekolay commented 6 years ago

To summarize the above discussion (as I see it), the only thing that we should definitely add to the present implementation is:

  1. Set a maximum number of retries for the receiving socket. If it reaches that maximum, we'll continue on with the last received value.

There are some things we could do, but I'm not convinced that they're for sure better:

  1. Add a shutdown thread for the sending socket so that we don't have to wait for the sim to be garbage collected.
  2. Don't try to reopen the receive socket at all.
  3. Simplify the timeout thread to not adapt the timeout value based on how reliable the socket is.
  4. Flush and queue the receiving socket's buffer if we're much slower than the remote.

I can do the "definitely add" part this afternoon. Or, if you want to start working on this @jgosmann let me know so we don't duplicate work.

FYI if anyone wants to get more context for this implementation, the history of it is still available and might help to see when things were added and why (@xchoo put notes in the commit messages).

xchoo commented 6 years ago

If you look at the old receiving block (which is where I started) buffer.put can only be called once in the step function, as the found_item will be set to True once something is buffered, meaning that the step function is done at that point. On the next step function invocation, the only way to buffer a second item is if the previously buffered item is from the past. But the buffered item is only ever buffered if it was future data anyway (due to this elif), so there was no possible way for the buffer to contain multiple items.

Ohhhhhhhhh. I changed the logic in the committed iteration of the code. Hah! Well, at some point in the development of this code, the receive logic was:

# Time stamp of first item in buffer is > t+dt (i.e. all
# items in the buffer are future packets). Assume packet
# for current time step has been lost, don't read the info,
# and wait for local sim time to catch up.

I probably removed it because it was slowing down the packet processing time, and I thought it best to just jump ahead to the future timestamp. In either case, if it works, 👍

jgosmann commented 6 years ago

FYI if anyone wants to get more context for this implementation, the history of it is still available and might help to see when things were added and why (@xchoo put notes in the commit messages).

Right! I looked through it and it was at least a little bit helpful. From that history (nengo/nengo@96a580040758317c808ff22f1b2f932a43e1b9c3) it seems to me that the SocketCloseThread was introduced to close the sockets after the end of the simulation (when they are not in use anymore). In the referenced commit, this happened for both sending and receiving sockets. As long as no one can come up with a valid reason for why this should be done only for the receiving socket or is needed for other reasons than shutting down the sockets, I think we should do one of the following:

  1. Remove the SocketCloseThread and related code. The sockets get shut down when the simulator object gets deleted. (That means the ports are blocked for other applications until that happens, but that doesn't seem like a major issue. If a second simulation is started in the process, the sockets can be reopened. That logic is currently implemented anyways. This is the solution I'm leaning towards.)
  2. Use a SocketCloseThread for all sockets (sending and receiving). Handle a premature socket close also in the send function. (This definitely makes the code a lot more complex and the benefit seems questionable to me.)

Set a maximum number of retries for the receiving socket. If it reaches that maximum, we'll continue on with the last received value.

I agree on this one. Should this be a fixed number of retries or should this be adaptive? (The SocketCloseThread has some adaptive logic, but it isn't clear at the moment whether that had any effect or was broken anyways.)

Seems like everyone is fine with not dealing with out of order packets except dropping packets with a timestamp from the past and only keeping one future packet?

Or, if you want to start working on this @jgosmann let me know so we don't duplicate work.

I'll have to see; it depends on the simulations I'm running. Let me know if you start on it first; I'll let you know if I manage to start on it first.

tbekolay commented 6 years ago

I'd favor 2, because you never know when garbage collection is going to happen... I'm thinking about using the UDPSocket in Nengo GUI, where you would likely have to change the port every time you restart the simulation.

1 might work if you do a combination of trying to reopen the socket at the start, and using the socket.SO_REUSEADDR flag on certain socket errors. I think this will end up being complicated, but perhaps not as complicated as 2.

Should this be a fixed number of retries or should this be adaptive?

I'd say fixed and specifiable by the user.
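A fixed, user-specifiable retry count could look roughly like this sketch (function and parameter names are hypothetical, not the PR's API): after `max_retries` timeouts, fall back to the last received value rather than blocking forever.

```python
import socket

def recv_with_retries(sock, last_value, max_retries=3, nbytes=1024):
    """Try to receive a packet; after ``max_retries`` timeouts, give up
    and return ``last_value`` so the simulation can continue."""
    for _ in range(max_retries):
        try:
            data, _addr = sock.recvfrom(nbytes)
            return data
        except socket.timeout:
            continue
    return last_value

# Usage: a socket with a short timeout that never receives anything,
# so the fallback value is returned after two timeouts.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(0.01)
sock.bind(("127.0.0.1", 0))
result = recv_with_retries(sock, last_value=b"stale", max_retries=2)
sock.close()
```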

Let me know if you start on it first, I'll let you if I manage to start on it first.

Yep, I'll let you know if I start on it.

jgosmann commented 6 years ago

I'd favor 2 because you never know when garbage collection is going to happen...

Python does reference counting. So you actually know when the simulator gets deleted, as long as there are no cyclic references.

I'm thinking about using the UDPSocket in Nengo GUI, you would likely have to change the port every time you restart the simulation.

socket.SO_REUSEADDR might not be necessary for UDP sockets

tbekolay commented 6 years ago

Alright, well I don't think I'll get to this today so try it out.

jgosmann commented 6 years ago

I did a few tests in Nengo GUI and apparently it keeps references to defined objects around even when it processes an updated source (maybe it shouldn't, but that's a separate issue). That gives us two options:

  1. Keep the SocketCloseThread for now (I still think that a solution to nengo/nengo#1118 would allow us to make the implementation much simpler and less confusing and hopefully we will be able to implement it that way someday.)
  2. Use socket.SO_REUSEADDR. This seems to work from my initial tests and would simplify the code a lot by removing the necessity of the SocketCloseThread. However, I'm not sure about all the implications. I suppose, if another program is already using the same port to listen for UDP packets we would get some interference without any error?

I might lean towards option 2 as neither solution is perfect (and I don't think it has to be a perfect solution for now as nengo_extras is somewhat experimental), but leaves us with simpler, more understandable code that makes it easier for us or other people to improve on it in the future.

As a side note: I finally understood why the SocketCloseThread is not required for the sending socket. When a socket is not bound to an address to listen on, it doesn't block any other sockets (UDP sockets can send without an established connection). However, the current implementation binds all sockets to an address (even the sending socket), which means we either should not do that or need to employ one of the above two options for the sending socket as well.
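The point about unbound senders can be illustrated with a small sketch: the sender below is never bound, so the OS assigns it an ephemeral source port on the first `sendto()` and it cannot block any other socket's address.

```python
import socket

# Receiver: bound to a known (OS-chosen) port so we can address it.
recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv_sock.bind(("127.0.0.1", 0))
recv_sock.settimeout(1.0)
addr = recv_sock.getsockname()

# Sender: never bound -- the OS picks an ephemeral source port on the
# first sendto(), so no explicit bind (and no address to free) is needed.
send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_sock.sendto(b"hello", addr)

data, _ = recv_sock.recvfrom(1024)
send_sock.close()
recv_sock.close()
```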

jgosmann commented 6 years ago

Starting on this now, going with the socket.SO_REUSEADDR route.

jgosmann commented 6 years ago

Just fyi: The tests are not failing for exceptions occurring in the simulator threads (these exceptions get hidden), and it seems this makes the tests pass even when they should fail, because neither simulator runs at all and we end up comparing zero-length vectors of data.

jgosmann commented 6 years ago

And this is one of the situations where I wish we weren't limited to Python 2.7. These sorts of tests are probably much easier to implement with futures.
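For comparison, here is roughly how futures surface the problem (a sketch using `concurrent.futures`, which is Python 3 stdlib; on 2.7 it would need the `futures` backport). An exception raised in a bare `threading.Thread` is lost, but `Future.result()` re-raises it in the caller, so the test would fail as it should.

```python
from concurrent.futures import ThreadPoolExecutor

def run_simulator():
    # Stand-in for running one of the two simulators in a worker thread;
    # with a bare threading.Thread this exception would be silently lost.
    raise RuntimeError("simulator failed")

with ThreadPoolExecutor(max_workers=2) as pool:
    future = pool.submit(run_simulator)
    try:
        future.result()  # re-raises the worker's exception here
        failed = False
    except RuntimeError:
        failed = True    # the test harness now sees the failure
```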

jgosmann commented 6 years ago

Is there a reason that the UDPSendReceiveSocket is required? Could one have a separate send and receive socket?

xchoo commented 6 years ago

Yes. Because the simulation can deadlock if you don't have a socket that does both receive and send.

jgosmann commented 6 years ago

What exactly leads to the deadlock? If both ends wait for data to receive before they send their data?

xchoo commented 6 years ago

Yes. There is no guarantee in the simulator as to which nodes run first. In the socket code, it's important to send first, then block on receives. If it blocks on receives then attempts to send, both sides can block indefinitely.
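The ordering described above can be sketched as a tiny step function (hypothetical helper, not the PR's API): each peer sends its value first and only then blocks on `recvfrom`, so two peers stepping concurrently always have a packet in flight for each other.

```python
import socket

def step(sock, peer_addr, value, nbytes=1024):
    """One simulation step: send our value first, then block on recv.

    If both peers blocked on recv before sending, neither would ever
    receive anything; sending first guarantees the peer's recv can
    complete.
    """
    sock.sendto(value, peer_addr)
    data, _ = sock.recvfrom(nbytes)
    return data

# Two peers on localhost exchanging one value each. For simplicity the
# exchange is sequential here: a sends first, then b runs a full step.
a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
a.bind(("127.0.0.1", 0))
b.bind(("127.0.0.1", 0))
a.settimeout(1.0)
b.settimeout(1.0)

a.sendto(b"from-a", b.getsockname())
got_a = step(b, a.getsockname(), b"from-b")  # b: send, then recv a's packet
got_b, _ = a.recvfrom(1024)                  # a's recv completes too
a.close()
b.close()
```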

jgosmann commented 6 years ago

But with UDP there is no guarantee that the sent packets actually arrive. So it could still deadlock if packets get dropped somewhere on the way, couldn't it? (Assuming recv has no timeout. But with a timeout a recv before a send would also not deadlock except for an initial wait on the first timeout.)

xchoo commented 6 years ago

But with UDP there is no guarantee that the sent packets actually arrive.

Yup. There is no guarantee. But if a packet is dropped, it won't lead to a deadlock scenario. The recv socket will just time out and continue on its merry way.

But with a timeout a recv before a send would also not deadlock except for an initial wait on the first timeout.

No. It will "deadlock" in the sense that because every packet it receives is timestamped in the past, every packet will be ignored, and it will time out for every time step of the simulation.
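The acceptance rule behind this behavior (as discussed earlier in this thread: the sender stamps packets with t + dt/2 and the receiver checks t >= T and t < T + dt) can be sketched like so; the function name is hypothetical, not the PR's API.

```python
def accept_packet(packet_t, t, dt):
    """Accept a packet only if its timestamp falls within the current
    timestep's window [t, t + dt); anything older is stale and ignored,
    which is why a desynchronized peer times out on every step."""
    return t <= packet_t < t + dt

dt = 0.001
stale = accept_packet(0.004, t=0.005, dt=dt)     # timestamped in the past
current = accept_packet(0.0055, t=0.005, dt=dt)  # t + dt/2, as the sender stamps it
```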

jgosmann commented 6 years ago

The recv socket will just timeout and continue on it's merry way.

That's why I said assuming no timeout.

Anyways, I came to the conclusion that the second deadlock case might be what should happen if one were to create both receive sockets without a timeout (probably not a good idea?).

With a timeout it is fine, and timeouts will be counted as lost packets.

No. It will "deadlock" in the sense that because every packet it receives is timestamped in the past, every packet will be ignored, and it will time out for every time step of the simulation.

You're right. 👍

xchoo commented 6 years ago

I haven't looked at the code in depth, but I have noticed that most of the timeout code has been removed. I can tell you right now that it will not work when you try to get it synced up with hardware.

Consider this scenario:

In the old timeout code, you can set two timeout values: one that determines how long the code waits to initialize the connection, and one that is the per-packet timeout. In the updated code, there is only one timeout value, the per-packet timeout. The two timeouts should not be the same, because:

I hate to say this, but with the changes I see, it looks like the code has been reverted to the initial (base) version of the socket code I wrote (updated to use processes), from before all the additions that made it work with differing hardware and socket scenarios (essentially, months of debugging and testing have been reverted). In its current state, I'd say a lot of testing with many different setups is needed before I would even consider approving the PR.

jgosmann commented 6 years ago

Well, I can add in a separate (longer) timeout for the first packet to cover that case.
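A two-tier timeout along those lines could be sketched as follows (names and structure are hypothetical, not a proposal for the PR's API): the first receive waits a long "connect" timeout because the remote side may still be starting up, and every subsequent receive uses the short per-packet timeout.

```python
import socket

def make_recv(sock, connect_timeout=10.0, packet_timeout=0.05):
    """Return a recv function with a longer timeout for the first packet
    and a short per-packet timeout afterwards. Returns None on timeout."""
    state = {"first": True}

    def recv(nbytes=1024):
        sock.settimeout(connect_timeout if state["first"] else packet_timeout)
        try:
            data, _ = sock.recvfrom(nbytes)
        except socket.timeout:
            return None
        state["first"] = False  # connection established; use short timeout
        return data

    return recv

# Usage: send ourselves one packet; the first recv succeeds within the
# long timeout, the second finds nothing and times out quickly.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 0))
recv = make_recv(sock, connect_timeout=1.0, packet_timeout=0.01)
sock.sendto(b"first", sock.getsockname())
first = recv()
second = recv()
sock.close()
```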