This has happened to me as well on a few occasions.
Volunteers to fix it?
Happy to take a stab at it.
I've seen this too. I vaguely recall it may have something to do with how a federation is shut down...
Since repeatedly triggering the entire test suite is too expensive, I have isolated DistributedCount.lf along with a few cases before it (Determinism.lf, DeterminismThreaded.lf) and after it (DistributedCountPhysical.lf).
Here is the CI build: https://travis-ci.com/github/icyphy/lingua-franca/builds/177470082
I have triggered the build dozens of times and have not been able to reproduce the error yet, but I will keep testing to see if we can catch this bug.
If anyone encounters this issue again, kindly save the output from the CI server. Thanks!
Isn’t the output of all runs saved automatically?
@lhstrh If the build is restarted, the output of the previous run is overwritten. We currently don't have the output of the failed run so I hope to reproduce it.
Oh, I see. Another way of doing this is to write a simple script that runs this test a large number of times and pipes the output into a file. If there are any failures, you'll see them in the file.
This seems to be a better way to go about it. I will adopt this strategy on the CI server.
Tested locally on macOS Catalina 10.15.4 for 1000 runs without error. This bug might be platform-dependent.
This appears to be a platform-dependent issue tied to the way we deal with sockets. Occasionally, an EOF is received from a previous run, or the connection is ended by one of the peers. Adding a sleep(1) at the end of wait_for_federates() has been shown to prevent the latter in over 200 runs, and I expect adding a sleep in some other location is likely to fix the former. I'm not suggesting this as a solution, but as a confirmation of the diagnosis. While it's clear that this is an issue to begin with because we reuse the same socket, I think the problem might be that we're simply not waiting for socket closures to conclude properly. Here's a discussion of the state machine that underpins the protocol.
More generally, it occurs to me that the current implementation won't work on any platform if we want to allow multiple federations to execute simultaneously if their RTI is hosted on the same machine (which seems like it would be a common use case on an edge device or server). Maybe we should first evaluate the design before attempting to fix this bug...
Wouldn’t multiple federations on the same machine have to use different ports?
That depends. In a realistic setting (which may require opening port access in firewalls, etc.) you don't want to frivolously allocate ports. Besides, even if we take that approach, how do you determine which port to use? And how do you prevent two federations from attempting to use the same port? It seems to me that these resource allocation issues need to be solved at runtime. One option would be to have a daemon that listens on a predetermined port and forwards packets to the appropriate end points. If I want to run two ssh sessions with the same host, I also connect to port 22 for both sessions...
This could be solved by having a smarter, more universal RTI that could support multiple federations at once, perhaps.
Yes. And there are several existing solutions that could perform this role, pretty much out-of-the-box. Using a ROS Master and its communication fabric would be one of them.
The ROS communication fabric is pub/sub. We are on record claiming this is no good. You could overlay on it something that realizes reactor semantics, but I suspect the cost would be substantial.
We're also on record claiming that TCP/IP is no good; we need a coordination layer on top regardless.
I've asked @Soroosh129 about the overhead of ROS (compared to plain TCP/IP), and based on what he had seen, he said the overhead was on the order of 10%. This sounds promising, but we should run some benchmarks. Our own implementation is brittle and doesn't scale due to the fiddling with ports (see discussion above), and any future-proof solution I can think of will take significant time to develop and already exists in numerous variants in other frameworks. We should really think twice about sinking a lot of time into writing a middleware ourselves...
Note that my suggestion would not be to bet on one horse, but to offer support for different messaging agents (see #214). This would give us an entry point into legacy systems. Insisting on our own implementation could isolate us. I guess the question is whether we want to aim for adoption and community engagement or merely build a proof-of-concept.
ROS is not portable. It runs on one version of Linux. Only. Also, realizing our semantics on top of a pub/sub fabric will likely be quite difficult. At least with a TCP/IP socket, you can assume reliable, in-order delivery on each socket. The RTI currently depends on this. Pub/sub offers no such assurance.
I'm aware of the shortcomings of ROS. The question is whether we want to fix the problems related to ports in our own RTI implementation, or use a messaging fabric that has already solved these problems for us.
The RTI currently only supports centralized coordination; I'm more interested in distributed coordination, which will not rely on any message ordering guarantees at all. Hence, I think it would be a reasonable approach to try to implement distributed execution using something like ROS (there are plenty of alternatives if we decide that we don't like ROS), and make the architecture sufficiently flexible so that we can plug in other middlewares as well, possibly including our own.
While it may be fruitful to explore solutions that use other communication fabrics, here's a possible solution for patching the existing RTI implementation that @edwardalee and I just came up with.
A daemon -- let's call it lfd for the time being -- is assumed to run on each host that we wish to map federates onto. It runs at a globally agreed upon port. lfd maintains a pool of reserved ports, and a local RTI or federate can request a port from that pool by registering. After an RTI is registered and has started listening at the port that lfd told it to use, lfd lets (remote) federates discover that port. To ensure that federates discover the port handed out to the particular RTI belonging to the same federation instance, the script that initiates the execution has to create a unique key and pass it to each participant in the federation to be used during the registration and query process.
Concretely, the script that initiates the execution:
- starts the RTI on host H (starting lfd on H if it isn't running already); the RTI registers with the lfd running on H, passes it K (passed as command line argument), and retrieves the port P it should listen on;
- starts each federate on its host Fx (starting lfd on Fx if it isn't running already); each federate queries the lfd running on H, passes it K (passed as command line argument), and retrieves P; it also registers with the lfd running on Fx and starts listening on obtained port Q;
- the federates then query the lfd on their respective hosts to discover each other's ports and set up direct communication channels.

Note that the mentioned key is only meant to identify federation instances (i.e., it is a session key), and is not meant for authentication or encryption. If we wish to authenticate end points (to prevent spoofing) and encrypt communication, we could generate a separate symmetric encryption key for that.
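To make the register/query exchange a bit more concrete, here is a rough client-side sketch. Neither the names (LFD_PORT, lfd_request_port) nor the wire format exist anywhere yet; they are made up purely to illustrate the idea, since lfd itself does not exist.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define LFD_PORT 15045  // Globally agreed upon port (placeholder value).

// Send the session key K to the lfd on the given host and receive the port
// that was either reserved (registration) or previously handed out (query).
// Returns the port, or -1 on failure.
static int lfd_request_port(const char* host, const char* key) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(LFD_PORT);
    if (inet_pton(AF_INET, host, &addr.sin_addr) != 1 ||
        connect(s, (struct sockaddr*)&addr, sizeof(addr)) < 0) {
        close(s);
        return -1;
    }

    // Send the session key, then read back a 16-bit port number.
    if (write(s, key, strlen(key)) < 0) { close(s); return -1; }
    uint16_t port_net;
    ssize_t n = read(s, &port_net, sizeof(port_net));
    close(s);
    return (n == (ssize_t)sizeof(port_net)) ? ntohs(port_net) : -1;
}
```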
You wouldn’t need a daemon for this. You could have a .lfhosts file on each host that contains a <key, port> pair. Everybody can read that file anywhere.
The list of ports can also be in that file with a <none, port> listing.
This wouldn't solve the problem. The issue is that it is not statically known which instances are running and which ports are available. We need a process that can keep track of active federations, used ports, and unused ports. A file is not sufficient to do this.
If the key is uniquely generated at execution time, each active federation will have a unique key. Each federation could remove its entry from the file once it exits (it would need to inform lfd anyway).
The only downside to having a file instead of a daemon that I could see is that there could be a race condition on the file and two unique keys might end up with the same port. A daemon can potentially serialize requests if it is carefully implemented.
It's not just the key that needs to be generated. The key merely ties a port to a session. It's also the port that needs to be allocated. How is it picked? And how are federates made aware of the port at which to reach the RTI (or each other)?
Imagine a file with the following abstract structure:
federate_port_range = 5000 - 5100
key1 : 5000
key2 : 5001
Whenever a new federation is launched, it would need to scan the file and add an entry to that file for each host. Remote federates would then only need to read but not write to the .lfhosts file of others. Once the federation is done, it would remove that entry.
The benefit of this method is that port 22 is always available, whereas a universally agreed upon port for lfd could still be occupied.
Another benefit of this approach is that an extra daemon process does not have to be active.
The downside is the aforementioned potential race condition.
I agree that using SSH has an advantage in the sense that we can use a universally available port. If we can find an elegant way to avoid the race condition this approach might be preferable. Perhaps a file-based database (SQLite) would do.
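If we go the file route, the race condition could probably be avoided with an advisory lock around the read-modify-write of .lfhosts, assuming every participant cooperates. A minimal sketch (the helper name lock_lfhosts is made up for illustration):

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

// Open .lfhosts, take an exclusive advisory lock, and return the descriptor.
// The caller scans the file, appends its <key, port> entry, and then releases
// the lock with flock(fd, LOCK_UN) before closing the descriptor.
static int lock_lfhosts(const char* path) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return -1;
    if (flock(fd, LOCK_EX) < 0) {  // Blocks until no one else holds the lock.
        close(fd);
        return -1;
    }
    return fd;
}
```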
This discussion sounds a lot like reinventing the wheel. Maybe ROS is not a good choice, but there are many more middlewares out there. There are two problems that need to be solved: transporting messages between federates (including discovering endpoints and allocating ports), and coordinating the advancement of logical time across federates.
I believe these are orthogonal problems. We can build a coordination protocol on top of an existing middleware. For instance SOME/IP could fit our requirements (This is the middleware that was used in my AUTOSAR work). We could simply generate a SOME/IP configuration from the LF program and configure the SOME/IP daemons on each host accordingly. I don't see value in creating a daemon of our own, unless that daemon solves both problems at once and performs better than a coordination layer on top of the middleware.
These two are not actually orthogonal. At least in the centralized coordination, the ordering of timestamped messages and their passage through the RTI is tied directly into the advancement of time for each federate. The code depends heavily on the assurance that a TCP/IP socket preserves ordering and ensures reliable delivery.
I'm certainly open to using some other middleware, as long as the following conditions are satisfied:
The new port allocation scheme addresses this issue. I have not seen failures related to this problem lately.
I just witnessed a nondeterministic result for the C test DistributedCount.lf. It failed in a Travis build for the pull request for my ProtoBuffTS branch, but then I restarted the build without changing anything and it passed.