lf-lang / lingua-franca

Intuitive concurrent programming in any language
https://www.lf-lang.org

DistributedCount.lf fails nondeterministically #198

Closed MattEWeber closed 4 years ago

MattEWeber commented 4 years ago

I just witnessed a nondeterministic result for the C test DistributedCount.lf. It failed in a Travis build for the pull request for my ProtoBuffTS branch, but then I restarted the build without changing anything and it passed.

Soroosh129 commented 4 years ago

This has happened to me as well on a few occasions.

lhstrh commented 4 years ago

Volunteers to fix it?

lsk567 commented 4 years ago

Happy to take a stab at it.

edwardalee commented 4 years ago

I've seen this too. I vaguely recall it may have something to do with how a federation is shut down...

lsk567 commented 4 years ago

Since repeatedly triggering the entire test suite is too expensive, I have isolated DistributedCount.lf along with a few tests that run before it (Determinism.lf, DeterminismThreaded.lf) and after it (DistributedCountPhysical.lf).

Here is the CI build: https://travis-ci.com/github/icyphy/lingua-franca/builds/177470082

I have triggered the build dozens of times and have not been able to reproduce the error yet, but I will keep testing to see if we can catch this bug.

If anyone encounters this issue again, kindly save the output from the CI server. Thanks!

lhstrh commented 4 years ago

Isn’t the output of all runs saved automatically?


lsk567 commented 4 years ago

@lhstrh If the build is restarted, the output of the previous run is overwritten. We currently don't have the output of the failed run so I hope to reproduce it.

lhstrh commented 4 years ago

Oh, I see. Another way of doing this is to write a simple script that runs this test a large number of times and pipes the output into a file. If there are any failures, you'll see them in the file.

lsk567 commented 4 years ago

Oh, I see. Another way of doing this is to write a simple script that runs this test a large number of times and pipes the output into a file. If there are any failures, you'll see them in the file.

This seems to be a better way to go about it. I will adopt this strategy on the CI server.

lsk567 commented 4 years ago

Tested locally on macOS Catalina 10.15.4 for 1000 runs without error. This bug might be platform-dependent.

lhstrh commented 4 years ago

This appears to be a platform-dependent issue tied to the way we deal with sockets. Occasionally, an EOF is received from a previous run, or the connection is ended by one of the peers. Adding a sleep(1) at the end of wait_for_federates() has been shown to prevent the latter in over 200 runs, and I expect adding a sleep in some other location is likely to fix the former. I'm not suggesting this as a solution, but as a confirmation of the diagnosis. While it's clear that this is an issue to begin with because we reuse the same socket, I think the problem might be that we're simply not waiting for socket closures to conclude properly. There is a well-documented state machine that underpins the connection-termination protocol.
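For illustration, here is a minimal sketch of what an orderly teardown could look like instead of relying on an arbitrary sleep. The function name and its surroundings are hypothetical, not the actual RTI code.

#include <unistd.h>
#include <sys/socket.h>

/* Hypothetical sketch: shut down a federate connection in an orderly way
 * so that the close completes before the process exits. */
void close_federate_socket(int socket_id) {
    // Announce that no more data will be sent; the peer reads an EOF.
    shutdown(socket_id, SHUT_WR);
    // Drain whatever the peer still sends until it closes its end, so the
    // connection terminates with a proper FIN/FIN exchange rather than a
    // reset that a subsequent run might observe.
    char buffer[256];
    while (read(socket_id, buffer, sizeof(buffer)) > 0) {
        // Discard remaining bytes.
    }
    close(socket_id);
}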

More generally, it occurs to me that the current implementation won't work on any platform if we want to allow multiple federations to execute simultaneously with their RTIs hosted on the same machine (which seems like it would be a common use case on an edge device or server). Maybe we should first evaluate the design before attempting to fix this bug...

edwardalee commented 4 years ago

Wouldn't multiple federations on the same machine have to use different ports?


lhstrh commented 4 years ago

That depends. In a realistic setting (which may require opening port access in firewalls, etc.) you don't want to frivolously allocate ports. Besides, even if we take that approach, how do you determine which port to use? And how do you prevent two federations from attempting to use the same port? It seems to me that these resource allocation issues need to be solved at runtime. One option would be to have a daemon that listens on a predetermined port and forwards packets to the appropriate endpoints. If I want to run two ssh sessions with the same host, I also connect to port 22 for both sessions...

edwardalee commented 4 years ago

This could be solved by having a smarter, more universal RTI that could support multiple federations at once, perhaps.

lhstrh commented 4 years ago

Yes. And there are several existing solutions that could perform this role pretty much out of the box. Using a ROS Master and its communication fabric would be one of them.

edwardalee commented 4 years ago

The ROS communication fabric is pub/sub. We are on record claiming this is no good. You could overlay on it something that realizes reactor semantics, but I suspect the cost would be substantial.

lhstrh commented 4 years ago

We're also on record claiming that TCP/IP is no good; we need a coordination layer on top regardless.

I've asked @Soroosh129 about the overhead of ROS (compared to plain TCP/IP) and, based on what he had seen, he said the overhead was on the order of 10%. This sounds promising, but we should run some benchmarks. Our own implementation is brittle and doesn't scale due to the fiddling with ports (see discussion above), and any future-proof solution I can think of will take significant time to develop and already exists in numerous variants in other frameworks. We should really think twice about sinking a lot of time into writing a middleware ourselves...

Note that my suggestion would not be to bet on one horse, but to offer support for different messaging agents (see #214). This would give us an entry point into legacy systems. Insisting on our own implementation could isolate us. I guess the question is whether we want to aim for adoption and community engagement or merely build a proof-of-concept.

edwardalee commented 4 years ago

ROS is not portable. It runs on one version of Linux. Only. Also, realizing our semantics on top of a pub/sub fabric will likely be quite difficult. At least with a TCP/IP socket, you can assume reliable, in order delivery on each socket. The RTI currently depends on this. Pub/sub offers no such assurance.

lhstrh commented 4 years ago

I'm aware of the shortcomings of ROS. The question is whether we want to fix the problems related to ports in our own RTI implementation, or use a messaging fabric that has already solved these problems for us.

The RTI currently only supports centralized coordination; I'm more interested in distributed coordination, which will not rely on any message ordering guarantees at all. Hence, I think it would be a reasonable approach to try to implement distributed execution using something like ROS (there are plenty of alternatives if we decide that we don't like ROS), and make the architecture sufficiently flexible so that we can plug in other middlewares as well, possibly including our own.

lhstrh commented 4 years ago

While it may be fruitful to explore solutions that use other communication fabrics, here's a possible solution for patching the existing RTI implementation that @edwardalee and I just came up with.

A daemon -- let's call it lfd for the time being -- is assumed to run on each host that we wish to map federates onto. It runs at a globally agreed-upon port. lfd maintains a pool of reserved ports, and a local RTI or federate can request a port from that pool by registering. After an RTI has registered and has started listening at the port that lfd told it to use, lfd lets (remote) federates discover that port. To ensure that federates discover the port handed out to the particular RTI belonging to the same federation instance, the script that initiates the execution has to create a unique key and pass it to each participant in the federation, to be used during the registration and query process. A sketch of the messages involved follows the example run below.

Example Execution Run

  1. Generate unique key K.
  2. Ssh into the host H that will run the RTI. Start the RTI (and start lfd if it isn't running already), which:
    • Registers using K (passed as command line argument) with lfd running on H
    • Starts listening on obtained port P
  3. Centralized coordination: for each federate x, ssh into host Fx and start the federate, which:
    • Contacts lfd running on H, passes it K (passed as a command line argument), and retrieves P
    • Starts interacting with the RTI as it normally would
  4. Distributed coordination: for each federate x, ssh into host Fx and start the federate (and start lfd if it isn't running already), which:
    • Contacts lfd running on H, passes it K (passed as a command line argument), and retrieves P
    • Registers the federate with lfd running on Fx and starts listening on the obtained port Q
    • Discovers the ports of other federates by querying lfd on their respective hosts and sets up direct communication channels
    • Coordinates the start of execution with the RTI and continues by exchanging messages directly with the other federates
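To make the registration and query steps concrete, here is a minimal sketch of the request/response messages that lfd could exchange with an RTI or federate. All of the names, sizes, and the port number are assumptions; none of this exists in the code base.

#include <stdint.h>

#define LFD_PORT 15045          // assumed globally agreed-upon port for lfd
#define LFD_KEY_LENGTH 32       // assumed length of the session key K

typedef enum {
    LFD_REGISTER_RTI,        // RTI presents K and is assigned a port P from the pool
    LFD_QUERY_RTI_PORT,      // federate presents K and retrieves P
    LFD_REGISTER_FEDERATE,   // federate registers and is assigned a port Q (distributed coordination)
    LFD_QUERY_FEDERATE_PORT  // peer looks up another federate's port Q
} lfd_request_type_t;

typedef struct {
    lfd_request_type_t type;
    unsigned char key[LFD_KEY_LENGTH];  // session key K identifying the federation instance
    uint16_t federate_id;               // used for federate registration and lookups
} lfd_request_t;

typedef struct {
    int32_t error;  // 0 on success; nonzero if the key is unknown or the port pool is exhausted
    uint16_t port;  // the allocated or looked-up port (P or Q)
} lfd_response_t;
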
lhstrh commented 4 years ago

Note that the mentioned key is only meant to identify federation instances (i.e., it is a session key), and is not meant for authentication or encryption. If we wish to authenticate end points (to prevent spoofing) and encrypt communication, we could generate a separate symmetric encryption key for that.
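As an aside, generating the key K could be as simple as hex-encoding a handful of random bytes. The following is a hypothetical sketch; the key length and the source of randomness are assumptions.

#include <stdio.h>

/* Hypothetical sketch: write a session key of 2 * n_bytes hexadecimal
 * characters into out, which must hold 2 * n_bytes + 1 characters. */
int generate_session_key(char* out, int n_bytes) {
    unsigned char raw[64];
    if (n_bytes <= 0 || n_bytes > 64) return -1;
    FILE* urandom = fopen("/dev/urandom", "rb");
    if (urandom == NULL) return -1;
    size_t got = fread(raw, 1, (size_t)n_bytes, urandom);
    fclose(urandom);
    if (got != (size_t)n_bytes) return -1;
    for (int i = 0; i < n_bytes; i++) {
        // Two hexadecimal characters per random byte.
        sprintf(out + 2 * i, "%02x", raw[i]);
    }
    return 0;
}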

Soroosh129 commented 4 years ago

You wouldn't need a daemon for this. You could have a .lfhosts file on each host that contains <key, port> pairs. Everybody can read that file from anywhere. The list of available ports can also be in that file, as a <none, port> listing.

lhstrh commented 4 years ago

This wouldn't solve the problem. The issue is that it is not statically known which instances are running and which ports are available. We need a process that can keep track of active federations, used ports, and unused ports. A file is not sufficient to do this.

Soroosh129 commented 4 years ago

If the key is uniquely generated at execution time, each active federation will have a unique key. Each federation could remove its entry from the file once it exits (it would need to inform lfd anyway).

Soroosh129 commented 4 years ago

The only downside I can see to having a file instead of a daemon is that there could be a race condition on the file, and two unique keys might end up with the same port. A daemon can potentially serialize requests if it is carefully implemented.

lhstrh commented 4 years ago

It's not just the key that needs to be generated. The key merely ties a port to a session; the port itself also needs to be allocated. How is it picked? And how are federates made aware of the port at which to reach the RTI (or each other)?

Soroosh129 commented 4 years ago

Imagine a file with the following abstract structure:

federate_port_range = 5000 - 5100
key1 : 5000
key2 : 5001

Whenever a new federation is launched, it would need to scan the file and add an entry to that file for each host. Remote federates would then only need to read, but not write to, the .lfhosts file of others. Once the federation is done, it would remove that entry.

The benefit of this method is that port 22 is always available, whereas a universally agreed-upon port for lfd could still be occupied. Another benefit of this approach is that an extra daemon process does not have to be active.

The downside is the aforementioned potential race condition.

lhstrh commented 4 years ago

I agree that using SSH has an advantage in the sense that we can use a universally available port. If we can find an elegant way to avoid the race condition, this approach might be preferable. Perhaps a file-based database (SQLite) would do.
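To illustrate how the race could be avoided even without a database, here is a hypothetical sketch that claims a port from a .lfhosts-style file (following the abstract structure proposed above) while holding an advisory lock. The function name and file format details are assumptions.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>

/* Hypothetical sketch: atomically claim the first free port in the range
 * declared in the .lfhosts file and record it under the given session key.
 * Returns the claimed port, or -1 on failure. */
int claim_port(const char* lfhosts_path, const char* session_key) {
    int fd = open(lfhosts_path, O_RDWR);
    if (fd < 0) return -1;
    // Take an exclusive advisory lock so that concurrent claims on this
    // host are serialized for the duration of the read-and-append below.
    if (flock(fd, LOCK_EX) != 0) { close(fd); return -1; }
    FILE* f = fdopen(fd, "r+");
    int range_start = 0, range_end = -1, p, n_used = 0;
    int used[1024];
    char key[64], line[256];
    while (fgets(line, sizeof(line), f) != NULL) {
        if (sscanf(line, "federate_port_range = %d - %d", &range_start, &range_end) == 2) continue;
        if (n_used < 1024 && sscanf(line, "%63s : %d", key, &p) == 2) used[n_used++] = p;
    }
    // Pick the first port in the range that no other federation has claimed.
    for (p = range_start; p <= range_end; p++) {
        int taken = 0;
        for (int i = 0; i < n_used; i++) if (used[i] == p) taken = 1;
        if (!taken) break;
    }
    if (range_end < 0 || p > range_end) { flock(fd, LOCK_UN); fclose(f); return -1; }
    // Record the claim for this session key while still holding the lock.
    fseek(f, 0, SEEK_END);
    fprintf(f, "%s : %d\n", session_key, p);
    fflush(f);
    flock(fd, LOCK_UN);
    fclose(f);  // also closes the underlying descriptor
    return p;
}

Once the federation exits, the corresponding line would be removed under the same lock.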

cmnrd commented 4 years ago

This discussion sounds a lot like reinventing the wheel. Maybe ROS is not a good choice, but there are many more middlewares out there. There are two problems that need to be solved:

  1. Provide a mechanism for communicating between federates and/or RTI
  2. Coordinate the execution across federates

I believe these are orthogonal problems. We can build a coordination protocol on top of an existing middleware. For instance, SOME/IP could fit our requirements (this is the middleware that was used in my AUTOSAR work). We could simply generate a SOME/IP configuration from the LF program and configure the SOME/IP daemons on each host accordingly. I don't see value in creating a daemon of our own, unless that daemon solves both problems at once and performs better than a coordination layer on top of the middleware.

edwardalee commented 4 years ago

These two are not actually orthogonal. At least in the centralized coordination, the ordering of timestamped messages and their passage through the RTI is tied directly into the advancement of time for each federate. The code depends heavily on the assurance that a TCP/IP socket preserves ordering and ensures reliable delivery.

I'm certainly open to using some other middleware, as long as the following conditions are satisfied:

  1. It is portable. Runs on any POSIX-compliant system, or, even better, on any system with a TCP/IP stack.
  2. It enables creation of channels that look like TCP/IP sockets (reliable, in-order delivery).
  3. It supports encrypted communication (I think symmetric keys will be sufficient until we allow federates to join a federation at run time).
  4. It is reasonably lightweight.
lhstrh commented 4 years ago

The new port allocation scheme addresses this issue. I have not seen failures related to this problem lately.