nanomsg / nng

nanomsg-next-generation -- light-weight brokerless messaging
https://nng.nanomsg.org
MIT License
3.84k stars 491 forks source link

Reproducing IPC transport unit tests causing INFINITE lock wait #117

Closed mwpowellhtx closed 7 years ago

mwpowellhtx commented 7 years ago

Couched in my understanding that the API isn't quite ready yet. This may be one of those areas, I don't know.

Whether following the C-style calls or the C++ wrapper, it does not seem to matter. Something in the way the testing is initialized "under the hood" is causing the tests to hang INFINITE-ly.

When there is a valid listen/dial pair in effect, the dial operation hangs INFINITE-ly. Debugging gets as far as nni_ep_dial(...) and the subsequent nni_aoi_wait(...). Which then waits with INFINITE options on the call to SleepConditionVariableSRW.

// Gets as far as here...
int nni_ep_dial(nni_ep *ep, int flags)
// Then blocks indefinitely on the (synchronous?) AOI to return.
nni_aio_wait(aio);
// Ultimately waiting on a condition variable that is never cleared, apparently.
// And with "INFINITE" timeout? not otherwise queued on the socket options? i.e. timeouts?
void
nni_plat_cv_wait(nni_plat_cv *cv)
{
  (void) SleepConditionVariableSRW(&cv->cv, cv->srl, INFINITE, 0);
}

TCP works great. So does INPROC. IPC, on the other hand, is problematic under Windows (i.e. Windows 7 x64).

The only difference I can determine at this point is in how the NNG IPC transport tests are initialized.

void
trantest_init(trantest *tt, const char *addr)
{
  trantest_next_address(tt->addr, addr);
  So(nng_req_open(&tt->reqsock) == 0); // These look innocent enough.
  So(nng_rep_open(&tt->repsock) == 0);

  tt->tran = nni_tran_find(addr); // This however I find suspicious.
  So(tt->tran != NULL); // And/or perhaps something in the way that the project is build, flags, etc.
}
mwpowellhtx commented 7 years ago

Could be completely unrelated. I was also noticing that not all the transport handlers were populated a couple of weeks ago when I considered whether to build NNG under "C++". Which revealed a couple of missing function pointers/handlers. Perhaps that's by design.

gdamore commented 7 years ago

The difference in initialization entry points is due to some of the entry points being optional.

I've used IPC on Windows, and we have tests that run against it at AppVeyor -- so I'm pretty sure that what is happening is a misunderstanding.

If you pass no flags to nni_ep_dial(), then it blocks forever until the connection is established. If no server is available or responding, then that means it will block forever. (Note that if a server is not present at all, it should get NNG_ECONNREFUSED. If the server is not accepting connections, because say it is stopped in a debugger, then the client will hang here.)

When I finish the aio based APIs, there will be a callback based asynchronous version of this available. In the mean time if you don't want to block forever pass NNG_FLAG_NONBLOCK -- in that case we start the operation and complete it later, asynchronously. Unfortunately there is no callback to the user that this has occurred.

mwpowellhtx commented 7 years ago

Of course, couched in, "work in progress." Of course, they "magically" work from the C code, which leads me to believe a backdoor approach was taken. Because modeling after the unit tests themselves, it blocks.

void
trantest_listen_accept(trantest *tt)
{
    Convey("Listen and accept", {
        nng_listener l;
        nng_dialer   d;
        So(nng_listen(tt->repsock, tt->addr, &l, 0) == 0);
        So(l != 0);

        d = 0;
        So(nng_dial(tt->reqsock, tt->addr, &d, 0) == 0);
        //                                    ^^^
        So(d != 0);
    })
}

Which , as I said, the only different that I could determine was in the initialization.

    tt->tran = nni_tran_find(addr);
    So(tt->tran != NULL);

So, you're saying pass the flags in and that should "fix" it?

Or have I missed something?

gdamore commented 7 years ago

If you are doing exactly the same thing that the C code is doing, then you're almost certainly missing something somewhere else (environmental). Perhaps you have another instance of the tests running somewhere, using the same URL?

If you listen, and that listen succeeds, then the dial should just work. The exception here being something weird around the details of the named pipes themselves, like an application stalling out or someone deleting and recreating the pipe or some such.

mwpowellhtx commented 7 years ago

@gdamore No sir. That is the only instance of it running. My Url's are incrementing with a similar strategy that you employed with the transport test features, getting the next or previous port, for instance.

I haven't checked, but aren't pipes a Windows "feature" that needs to be installed in control panels? If memory serves, which it may not. I haven't checked that, however.

With that said, I am running Windows 7 Pro x64.

gdamore commented 7 years ago

It is possible that I'm doing something that works on Windows 8.1 and 10, but does not work on Windows 7 or works differently. I don't think this is the case, but I could easily be mistaken.

What are you passing to the nng_listen and nng_dial calls? Have you tried putting break points in the debugger and seeing what happens?

gdamore commented 7 years ago

You can also definitely ditch the whole nni_tran_find() logic. That's used in the C code because I'm validating some things that are "under the hood".

mwpowellhtx commented 7 years ago

@gdamore I don't think we're communicating here, my friend.

The behavior blocks, even with the C code, which I stood up along side the C++ wrapper for comparison. That I can determine, short of compiler definitions, the ONLY difference I can fathom is the nni_* involvement.

So, presently the supposition is that perhaps dialing requires the non-blocking flags?

gdamore commented 7 years ago

Well let me look. It is possible that the initialization is somehow botched but I am skeptical. On Wed, Oct 18, 2017 at 2:33 PM Michael W Powell notifications@github.com wrote:

@gdamore https://github.com/gdamore I don't think we're communicating here, my friend.

The behavior blocks, even with the C code, which I stood up along side the C++ wrapper for comparison. That I can determine, short of compiler definitions, the ONLY difference I can fathom is the nni_* involvement.

So, presently the supposition is that perhaps dialing requires the non-blocking flags?

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/nanomsg/nng/issues/117#issuecomment-337734528, or mute the thread https://github.com/notifications/unsubscribe-auth/ABPDfWn9PQ6l3XL_SSmsFVHPj58XB72Fks5stm6fgaJpZM4P8sCZ .

gdamore commented 7 years ago

I don't have a Windows 7 test bed....

gdamore commented 7 years ago

That said, these tests are running on Windows Server 2012 without any issue whatsoever.

mwpowellhtx commented 7 years ago

The areas that are suspect to me are the call to ::nni_tran_find. I'm not calling that one like in the C test. That I know of, I don't have a reason to, unless it is causing something to pass otherwise.

There was also the supposition of the non-blocking flags? Since I am providing 0 (no flags) like with the C tests, then it is "blocking" infinitely by design?

I don't think this has anything to do with Windows Server 2012 versus Windows 7. Named pipes have been around since NT days. Plus I've written for overlapped pipes on this machine before, so support or not is not the issue.

The usage of the INFINITE keyword in the NNG code, however, is at least one possible concern.

gdamore commented 7 years ago

The blocking should be infinite, yes, as least until a server is ready on the other side. I'm pretty sure that my use of INFINITE is correct in that context.

As I said, my tests pass normally in CI, so I don't know why they aren't for you. I presume you're just running my test suite directly?

gdamore commented 7 years ago

If that's not the case, please paste me your code.

mwpowellhtx commented 7 years ago

I wonder if it is a parsing/naming convention issue.

Windows pipes are in the form \\.\pipe..., correct? Whereas I gather Linux pipes are literally a path to the file system object, i.e. /tmp/nng_ipctest.

Does this naming convention need to be followed for either platform?

Here's my test code as it stands. Much of the actual work is behind transport_fixture struct, and header.

So in other words, perhaps the wait is correct, that the pipe was never created properly in the first place.

gdamore commented 7 years ago

We prefix the value of the pipe address with \.\pipe\

You can see that here: https://github.com/nanomsg/nng/blob/master/src/platform/windows/win_ipc.c#L225

So you don't need to set up anything there. (Oh, btw, I don't use (C) in copyright notices -- it would be redundant at best (to the word "Copyright"), and in ASCII (where there is no © glyph) it has no legal value at all. For the same reason I don't use "All rights reserved." -- this is redundant to the law, and has no value in modern copyright law.)

I will review your test code.

gdamore commented 7 years ago

I'm wondering if you are incrementing the address properly -- I try to ensure that each "major" grouping gets a new address to avoid collisions with prior uses in the suite. One thing you could do is add a printf() (or cout<< debug) at the point you try dial or listen, to make sure that the addresses are what you think they are, and that they are different from any prior uses.

My gut instinct is that address selection is the heart of the problem here.

gdamore commented 7 years ago

Btw, the prefixes above aren't quite right due to github markdown escaping of backslash... But check the actual code.

mwpowellhtx commented 7 years ago

@gdamore The code runs fine for TCP, TCP6, for instance. Just not for the IPC, in either C or C++ cases. Which is a bit unusual. Usually the C++ wrapper will have an issue which I would expect and the C code runs okay.

The only differences that I can tell are already reported.

mwpowellhtx commented 7 years ago

@gdamore You need to add additional backslashes to the markdown for it to report correctly BTW.

mwpowellhtx commented 7 years ago

@gdamore I would buy the address/port proposition if the tests were not working at all for TCP or TCP6, but they are, so that cannot be it.

mwpowellhtx commented 7 years ago

Address is ipc://tmp\\nng_ipctest:\<port> in my working branch. Which does appear to be yielding the name \\\\.\\pipe\\tmp\\nng_ipctest:\<port>.

mwpowellhtx commented 7 years ago

Hrm, maybe it is seeing the :\<port> differently than intended. I'll try updating the addressing strategy.

mwpowellhtx commented 7 years ago

I took the delimiter out for IPC and it still blocks.

gdamore commented 7 years ago

Right... but what is the value of ? Is it changing? Do any of the other nng_dial() operations in the IPC test suite complete?

gdamore commented 7 years ago

You can, btw, set a break point in nni_ipc_ep_connect(). or nni_plat_ipc_ep_connect() to debug.

You can also look at tracing what is happening inside the win_ipc.c file -- and the nni_plat_ipc_ep_connect function in particular.

mwpowellhtx commented 7 years ago

Yes, I'm headed that direction next, possibly. ~I dial with non-block, and it still blocks.~ I take that back. Non blocking does not block. But I will debug into those areas.

mwpowellhtx commented 7 years ago

As previously reported, it connects successfully, but is blocking through calls to: nni_aio_wait(aio) during the nni_ep_dial.

mwpowellhtx commented 7 years ago

Are you sure you've created the pipe with the correct flags? There are literally ALL KINDS of flags, which help to regulate access, even down to things like permissions to read, write, connect, etc, if memory serves.

mwpowellhtx commented 7 years ago

Or perhaps instead of blocking with mutex, set a flag upon overlapped I/O using an atomic. I don't know, I'm just throwing ideas out there now.

mwpowellhtx commented 7 years ago

So, if the INFINITE sleep waiting for the condition variable is appropriate, then it ain't being awoken. That's the only other conclusion I can derive from that scenario at the moment.

void
nni_plat_cv_wait(nni_plat_cv *cv)
{
    (void) SleepConditionVariableSRW(&cv->cv, cv->srl, INFINITE, 0);
}
mwpowellhtx commented 7 years ago

What value? The address is now specified "without colon delimiter" so to say, but that did not change the block.

@gdamore So, the other tests are listening/dialing as they should up to that point.

Only at the listener/dialer step is it blocking on the dial.

REQUIRE_NOTHROW(_repp->listen(addr, &l));
REQUIRE(l.has_one() == true);

REQUIRE_NOTHROW(_reqp->dial(addr, &d)); // dialing with 0 for flags
REQUIRE(d.has_one() == true);
gdamore commented 7 years ago

The actual full string -- including the expanded digits. Each test iteration is supposed to get a unique URL. IF that isn't working properly, chaos may ensue.

I'm pretty sure I'm creating the pipe properly -- I'm mostly using defaults (for now) but please feel free to examine the source code.

One thing you could try, is to remove the initialization step from the trantest.h header, rebuild my test, and see if that changes anything. I contend it should not. (Although.... that might cause the address to be different because of one less test iteration...)

mwpowellhtx commented 7 years ago

It's not the address.

As far as "iterations" are concerned, there is only one iteration. At least modeling after the C style tests.

That said, I'll humor you, but this is not the issue:

void transport_fixture::run_all() {

    const auto addr = get_next_addr(_base_addr);

    //("Given transport for address " + _current_addr).c_str()
    SECTION("Given transport for address '" + addr + "'") {
        run_all(addr);
    }
}

And in the address calculator:

int address_calculator::get_port(int delta) {
    return _port += delta;
}

std::string address_calculator::get_addr(const std::string& base_addr, int delta) {
    std::ostringstream os;
    os << base_addr << _port_delim << get_port(delta);
    return os.str();
}

std::string address_calculator::get_next_addr(const std::string& base_addr, int delta) {
    return get_addr(base_addr, std::abs(delta));
}

And in the run all:

void transport_fixture::run_all(const std::string& addr) {

    run_connection_refused_works(addr);
    run_dup_listeners_rejected(addr);
    run_listener_and_dialer_accepted(addr);
    run_send_and_receive(addr);
    run_send_and_receive_large_data(addr);
}

Which is exactly modeled after the C code:

trantest_init(&tt, addr);

Reset({ trantest_fini(&tt); });

trantest_scheme(&tt);
trantest_conn_refused(&tt);
trantest_duplicate_listen(&tt);
trantest_listen_accept(&tt);
trantest_send_recv(&tt);
trantest_send_recv_large(&tt);

And calculation:

void
trantest_next_address(char *out, const char *template) {
    if (trantest_port == 0) {
        char *pstr;
        trantest_port = 5555;
        if (((pstr = ConveyGetEnv("TEST_PORT")) != NULL) &&
             (atoi(pstr) != 0)) {
            trantest_port = atoi(pstr);
        }
    }
    (void) snprintf(out, NNG_MAXADDRLEN, template, trantest_port);
    trantest_port++;
}

The only difference in the port source is that you draw it from the environment variable whereas I instructed CMake to tell the compiler about it as a definition. That's the ONLY difference there.

Otherwise, I do not see any repeated calls for the next address.

mwpowellhtx commented 7 years ago

@gdamore Heading into the blocking SleepConditionVariableSRW call there is indeed an SRWLock in play, whereas when it does not block up to that point, there does not appear to be any lock, at least gauging from my IntelliSense inspection. Maybe it is an exclusive lock when it should be CONDITION_VARIABLE_LOCKMODE_SHARED? So, some callback or something is potentially not clearing that lock.

gdamore commented 7 years ago

Again, I think we're running into the problem that you are assuming things that are incorrect about the test execution order in the C code.

Each iteration through calls trantest_init()... because each of those subtests is a separate convey block.

This causes each of the trantest_xxx functions to get a different address than the one before or after it. So there are like 6 different addresses used within the test.

I really really wish you had not tried to copy my C tests, because they were never intended to be ported to other languages, or used as models. They were intended for exactly one thing, which was to simplify writing my own tests for the core nng.

mwpowellhtx commented 7 years ago

Oh?

gdamore commented 7 years ago

This is the evil that is setjmp / longjmp in action. I modeled this after a Golang test suite, and for the context I'm using it in, it works great for me. But care must taken when trying to understand what exactly is happening.

gdamore commented 7 years ago

The condition variables and slim locks are right. The function is sleeping pending the notification, and it will sleep forever until that notification arrives. That code is working precisely as designed.

mwpowellhtx commented 7 years ago

I'll put prints in to check, but that's how Catch works as well. Each leaf SECTION is guaranteed an execution, starting from the TEST_MAIN on down. I will check.

gdamore commented 7 years ago

The problem is that your address doesn't change between SECTIONs because the increment is in an outer loop -- I think.

This is why I've been asking you to check for the address values precisely.

mwpowellhtx commented 7 years ago

Can you show me where yours is being incremented?

gdamore commented 7 years ago

It is in trantest_init(). You will see it calls trantest_next_address().

gdamore commented 7 years ago

The design here is that each major test suite gets a new number (from the environment) to start at -- I separate the suites by 10 - and then each subtest can allocate numbers as needed.

This was necessary to prevent TCP port in use, and IPC conflicts as well. TCP on Windows is only working without this in your test environment because Windows socket address reuse is far more generous than POSIX.

mwpowellhtx commented 7 years ago

Unless you can show me how yours is changing, I don't buy it. I think there is a lock that is created that is not being cleared, and so the most obvious answer is that the sleep just waits and waits.

Even if my ports are not incrementing "properly", maybe they aren't, the resources from the previous listener/dialer should be cleared. I've checked my dialers and listeners and they are being closed after each respective section.

@gdamore What do you mean, "Windows is only work ..."? My friend, instead of passing command line/environment ports, I am passing them via compiler definitions. And then, when I focus on a test, that is the ONLY test I am running. I am not running the full battery of tests.

So... can you put some prints in, because I don't see in your code when the port should be incrementing. I just don't see any iterations there. Only stack/jumps that are being popped at the end of each "section". Not that different, in fact, analogous to how Catch operates.

gdamore commented 7 years ago
garrett@triton{11}% git diff ../tests/trantest.h 
diff --git a/tests/trantest.h b/tests/trantest.h
index 37763d2..d1a81ed 100644
--- a/tests/trantest.h
+++ b/tests/trantest.h
@@ -42,6 +42,7 @@ trantest_next_address(char *out, const char *template)
                }
        }
        (void) snprintf(out, NNG_MAXADDRLEN, template, trantest_port);
+printf("NEW ADDRESS IS %s\n", out);
        trantest_port++;
 }
garrett@triton{10}% ./tests/ipc 
NEW ADDRESS IS ipc:///tmp/nng_ipc_test_5555
NEW ADDRESS IS ipc:///tmp/nng_ipc_test_5556
NEW ADDRESS IS ipc:///tmp/nng_ipc_test_5557
NEW ADDRESS IS ipc:///tmp/nng_ipc_test_5558
NEW ADDRESS IS ipc:///tmp/nng_ipc_test_5559
NEW ADDRESS IS ipc:///tmp/nng_ipc_test_5560
NEW ADDRESS IS ipc:///tmp/nng_ipc_test_5561
ok      ./tests/ipc                                            0.056s
gdamore commented 7 years ago

It's time to start actually using a source debugger with your code, friend.

gdamore commented 7 years ago

Btw, here's what TCP does:

garrett@triton{13}% ./tests/tcp
NEW ADDRESS IS tcp://127.0.0.1:5555
NEW ADDRESS IS tcp://127.0.0.1:5556
NEW ADDRESS IS tcp://127.0.0.1:5557
NEW ADDRESS IS tcp://127.0.0.1:5558
NEW ADDRESS IS tcp://127.0.0.1:5559
NEW ADDRESS IS tcp://127.0.0.1:5560
NEW ADDRESS IS tcp://127.0.0.1:5561
NEW ADDRESS IS tcp://127.0.0.1:5562
NEW ADDRESS IS tcp://127.0.0.1:5563
NEW ADDRESS IS tcp://127.0.0.1:5564
NEW ADDRESS IS tcp://*:5565
NEW ADDRESS IS tcp://*:5566
NEW ADDRESS IS tcp://127.0.0.1:5566
ok      ./tests/tcp                                            0.335s

As I said, the only reason your Windows code for TCP possibly be working flawlessly is because Windows is fairly lax about socket address reuse. Even after an application closes the TCP port, there is a kernel state (FIN_WAIT) that normally is used to prevent stray packets from getting mis-directed to a newly opened socket. Windows is extremely lax here, letting multiple programs share a socket address by default.

mwpowellhtx commented 7 years ago

Of course I have been, but if I didn't see that I wouldn't believe it. I'll see how better to construct the Catch style sections.

gdamore commented 7 years ago

Oh, TCP tests are special (you'll notice on that does't increment) -- because at one point we reverse the port number so that we can intentionally reuse the port.