mercury-hpc / mercury

Mercury is a C library for implementing RPC, optimized for HPC.
http://www.mcs.anl.gov/projects/mercury/
BSD 3-Clause "New" or "Revised" License

NA CCI: tcp transport poor performance, incorrect poll(2) handling #158

Closed chuckcranor closed 5 years ago

chuckcranor commented 7 years ago

We are using mercury with cci+tcp for prototyping some of our deltafs work and have had some issues with it.

Generally, what is the state of the cci tcp plugin? Is it currently being worked on? Are there any upcoming fixes in the works? I looked at ctp_tcp_api.c and it looks like the code for handling non-blocking tcp sockets is incomplete and does not function properly. CCI uses poll(2) with non-blocking sockets to check for data. On the client side of the TCP socket it sets POLLOUT for the TCP connection to the server and never unsets it.

This results in the cci tcp_progress_thread spinning in a tight loop burning CPU by calling poll() and having it return immediately because POLLOUT is set and the socket is writable. If you are using cci+tcp in an application, that's really bad because CCI is wasting CPU cycles that should be used by your application.

You can easily verify this problem by putting a static counter in tcp_progress_thread() that counts the number of times it goes through the loop and prints the count when it exits (e.g. finalize time). You can also look at the system time reported by getrusage() with RUSAGE_THREAD when the thread exits. For my test program that simply does 100,000 RPCs, the tcp_progress_thread() ran through the loop about 9.5 million times and racked up 36 seconds of system time (out of 40 seconds the entire process took, so almost all of the system time went to the cci thread doing the poll(2) operation). ouch!

I read the code, but I'm not really clear on what the design is supposed to be. I think that somewhere there needs to be a state machine that turns on POLLOUT if you attempt to write data and get EWOULDBLOCK, and it needs to turn off POLLOUT when you are no longer blocked. I'm also not clear on the locking of the fd poll event flags (if you want to change them).
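
Roughly the pattern I'd expect is something like this (just a sketch with made-up names like xconn, not a proposed patch against ctp_tcp_api.c):

#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

struct xconn {                    /* made-up per-connection state */
    int fd;                       /* non-blocking TCP socket */
    struct pollfd *pfd;           /* this connection's slot in the poll array */
    char pending[65536];          /* bytes send() could not take yet */
    size_t pending_len;
};

/* app queues data: try to send now, arm POLLOUT only if the socket is full */
static void xconn_send(struct xconn *c, const char *buf, size_t len)
{
    ssize_t n = send(c->fd, buf, len, 0);

    if (n == (ssize_t)len)
        return;                              /* all sent, POLLOUT stays off */
    if (n < 0 && errno != EWOULDBLOCK && errno != EAGAIN)
        return;                              /* real error: caller tears down the conn */
    if (n < 0)
        n = 0;
    if (len - (size_t)n > sizeof(c->pending) - c->pending_len)
        return;                              /* sketch only: real code grows a queue */
    memcpy(c->pending + c->pending_len, buf + n, len - (size_t)n);
    c->pending_len += len - (size_t)n;
    c->pfd->events |= POLLOUT;               /* now we care about "writable" */
}

/* progress loop saw POLLOUT on this fd: drain the backlog, then disarm */
static void xconn_writable(struct xconn *c)
{
    ssize_t n = send(c->fd, c->pending, c->pending_len, 0);

    if (n > 0) {
        memmove(c->pending, c->pending + n, c->pending_len - (size_t)n);
        c->pending_len -= (size_t)n;
    }
    if (c->pending_len == 0)
        c->pfd->events &= ~POLLOUT;          /* no longer blocked on the socket */
}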

I also noticed that tcp_recv_msg() has a tight loop, but it is limited by a 16 bit counter. If you tell it to receive 16 bytes, it calls recv() repeatedly until it gets the 16 bytes or it gets 65535 (aka 0xffff) EWOULDBLOCK/EAGAIN errors (then it fails). So it doesn't really cope with partial reads that well...
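
For the receive side, a shape that copes with partial reads would be something like this (again just a sketch with made-up names, not CCI code):

#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/* Returns 1 when the full message is in, 0 to go back to poll(), -1 on error.
 * rx_buf/rx_want/rx_got stand in for per-connection reassembly state. */
static int xconn_recv_some(int fd, char *rx_buf, size_t rx_want, size_t *rx_got)
{
    while (*rx_got < rx_want) {
        ssize_t n = recv(fd, rx_buf + *rx_got, rx_want - *rx_got, 0);

        if (n > 0) {
            *rx_got += (size_t)n;            /* remember progress across calls */
            continue;
        }
        if (n == 0)
            return -1;                       /* peer closed the connection */
        if (errno == EWOULDBLOCK || errno == EAGAIN)
            return 0;                        /* nothing more yet: let poll() wake us */
        if (errno == EINTR)
            continue;
        return -1;                           /* real error */
    }
    return 1;                                /* got everything we asked for */
}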

soumagne commented 7 years ago

@scottatchley Hi Scott, this issue is related to CCI/TCP itself apparently, could you please have a look? Have you also noticed that before? We thought it would be best to post it here as Chuck is also using mercury. Thanks!

scottatchley commented 7 years ago

Hi guys,

I will look at it today. I thought we had fixed this before.

Scott


scottatchley commented 7 years ago

Hi Chuck,

Are you using CCI from git or a release?

If I run CCI’s pingpong in server mode:

$ ./pingpong -s

I see 100% cpu usage:

PID    COMMAND    %CPU   TIME       #TH   #WQ   #PORT   MEM     PURG   CMPRS   PGRP   PPID   STATE
5241   pingpong   98.3   01:34.67   1/1   0     12      4592K   0B     0B      5241   2710   running

When in blocking mode, I see:

$ ./pingpong -s -b

PID    COMMAND    %CPU   TIME       #TH   #WQ   #PORT   MEM      PURG   CMPRS   PGRP   PPID   STATE
5289   pingpong   0.0    00:00.00   2     0     13+     4600K+   0B     0B      5289   2710   sleeping

Is Mercury using the CCI endpoint’s file descriptor?

Scott


chuckcranor commented 7 years ago

hi-

I am using CCI from git (I had trouble with the 2.0 release,

so switched to git).

Mercury na_cci_initialize() does the following:

rc = cci_init(CCI_ABI_VERSION, 0, &caps);
rc = cci_get_devices(&devices); // to select the device

/* Create unspecified endpoint if service is set */
if (service)
    rc = cci_create_endpoint_at(device, service, 0, &endpoint, &fd);
else
    rc = cci_create_endpoint(device, 0, &endpoint, &fd);

the "fd" is the client end of the pipe that cci creates. I believe mercury uses that to poll cci.
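
i.e. something like this on the mercury side, I assume (sketch only; error handling omitted, and the cci_get_event()/cci_return_event() calls are the standard CCI event API):

#include <poll.h>
#include <cci.h>

/* Block until CCI signals activity on the endpoint's fd, then drain events.
 * 'endpoint' and 'fd' come from cci_create_endpoint() as above. */
static void progress_once(cci_endpoint_t *endpoint, int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    if (poll(&pfd, 1, timeout_ms) < 1)
        return;                           /* timeout or error: nothing to do */

    for (;;) {
        cci_event_t *event = NULL;
        if (cci_get_event(endpoint, &event) != CCI_SUCCESS)
            break;                        /* CCI_EAGAIN: no more events for now */
        /* ... hand the event to the upper layer (mercury completion) ... */
        cci_return_event(event);
    }
}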

The tight CPU loop is happening after the TCP connection has been established, so just running "pingpong -s -b" without starting a client won't trigger it.

Here is the latest debugging info on this problem:

I am attaching my debug patch for ctp_tcp_api.c. it is a quick hack that tracks changes in the poll(2) sets for the tcp_progress_thread() routine and other stats.

With my Mercury-based test program, on the server side I see the following:

STARTING CCI THREAD
CCI: new nfds=1
 0: fd=3, events=1
POLL: old1 was 0, now 1

this is before the client starts. fd 3 is tcp_progress_thread's end of the pipe it uses to communicate with mercury. it has set POLLIN (meaning that it is looking for input from mercury).

CCI: new nfds=2
 0: fd=3, events=1
 1: fd=7, events=1
POLL: old2 was 0, now 1
POLL: old2 was 1, now 5

this is after the client starts. fd=3 is still active with POLLIN, but now it has added fd=7 which is the active TCP connection to the client. it initially sets fd=7 to POLLIN ("now 1"). then it adds the POLLOUT bit ("now 5") so it is set to POLLIN|POLLOUT.

it doesn't change the poll fd structure after this.

when the process finalizes, I have the cci tcp thread use getrusage() to printout its CPU usage:

0: cci: times: wall=34.481450, usr=1.456000, sys=8.552000 (secs)
0: cci: minflt=31, majflt=0, inb=0, oub=0, vcw=37, ivcw=8

the overall usage for the entire process is:

0: ALL: times: wall=34.472164, usr=1.472000, sys=8.732000 (secs)
0: ALL: minflt=95, majflt=0, inb=0, oub=0, vcw=401, ivcw=10

so you can see that of the 8.732000 of system time the process uses, most of it goes to the cci tcp_progress_thread() thread.

I also put counters in the tcp_progress_thread() loop:
  c1 = total number of times through the loop
  c2 = number of times poll() returns <1 (no file descriptors ready, prob a timeout)
  c3 = number of times we called tcp_progress_ep()

at finalize time, the server reports:

CCI-STAT: c1=2570687, c2=24, c3=2570663

so it only timed out 24 times. this is likely due to POLLOUT being set on fd=7.

likewise, for the client:

STARTING CCI THREAD
CCI: new nfds=1
 0: fd=3, events=1
OK POLL FD = 4
POLL: old1 was 0, now 1

there is the pipe at fd=3, active before the TCP connection is started.

CCI: new nfds=2
 0: fd=3, events=1
 1: fd=7, events=4
POLL: old2 was 0, now 4
POLL: old2 was 4, now 5

there is the TCP connection going from POLLOUT to POLLOUT|POLLIN and it stays at POLLOUT|POLLIN for the rest of the run. when the client finishes sending its RPCs we finally get:

POLL: old2 was 5, now 0

(prob when it closes the TCP connection)

the loop counters for the client:

CCI-STAT: c1=2641140, c2=11, c3=2641129

also, if I put a sleep 3 between RPC calls on the client and do an "strace -p -f" I get endless screens of this kind of stuff:

[pid 21233] poll([{fd=3, events=POLLIN}, {fd=7, events=POLLIN|POLLOUT}], 2, 1000) = 1 ([{fd=7, revents=POLLOUT}])
[pid 21233] poll([{fd=7, events=POLLIN|POLLOUT}], 1, 0) = 1 ([{fd=7, revents=POLLOUT}])
[pid 21233] poll([{fd=3, events=POLLIN}, {fd=7, events=POLLIN|POLLOUT}], 2, 1000) = 1 ([{fd=7, revents=POLLOUT}])
[pid 21233] poll([{fd=3, events=POLLIN}], 1, 0) = 0 (Timeout)

For TCP, I think once the connection is established, you only want to set POLLOUT if you tried to write and got an EWOULDBLOCK (i.e. the socket buffer is full). You'd want to clear it once there is space in the socket buffer. You definitely don't want to leave it on full time like it is now.

chuck

diff --git a/src/plugins/ctp/tcp/ctp_tcp_api.c b/src/plugins/ctp/tcp/ctp_tcp_api.c
index a60ff32..b81c2af 100644
--- a/src/plugins/ctp/tcp/ctp_tcp_api.c
+++ b/src/plugins/ctp/tcp/ctp_tcp_api.c
@@ -3737,24 +3737,123 @@ out:
 	return ret;
 }
 
+/*XXXUSEPROBE*/
+#include <sys/time.h>
+#include <sys/resource.h>
+
+/*
+ * start-end usage state
+ */
+struct useprobe {
+    int who;                /* flag to getrusage */
+    struct timeval t0, t1;
+    struct rusage r0, r1;
+};
+
+static void useprobe_start(struct useprobe *up, int who) {
+    up->who = who;
+    gettimeofday(&up->t0, NULL);
+    getrusage(up->who, &up->r0);
+}
+
+static void useprobe_end(struct useprobe *up) {
+    gettimeofday(&up->t1, NULL);
+    getrusage(up->who, &up->r1);
+}
+
+/*
+ * print usage
+ */
+void useprobe_print(struct useprobe *up, char *tag, int n) {
+    double start, end;
+    double ustart, uend, sstart, send;
+    long nminflt, nmajflt, ninblock, noublock, nnvcsw, nnivcsw;
+
+    start = up->t0.tv_sec + (up->t0.tv_usec / 1000000.0);
+    end = up->t1.tv_sec + (up->t1.tv_usec / 1000000.0);
+    ustart = up->r0.ru_utime.tv_sec + (up->r0.ru_utime.tv_usec / 1000000.0);
+    uend = up->r1.ru_utime.tv_sec + (up->r1.ru_utime.tv_usec / 1000000.0);
+    sstart = up->r0.ru_stime.tv_sec + (up->r0.ru_stime.tv_usec / 1000000.0);
+    send = up->r1.ru_stime.tv_sec + (up->r1.ru_stime.tv_usec / 1000000.0);
+    nminflt = up->r1.ru_minflt - up->r0.ru_minflt;
+    nmajflt = up->r1.ru_majflt - up->r0.ru_majflt;
+    ninblock = up->r1.ru_inblock - up->r0.ru_inblock;
+    noublock = up->r1.ru_oublock - up->r0.ru_oublock;
+    nnvcsw = up->r1.ru_nvcsw - up->r0.ru_nvcsw;
+    nnivcsw = up->r1.ru_nivcsw - up->r0.ru_nivcsw;
+    printf("%d: %s: times: wall=%f, usr=%f, sys=%f (secs)\n", n, tag,
+        end - start, uend - ustart, send - sstart);
+    printf(
+        "%d: %s: minflt=%ld, majflt=%ld, inb=%ld, oub=%ld, vcw=%ld, ivcw=%ld\n",
+        n, tag, nminflt, nmajflt, ninblock, noublock, nnvcsw, nnivcsw);
+}
+/*XXXUSEPROBE*/
+
 static void *tcp_progress_thread(void *arg)
 {
 	cci__ep_t *ep = (cci__ep_t *) arg;
 	tcp_ep_t *tep = NULL;
+	struct useprobe use;
+	long int c1, c2, c3;
 
 	assert (ep);
 	tep = ep->priv;
 
+	printf("STARTING CCI THREAD\n");
+	c1 = c2 = c3 = 0;
+	useprobe_start(&use, RUSAGE_THREAD);
 	while (!ep->closing) {
 		int ret = 0;
+		c1++;
+{
+	static int nfds = 0;
+	static int old1 = 0;
+	static int old2 = 0;
+	int lcv;
+	if (tep->nfds != nfds) {
+		printf("CCI: new nfds=%d\n", tep->nfds);
+		if (nfds == 0) { old1 = old2 = 0; }
+		else if (nfds == 1) { old2 = 0; }
+		nfds = tep->nfds;
+		for (lcv = 0 ; lcv < tep->nfds ; lcv++) {
+			printf(" %d: fd=%d, events=%d\n", lcv, tep->fds[lcv].fd,
+				tep->fds[lcv].events);
+		}
+	}
+	if (nfds >= 1) {
+		if (tep->fds[0].events != old1) {
+			printf("POLL: old1 was %d, now %d\n", old1, tep->fds[0].events);
+			old1 = tep->fds[0].events;
+		}
+	}
+	if (nfds >= 2) {
+		if (tep->fds[1].events != old2) {
+			printf("POLL: old2 was %d, now %d\n", old2, tep->fds[1].events);
+			old2 = tep->fds[1].events;
+		}
+	}
+}
 		ret = poll(tep->fds, tep->nfds, 1000);
-		if (ret < 1)
+		if (ret < 1) {
+			c2++;
 			continue;
+		}
+		c3++;
 		tcp_progress_ep(ep);
 	}
 
+	useprobe_end(&use);
+	useprobe_print(&use, "cci", 0);
+	printf("CCI-STAT: c1=%ld, c2=%ld, c3=%ld\n", c1, c2, c3);
+
 	pthread_exit(NULL);
 	return (NULL);	/* make pgcc happy */

scottatchley commented 7 years ago

Ok, let me look after a connection is made.


scottatchley commented 7 years ago

Ok, please try now.

When I pull a tx off of the connection’s send queue, I check if it is empty. If so, I remove POLLOUT from the connection's pollfd.
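
In other words, roughly this shape (placeholder names, not the actual ctp_tcp code):

#include <poll.h>

/* Once the tx that was just dequeued leaves the send queue empty, stop asking
 * poll() about writability so the progress thread goes back to POLLIN only. */
static void tx_dequeued(struct pollfd *conn_pfd, int queue_now_empty)
{
    if (queue_now_empty)
        conn_pfd->events &= ~POLLOUT;
    else
        conn_pfd->events |= POLLOUT;         /* more queued: keep POLLOUT armed */
}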

I separately paused both the CCI pingpong client and server to verify that the peer was not spinning.


chuckcranor commented 7 years ago

On Fri, Feb 17, 2017 at 08:38:54AM -0800, Scott Atchley wrote:

Ok, please try now.

When I pull a tx off of the connection's send queue, I check if it is empty. If so, I remove POLLOUT from the connection's pollfd.

I separately paused both the CCI pingpong client and server to verify that the peer was not spinning.

It has definitely improved in a quick test I just did. We will test it more and I'll advise if we see more problems.

thanks, chuck

chuckcranor commented 7 years ago

On Fri, Feb 17, 2017 at 08:38:54AM -0800, Scott Atchley wrote:

Ok, please try now.

so I did some more tests with this using Mercury over the weekend.
I varied the number of outstanding Mercury RPC requests issued in parallel from 1 to 256 and measured the sec per op and used getrusage() to measure how much system time was used by the client and server process.

These numbers were averaged over three runs where each run makes 100000 Mercury RPC calls (the RPC has a uint32 arg and a uint32 response, so small buffers).

BMI gave the following baseline:

bmi   1   0.000201 sec per op, cli/srv sys time 6.277333 / 3.108000 sec
bmi   2   0.000106 sec per op, cli/srv sys time 6.306667 / 3.286667 sec
bmi   4   0.000063 sec per op, cli/srv sys time 6.508000 / 2.733333 sec
bmi   8   0.000043 sec per op, cli/srv sys time 5.092000 / 1.625333 sec
bmi  16   0.000040 sec per op, cli/srv sys time 4.812000 / 1.582667 sec
bmi  32   0.000042 sec per op, cli/srv sys time 4.909333 / 1.632000 sec
bmi  64   0.000040 sec per op, cli/srv sys time 4.990667 / 1.646667 sec
bmi 128   0.000041 sec per op, cli/srv sys time 4.810667 / 1.564000 sec
bmi 256   0.000043 sec per op, cli/srv sys time 4.902667 / 1.641333 sec

CCI (tcp) before the fix gave this:

cci   1   0.000336 sec per op, cli/srv sys time 48.781333 / 42.469333 sec
cci   2   0.000148 sec per op, cli/srv sys time 29.180000 / 25.520000 sec
cci   4   0.000083 sec per op, cli/srv sys time 21.242667 / 19.374667 sec
cci   8   0.000052 sec per op, cli/srv sys time 17.024000 / 15.492000 sec
cci  16   0.000050 sec per op, cli/srv sys time 16.708000 / 15.272000 sec
cci  32   0.000050 sec per op, cli/srv sys time 16.418667 / 15.388000 sec
cci  64   0.000048 sec per op, cli/srv sys time 16.273333 / 14.978667 sec
cci 128   0.000049 sec per op, cli/srv sys time 16.185333 / 15.206667 sec
cci 256   0.000049 sec per op, cli/srv sys time 16.348000 / 15.506667 sec

CCI (tcp) with the fix changes that to:

cci   1   0.000357 sec per op, cli/srv sys time 16.489333 / 9.104000 sec
cci   2   0.000159 sec per op, cli/srv sys time 10.546000 / 8.368000 sec
cci   4   0.000090 sec per op, cli/srv sys time  9.672000 / 8.197333 sec
cci   8   0.000053 sec per op, cli/srv sys time  7.630667 / 6.398667 sec
cci  16   0.000050 sec per op, cli/srv sys time  7.292000 / 6.325333 sec
cci  32   0.000052 sec per op, cli/srv sys time  7.340000 / 6.454667 sec
cci  64   0.000050 sec per op, cli/srv sys time  7.472000 / 6.474667 sec
cci 128   0.000052 sec per op, cli/srv sys time  7.520000 / 6.476000 sec
cci 256   0.000052 sec per op, cli/srv sys time  7.637333 / 6.472000 sec

Observations:

  • the fix reduces cci+tcp CPU time as hoped

  • cci+tcp still uses more resources than BMI even with the fix. it seems to me that the cci+tcp numbers should be closer to BMI.

  • for some reason I haven't figured out, the cci+tcp run sometimes hangs if the max number of concurrent RPC ops is 2. it just sits there until the alarm(3) timer goes off. I had that happen both with and without the cci+tcp fix... not sure what is going on there.

I also ran the test with 4 instances of Mercury active in each process (e.g. the client process has independent client threads in it, each client thread making 100000 RPC calls):

CCI (tcp) with fix, 4 instances of Mercury active in each process:

cci   1   0.000435 sec per op, cli/srv sys time 65.830667 / 39.286667 sec
cci   2            sec per op, cli/srv sys time           /           sec
cci   4   0.000168 sec per op, cli/srv sys time 46.600000 / 35.685333 sec
cci   8   0.000140 sec per op, cli/srv sys time 39.313333 / 33.645333 sec
cci  16   0.000120 sec per op, cli/srv sys time 34.037333 / 32.190667 sec
cci  32   0.000121 sec per op, cli/srv sys time 34.101333 / 31.993333 sec
cci  64   0.000119 sec per op, cli/srv sys time 33.690667 / 31.216000 sec
cci 128   0.000112 sec per op, cli/srv sys time 32.116000 / 30.801333 sec
cci 256   0.000107 sec per op, cli/srv sys time 31.149333 / 29.938667 sec

the lack of data for the "2" line is because all 3 runs hung as described above...

chuck

scottatchley commented 7 years ago

Hi Chuck,

I have not looked at the BMI TCP protocol implementation, but one reason that CCI’s TCP transport may have different performance is that a CCI send or RMA does not complete until the data is resident in the peer’s memory (i.e. been acked by the peer). If BMI acks after calling send() without waiting for a BMI-level ack from the peer, it will complete faster than CCI. But if the data is buffered and the connection breaks, the data might not have been delivered.

Phil (@carns), can you shed any light on this theory?

Scott


carns commented 7 years ago

In the BMI library itself, it will indeed indicate local completion of a send without confirming arrival in memory at the remote host. Completing a send just means that the buffer is safe to reuse for some other purpose (similar to MPI semantics). This is for a conventional 2 sided operation; BMI itself doesn't have an RMA API.

I think that the na_bmi implementation in Mercury does ack puts before completion (right @soumagne ?), though. In other words: for Mercury req and ack, the sends will complete as soon as they are transmitted, but for Mercury bulk operations, the put will not complete until the peer has confirmed arrival.

chuckcranor commented 7 years ago

On Mon, Feb 20, 2017 at 12:17:29PM -0800, carns wrote:

I think that the na_bmi implementation in Mercury does ack puts before completion (right @soumagne ?), though. In other words: for Mercury req and ack, the sends will complete as soon as they are transmitted, but for Mercury bulk operations, the put will not complete until the peer has confirmed arrival.

If I understand what you are saying, it is Mercury itself that handles this (not the BMI NA layer):

If you do HG_Forward(handle, callback_fn, arg, &input_struct), it posts the request and returns. The call stays active via the handle (which you can use to cancel it).

But the "callback_fn" isn't called by Mercury until we get the reply from the remote RPC server. (the callback is called in the context of the Mercury trigger function....)

(This is just normal RPC, not using Mercury bulk operations)
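
The usual shape of this, as I understand the Mercury API (error checking omitted, and the 100 ms progress timeout is arbitrary):

#include <mercury.h>

/* HG_Forward() returns right away; the callback only fires from HG_Trigger()
 * once the server's response has come back. */
static hg_return_t rpc_done_cb(const struct hg_cb_info *info)
{
    int *done = (int *)info->arg;
    *done = 1;                  /* reply arrived; output can now be decoded with HG_Get_output() */
    return HG_SUCCESS;
}

static void do_one_rpc(hg_context_t *context, hg_handle_t handle, void *in_struct)
{
    int done = 0;
    unsigned int count;

    HG_Forward(handle, rpc_done_cb, &done, in_struct);  /* returns immediately */
    while (!done) {
        HG_Progress(context, 100);                      /* make network progress */
        HG_Trigger(context, 0, 1, &count);              /* run completed callbacks */
    }
}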

chuck


carns commented 7 years ago

I think that's right for normal RPCs without bulk transfers.

If you were to add in a bulk operation I think there would be an additional round of handshaking (implemented in between libbmi and Mercury, in the na_bmi.c code). I believe that the bulk case with BMI does use an extra ack because hypothetically you could run multiple bulk transfers independent of the RPC req/ack sequence, so you can't rely on the presence of an implied higher level ack in the bulk xfer case like you can for pure RPCs.

soumagne commented 7 years ago

Right just to confirm what Phil was mentioning, when doing a bulk transfer with BMI, a request gets first sent and then the data is either sent or recv'ed with an additional ack being sent in the case of a put. But since Chuck you're talking about RPC without bulk data, nothing else happens in that case at the NA BMI level, except calling the matching BMI function.

chuckcranor commented 7 years ago

On Mon, Feb 20, 2017 at 12:41:37PM -0800, carns wrote:

I think that's right for normal RPCs without bulk transfers.

OK. The numbers I'm looking at are just for normal RPCs right now, no Mercury bulk operations:

e.g. doing 100000 small Mercury RPCs one at a time gives:

BMI: 0.000201 sec per op, cli/srv sys time 6.277333 / 3.108000 sec

CCI: 0.000357 sec per op, cli/srv sys time 16.489333 / 9.104000 sec

chuck

carns commented 7 years ago

Ok, then yes, that difference in handshaking is almost certainly a factor. I just re-read Scott's description and saw that CCI send() waits for confirmation of data in remote memory, not just the RMA operations. So CCI has an extra step that isn't present in BMI in this case.

chuckcranor commented 7 years ago

On Mon, Feb 20, 2017 at 01:04:35PM -0800, carns wrote:

Ok, then yes, that difference in handshaking is almost certainly a factor.

I traced a single RPC and I got this:

BMI TOTALS:
  -> 2 TCP round trips (in 4 TCP packets)
  -> 2 writev system calls (send on client, send reply on server)
  -> 6 recvfrom/readv system calls (MSG_PEEK 25, recv 25, readv the rest)

CCI TOTALS:
  -> 4 TCP round trips (in 8 TCP packets)
  -> 4 sendto system calls (client send req, server ack req, srvr send reply, cli reply ack)
  -> 6 recvfrom system calls (srvr read req header, srvr read req body, client read req ack,
     client read rep header, client read rep body, srvr read rep ack)
  -> 4 writes/4 reads of char 'a' on the pipe between cci's thread and mercury.

so lots more work being done by CCI.

I also noticed from strace a really long string of repeated system calls to clock_gettime() on the cci client which looks odd.
e.g. compare bmi and cci:

% fgrep clock_gettime bmi/trace.cli | wc -l
1092
% fgrep clock_gettime cci/cli.trace | wc -l
5905

Of those 5905 calls to clock_gettime(), 5367 are made in one big block without any other system call between them.
It appears to be happening in the thread running HG_Progress/HG_Trigger. hmmm.

here's some details:

BMI:

client                                  server
----------------------------------------------

/* write request */
writev(5, [{"",25},{"",20}], 2) = 45

                       /* request on network */
                         -TCP(length=45)->
                         <-TCP(len=0,ack)-

                                   /* read request */
                                   recvfrom(6, "", 25, MSG_PEEK) = 25
                                   recvfrom(6, "", 25) = 25
                                   readv(6, [{"",20}], 1) = 20

                                   /* send reply */
                                   writev(6, [{"",25},{"",16}], 2) = 41

                       /* reply on network */
                         <-TCP(len=41)-
                         -TCP(len=0,ack)->

/* read reply */
recvfrom(5, "", 25, MSG_PEEK) = 25
recvfrom(5, "", 25) = 25
readv(5, [{"",16}], 1) = 16

CCI:

client                                  server
----------------------------------------------

/* write request */
sendto(7, "", 32) = 32

                       /* request on network */
                         -TCP(length=32)->
                         <-TCP(len=0,ack)-

                                   /* read request */
                                   recvfrom(7, "", 8) = 8
                                   recvfrom(7, "", 24) = 24
                                   /* tell mercury */
                                   write(5, "a", 1) // pipe
                                   /* ack the request */
                                   sendto(7, "", 8) = 8
                                   read(4, "a", 1) // pipe

                      /* request ack on network */
                         <-TCP(len=8)-
                         -TCP(len=0,ack)->

/* read ack of request */
recvfrom(7, "", 8) = 8
/* tell mercury */
write(5, "a", 1)
read(4, "a", 1)

                                   /* send RPC reply */
                                   sendto(7, "", 28) = 28

                         /* reply on network */
                         <-TCP(len=28)-
                         -TCP(len=0,ack)->

/* read reply */
recvfrom(7, "", 8) = 8
recvfrom(7, "", 20) = 20
/* tell mercury */
write(5, "a", 1)
read(4, "a", 1)

/* ack the reply */
sendto(7, "", 8)

                         /* reply ack on network */
                         -TCP(len=8)->
                         <-TCP(len=0,ack)-

                                   /* read ack of reply */
                                   recvfrom(7, "", 8) = 8
                                   /* tell mercury */
                                   write(5, "a", 1)
                                   read(4, "a", 1)

soumagne commented 7 years ago

One difference also is that BMI and CCI plugins have slightly different code paths in mercury as CCI exposes a "fd" that we can use to poll and make progress. So when using BMI, hg_core_progress_na() will be used, whereas when using CCI, it will be hg_core_progress_poll(). It is possible that there is a different number of calls to hg_time_get_current() / clock_gettime() then, although that seems relatively much higher in that case...

chuckcranor commented 7 years ago

On Mon, Feb 20, 2017 at 09:15:58PM -0800, Jerome Soumagne wrote:

One difference also is that BMI and CCI plugins have slightly different code paths in mercury as CCI exposes a "fd" that we can use to poll and make progress. So when using BMI, hg_core_progress_na() will be used, whereas when using CCI, it will be hg_core_progress_poll(). It is possible that there is a different number of calls to hg_time_get_current() / clock_gettime() then, although that seems relatively much higher in that case...

for that block of clock_gettime() calls, I think it is spinning in na_cci_progress(). I got a stack trace in the middle of the block of clock_gettime() calls:

HG_Core_progress()
  hg_core_progress_poll()
    hg_poll_wait()
      hg_core_progress_na_cb()
        NA_Progress()
          na_cci_progress()
            hg_time_get_current()

Looking at the na_cci_progress() source, I think it is busy-polling cci_get_event(). If cci_get_event() returns CCI_EAGAIN, it loops back and calls it again and again until the timeout expires or an event comes in.

Should it do that if we are using the exposed "fd"?
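
i.e. I'd expect something shaped like this instead (a sketch, not the real na_cci.c): drain whatever is ready and return on CCI_EAGAIN, since hg_poll_wait() already tells us when the fd is readable.

#include <cci.h>

static int progress_drain(cci_endpoint_t *endpoint)
{
    int rc;
    cci_event_t *event = NULL;

    while ((rc = cci_get_event(endpoint, &event)) == CCI_SUCCESS) {
        /* ... dispatch the event to the NA completion path ... */
        cci_return_event(event);
        event = NULL;
    }
    return rc;    /* CCI_EAGAIN here just means "nothing more right now" */
}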

chuck

scottatchley commented 7 years ago

Jerome,

Is it possible that it received a POLLIN on the CCI fd and went into polling mode until an event was found? Some internal CCI traffic can trigger a POLLIN that does not result in a CCI event being returned.

Scott


soumagne commented 7 years ago

I see, yes if POLLIN on the CCI fd is received, it goes into na_cci_progress() assuming that there is an event.

soumagne commented 7 years ago

OK I guess we don't need the loop then anymore

soumagne commented 7 years ago

Although maybe we do and we should just exit if CCI_EAGAIN is found?

scottatchley commented 7 years ago

I think so.

On Feb 21, 2017, at 11:35 AM, Jerome Soumagne notifications@github.com wrote:

Although maybe we do and we should just exit if CCI_EAGAIN is found?

soumagne commented 7 years ago

Chuck, can you try again with 4821fe9 that I just pushed?

chuckcranor commented 7 years ago

On Tue, Feb 21, 2017 at 08:50:46AM -0800, Jerome Soumagne wrote:

Chuck, can you try again with 4821fe9 that I just pushed?

yes, that reduced the time calls quite a bit:

% fgrep clock_gettime /tmp/okok2 | wc
   7334   51338  535203
% fgrep clock_gettime /tmp/okok3 | wc
    807    5649   58806
%

"okok3" is the one with 4821fe9.

chuck

soumagne commented 7 years ago

Thanks! I'll merge it to master then. Thanks again Scott for the insight!

chuckcranor commented 7 years ago

On Tue, Feb 21, 2017 at 08:50:46AM -0800, Jerome Soumagne wrote:

Chuck, can you try again with 4821fe9 that I just pushed?

here's more data:

BEFORE (varying number of RPCs that can be sent in parallel):
cci   1   0.000357 sec per op, cli/srv sys time 16.489333 / 9.104000 sec
cci   2   0.000159 sec per op, cli/srv sys time 10.546000 / 8.368000 sec
cci   4   0.000090 sec per op, cli/srv sys time  9.672000 / 8.197333 sec
cci   8   0.000053 sec per op, cli/srv sys time  7.630667 / 6.398667 sec
cci  16   0.000050 sec per op, cli/srv sys time  7.292000 / 6.325333 sec
cci  32   0.000052 sec per op, cli/srv sys time  7.340000 / 6.454667 sec
cci  64   0.000050 sec per op, cli/srv sys time  7.472000 / 6.474667 sec
cci 128   0.000052 sec per op, cli/srv sys time  7.520000 / 6.476000 sec
cci 256   0.000052 sec per op, cli/srv sys time  7.637333 / 6.472000 sec

AFTER (varying number of RPCs that can be sent in parallel):
cci   1   0.000353 sec per op, cli/srv sys time 11.266667 / 8.842667 sec
cci   2            sec per op, cli/srv sys time           / 8.496000 sec
cci   4   0.000090 sec per op, cli/srv sys time  8.510667 / 8.072000 sec
cci   8   0.000054 sec per op, cli/srv sys time  7.416000 / 6.557333 sec
cci  16   0.000051 sec per op, cli/srv sys time  7.342667 / 6.369333 sec
cci  32   0.000050 sec per op, cli/srv sys time  7.270667 / 6.209333 sec
cci  64   0.000049 sec per op, cli/srv sys time  7.212000 / 6.013333 sec
cci 128   0.000049 sec per op, cli/srv sys time  7.158667 / 6.036000 sec
cci 256   0.000050 sec per op, cli/srv sys time  7.349333 / 6.048000 sec

note the reduction of CPU system time (esp. the client which drops from 16.5 to 11.3 seconds in the serialized RPC case).

Unfortunately, I'm still seeing some hanging (which is why I don't have any data for 2). I have not tried to track that down yet....

Also, the system CPU time is still more than with BMI...

bmi   1   0.000202 sec per op, cli/srv sys time 6.448000 / 3.221333 sec
bmi   2   0.000106 sec per op, cli/srv sys time 6.680000 / 3.297333 sec
bmi   4   0.000064 sec per op, cli/srv sys time 6.738667 / 2.754667 sec
bmi   8   0.000044 sec per op, cli/srv sys time 5.276000 / 1.721333 sec
bmi  16   0.000041 sec per op, cli/srv sys time 5.136000 / 1.541333 sec
bmi  32   0.000040 sec per op, cli/srv sys time 4.809333 / 1.586667 sec
bmi  64   0.000043 sec per op, cli/srv sys time 5.264000 / 1.593333 sec
bmi 128   0.000042 sec per op, cli/srv sys time 5.146667 / 1.593333 sec
bmi 256   0.000042 sec per op, cli/srv sys time 4.436000 / 1.568000 sec

chuck

soumagne commented 7 years ago

Actually, tests are hanging now too; see https://cdash.hdfgroup.org/index.php?project=Mercury (build number 607). I guess sometimes we're not waking up now...

scottatchley commented 7 years ago

This might be hitting another issue within the TCP transport mentioned on an older thread. Let me look.

For now, revert the last change where it returns if EAGAIN is found.

On Feb 21, 2017, at 2:39 PM, Jerome Soumagne notifications@github.com wrote:

Actually tests are hanging too now, see https://cdash.hdfgroup.org/index.php?project=Mercury (for build number 607), I guess sometimes we're not waking up now...

soumagne commented 7 years ago

OK I reverted the fix for now

scottatchley commented 7 years ago

Hi Jerome,

Can you point me to the test that is failing? I would like to reproduce here to aid debugging.

Thanks,

Scott

On Feb 21, 2017, at 3:28 PM, Jerome Soumagne notifications@github.com wrote:

OK I reverted the fix for now

soumagne commented 7 years ago

Hi Scott,

you can see the tests here: https://cdash.hdfgroup.org/index.php?project=Mercury&date=2017-02-21 (build 607), https://cdash.hdfgroup.org/viewTest.php?onlyfailed&buildid=2077, it looks like these were mostly the tests that involve a bulk transfer.

For example: https://cdash.hdfgroup.org/testDetails.php?test=30679&build=2077

Thanks!

Jerome

scottatchley commented 7 years ago

Ok, so I will download the latest code, build with CCI, and then run the tests.

Thanks.


scottatchley commented 7 years ago

Hi Jerome,

I just pushed a fix to github for the TCP transport. I remove the per-connection lock and use the endpoint lock instead. Also, when in blocking mode, I copy the pollfds and poll on the copy. This seems to eliminate all of the thread race conditions.
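
Roughly this shape (placeholder names, not the actual ctp_tcp code):

#include <poll.h>
#include <pthread.h>
#include <string.h>

#define MAX_SNAP_FDS 64                      /* assume a small, bounded fd set for the sketch */

/* Snapshot the pollfd array under the endpoint lock, then poll() the snapshot,
 * so other threads can add/remove fds or flip POLLOUT without racing poll(). */
static int poll_snapshot(pthread_mutex_t *ep_lock, const struct pollfd *fds,
                         nfds_t nfds, int timeout_ms, struct pollfd *out)
{
    nfds_t n;

    pthread_mutex_lock(ep_lock);
    n = nfds < MAX_SNAP_FDS ? nfds : MAX_SNAP_FDS;
    memcpy(out, fds, n * sizeof(*out));      /* copy fd + events under the lock */
    pthread_mutex_unlock(ep_lock);

    return poll(out, n, timeout_ms);         /* revents show up in the caller's copy */
}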

With this fix, all Mercury tests complete. Can you try it out? If they work, can you also try with your commit that avoids spinning inside na_cci?

Scott


soumagne commented 7 years ago

Hi Scott,

I'm running the tests through travis now so it may take a while before it completes as it has to recompile cci, I will let you know the results.

Thanks!

Jerome

scottatchley commented 7 years ago

Jerome,

The ctest completes successfully on a RHEL6.7 box. It hangs on my MacOSX laptop. If I run tests manually, the client completes and the server hangs. If I run it under lldb, there is only one thread (if there is a progress thread in Mercury, it has exited, and the TCP progress thread has exited). Its backtrace is:

(lldb) bt

I suspect it is waiting for an event from CCI that was dropped. If so, I know what needs to be done.

Scott

soumagne commented 7 years ago

It looks like that's the case. I'm getting that problem on finalize too, it seems; see for example https://cdash.hdfgroup.org/testDetails.php?test=32783&build=2264: the test ran correctly and we got 4 correct bulk transfers, but then it seems to hang in finalize. I could not reproduce it locally every time, though (sometimes it passes, sometimes not).

scottatchley commented 7 years ago

Ok. I am working on not dropping these items.