m-lab / ndt

Network Diagnostic Tool
Other
11 stars 7 forks source link

No more freezing if pcap process goes wrong #51

Closed pboothe closed 8 years ago

pboothe commented 8 years ago

If packet capture fails to start for whatever reason, then the whole client gets hung in a read. When enough clients are hung, then all incoming clients get queued, and the queue never gets any smaller.

This adds a select() in front of the read() to ensure that the pipe becomes readable within a second. Thus, if the packet trace is unable to start, the test can still proceed without it.


This change is Reviewable

mtlynch commented 8 years ago

I felt like the commit description on 5bce8cd was the most helpful in giving context for this PR. Can we integrate that into the PR description? I'm not sure if you're planning to do that as part of squash & merge, but I feel like it's helpful to aim for the final PR description as early as possible so the reviewer's getting the same context that someone would get when they go back to read this (squashed) commit in the change history.


Review status: 0 of 1 files reviewed at latest revision, 7 unresolved discussions.


_src/test_c2s_srv.c, line 533 [r1] (raw file):_ This function is obscenely long already. We're adding 3 major chunks, so I think each of them should be reduced to (at most) single line function calls with the logic in a helper function.

(optional) If possible, we should make the net effect on test_c2s's size negative (e.g. the next 30ish lines seem refactorable into a function even though we're not otherwise touching the middle 10 lines).


_src/test_c2s_srv.c, line 699 [r1] (raw file):_ Can we update this comment?


_src/test_c2s_srv.c, line 723 [r1] (raw file):_ Can we refactor to avoid duplicating 704-708?


_src/test_c2s_srv.c, line 727 [r1] (raw file):_ (not your fault, but while we're touching this) It seems like we reach this block when select times out. Is that intentional? Both msgretvalue and errno should be 0 in that case. If this is intentional, can we document that, because it's a bit unintuitive.


_src/test_c2s_srv.c, line 738 [r1] (raw file):_ Was this referring to what is now 725?


_src/test_c2s_srv.c, line 766 [r1] (raw file):_ Can we document why we don't care if this blocks?


_src/test_c2s_srv.c, line 767 [r1] (raw file):_ Suggest packet_trace_running = 0 for good measure.


_Comments from Reviewable_

pboothe commented 8 years ago

Review status: 0 of 1 files reviewed at latest revision, 7 unresolved discussions.


_src/test_c2s_srv.c, line 533 [r1] (raw file):_ I agree that this function is a dumpster fire. However, bugfixes should not be held up by cleanups.

Aside from the fact that refactoring a 530 line function is not what I signed up for as fixer of this bug, cleanups entail more risk than I think we can tolerate at this time. We need this bug to be fixed ASAP, and fixed in a way that definitely doesn't introduce any new bugs. Every cleanup I have ever done of code in NDT has revealed hidden assumptions and dependencies between far-flung pieces of the program, and has broken a lot of things that worked by accident. I don't want that to happen here and prevent the timely rollout of a fix.


_src/test_c2s_srv.c, line 699 [r1] (raw file):_ I have no idea what this comment ever meant. I can delete it, but can not update it.


_src/test_c2s_srv.c, line 723 [r1] (raw file):_ You're proposing spending 12 lines to avoid spending 4, so I'd prefer to keep as is. If we start engaging in cleanups we will be here for weeks.


_src/test_c2s_srv.c, line 727 [r1] (raw file):_ It is intentional, and is a preservation of existing behavior. Documented.


_src/test_c2s_srv.c, line 738 [r1] (raw file):_ It was and is referring to the read() on the next line.


_src/test_c2s_srv.c, line 766 [r1] (raw file):_ I am preserving existing behavior, and I have no idea whether later pieces of code will reliably work okay if I change this IO to non-blocking.

The only reason it's okay to change the other one to non-blocking is because the alternative is never calling stop_packet_trace() at all, and so we mitigate the risk of resource leaks by calling it in non-blocking mode as part of an obscure error pathway.

I would rate changing the blocking/non-blocking status of this particular write as a risky change. Every time I have changed a piece of NDT code in a way that affects threading and/or IPC it has caused bugs, sometimes a lot and in code that is very distant, and that's what the blocking/non-blocking change affects.


_src/test_c2s_srv.c, line 767 [r1] (raw file):_ Done.


Comments from Reviewable

mtlynch commented 8 years ago

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion.


_src/test_c2s_srv.c, line 533 [r1] (raw file):_

Previously, pboothe (Peter Boothe) wrote… > I agree that this function is a dumpster fire. However, bugfixes should not be held up by cleanups. > > Aside from the fact that refactoring a 530 line function is not what I signed up for as fixer of this bug, cleanups entail more risk than I think we can tolerate at this time. We need this bug to be fixed ASAP, and fixed in a way that definitely doesn't introduce any new bugs. Every cleanup I have ever done of code in NDT has revealed hidden assumptions and dependencies between far-flung pieces of the program, and has broken a lot of things that worked by accident. I don't want that to happen here and prevent the timely rollout of a fix.

I think we might be talking about different things. When I say that we should add single function calls instead of adding code blocks, I'm talking about something like changing this block to something like:

int wait_for_packet_trace_start(packet_trace_pipe) {
   ...
}

...

int test_c2s(...) {
   ...
   packet_trace_running = wait_for_packet_trace_start(mon_pipe[0]);
   ...
}

This isn't a cleanup because we're not touching existing code. We're just adding code in a way that minimizes stress on an overtaxed function.

The suggestion I marked as optional does involve lightly touching existing code, but that's why I marked it as optional.


_src/test_c2s_srv.c, line 699 [r1] (raw file):_

Previously, pboothe (Peter Boothe) wrote… > I have no idea what this comment ever meant. I can delete it, but can not update it.

I was talking about s/speed-chk/packet tracing/ but if it makes more sense to delete entirely, that's fine.


_src/test_c2s_srv.c, line 723 [r1] (raw file):_

Previously, pboothe (Peter Boothe) wrote… > You're proposing spending 12 lines to avoid spending 4, so I'd prefer to keep as is. If we start engaging in cleanups we will be here for weeks.

704-708 were fine prior to this change (modulo magic numbers) but if we copy/paste them, we're introducing a problem. I see this as mess prevention rather than cleanup of existing code.

Can we move 704-708 to within the for loop? That doesn't cause a net increase in LOC.

If not, I think minimizing the bloat of test_c2s is more important than minimizing overall LOC of the file, so I think adding a function is worth it even if it causes the overall LOC to increase.

That said, I don't understand the "spending 12 lines to avoid spending 4" calculation. We could refactor the 4 lines into a function that's 6-7 lines, then we reduce each 4-line block to a 1-line function call, so it seems like spending 7 lines to save 6. To me, this seems desirable considering that the 6 lines are coming out of a gigantic function.


_src/test_c2s_srv.c, line 738 [r1] (raw file):_

      } else {
        /* There is something to read, so get it from the pktpair child.  If an
         * interrupt occurs, just skip the read and go on

I'm confused then. It says:

If an interrupt occurs, just skip the read and go on

But the interrupt handling seems to be above and not here. Is read internally handling interrupts?


_src/test_c2s_srv.c, line 766 [r1] (raw file):_

Previously, pboothe (Peter Boothe) wrote… > I am preserving existing behavior, and I have no idea whether later pieces of code will reliably work okay if I change this IO to non-blocking. > > The only reason it's okay to change the other one to non-blocking is because the alternative is never calling `stop_packet_trace()` at all, and so we mitigate the risk of resource leaks by calling it in non-blocking mode as part of an obscure error pathway. > > I would rate changing the blocking/non-blocking status of this particular write as a risky change. Every time I have changed a piece of NDT code in a way that affects threading and/or IPC it has caused bugs, sometimes a lot and in code that is very distant, and that's what the blocking/non-blocking change affects.

I'm not sure if my note came across clearly. I'm not proposing that we change behavior here; I'm asking that we document why we've introduced inconsistency between the two stop_packet_trace calls.


_src/test_c2s_srv.c, line 181 [r2] (raw file):_

    activeStreams = connections_to_fd_set(c2s_conns, streamsNum, &rfd, &max_fd);
  }
}

I think this got overlooked because it was a top-level comment, so repeating it in code:

I felt like the commit description on 5bce8cd was the most helpful in giving context for this PR. Can we integrate that into the PR description? I'm not sure if you're planning to do that as part of squash & merge, but I feel like it's helpful to aim for the final PR description as early as possible so the reviewer's getting the same context that someone would get when they go back to read this (squashed) commit in the change history.


Comments from Reviewable

pboothe commented 8 years ago

Review status: 0 of 1 files reviewed at latest revision, 6 unresolved discussions.


_src/test_c2s_srv.c, line 533 [r1] (raw file):_

Previously, mtlynch (Michael Lynch) wrote… > I think we might be talking about different things. When I say that we should add single function calls instead of adding code blocks, I'm talking about something like changing this block to something like: > > ``` > int wait_for_packet_trace_start(packet_trace_pipe) { > ... > } > > ... > > int test_c2s(...) { > ... > packet_trace_running = wait_for_packet_trace_start(mon_pipe[0]); > ... > } > ``` > > This isn't a cleanup because we're not touching existing code. We're just adding code in a way that minimizes stress on an overtaxed function. > > The suggestion I marked as optional _does_ involve lightly touching existing code, but that's why I marked it as optional.

Made the helper function for this chunk, left the optional suggestion alone. Although later changes do actually get close to that goal.


_src/test_c2s_srv.c, line 699 [r1] (raw file):_

Previously, mtlynch (Michael Lynch) wrote… > I was talking about s/speed-chk/packet tracing/ but if it makes more sense to delete entirely, that's fine.

Done.


_src/test_c2s_srv.c, line 723 [r1] (raw file):_

Previously, mtlynch (Michael Lynch) wrote… > 704-708 were fine prior to this change (modulo magic numbers) but if we copy/paste them, we're introducing a problem. I see this as mess prevention rather than cleanup of existing code. > > Can we move 704-708 to within the `for` loop? That doesn't cause a net increase in LOC. > > If not, I think minimizing the bloat of `test_c2s` is more important than minimizing overall LOC of the file, so I think adding a function is worth it even if it causes the overall LOC to increase. > > That said, I don't understand the "spending 12 lines to avoid spending 4" calculation. We could refactor the 4 lines into a function that's 6-7 lines, then we reduce each 4-line block to a 1-line function call, so it seems like spending 7 lines to save 6. To me, this seems desirable considering that the 6 lines are coming out of a gigantic function.

Moved inside, which actually allowed the deletion of several lines at the end.


_src/test_c2s_srv.c, line 738 [r1] (raw file):_

Previously, mtlynch (Michael Lynch) wrote… > I'm confused then. It says: > > > If an interrupt occurs, just skip the read and go on > > But the interrupt handling seems to be above and not here. Is `read` internally handling interrupts?

No, instead if read() is interrupted it will return -1 and set errno to EINTR. I read this comment as an explanation for why it is okay to not have a continue or fancy flow control in this if (...) { block.

The error handling above this line deals with errors that happened during select() not any that may happen during read().


_src/test_c2s_srv.c, line 766 [r1] (raw file):_

Previously, mtlynch (Michael Lynch) wrote… > I'm not sure if my note came across clearly. I'm not proposing that we change behavior here; I'm asking that we document why we've introduced inconsistency between the two `stop_packet_trace` calls.

Cool. But the reason is "preserve existing behavior", which is a strange justification to put in a comment. A comment saying // This is like this because it used to be like this feels a bit like a sign warning about the sharp edges on the sign. I will do it if you really think it aids readability, but if I encountered such a comment I would not thank the author for writing it.


_src/test_c2s_srv.c, line 181 [r2] (raw file):_

Previously, mtlynch (Michael Lynch) wrote… > I think this got overlooked because it was a top-level comment, so repeating it in code: > > > I felt like the commit description on 5bce8cd was the most helpful in giving context for this PR. Can we integrate that into the PR description? I'm not sure if you're planning to do that as part of squash & merge, but I feel like it's helpful to aim for the final PR description as early as possible so the reviewer's getting the same context that someone would get when they go back to read this (squashed) commit in the change history.

Done.


Comments from Reviewable

mtlynch commented 8 years ago
:lgtm:
Previously, mtlynch (Michael Lynch) wrote… > I felt like the commit description on 5bce8cd was the most helpful in giving context for this PR. Can we integrate that into the PR description? I'm not sure if you're planning to do that as part of squash & merge, but I feel like it's helpful to aim for the final PR description as early as possible so the reviewer's getting the same context that someone would get when they go back to read this (squashed) commit in the change history.

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions.


_src/test_c2s_srv.c, line 766 [r1] (raw file):_

Previously, pboothe (Peter Boothe) wrote… > Cool. But the reason is "preserve existing behavior", which is a strange justification to put in a comment. A comment saying `// This is like this because it used to be like this` feels a bit like [a sign warning about the sharp edges on the sign](http://www.shutterstock.com/pic-55017349/stock-photo-stupid-sign-warning-sign-has-sharp-edges.html). I will do it if you really think it aids readability, but if I encountered such a comment I would not thank the author for writing it.

I feel like TODOs are good notices of "we know this is weird, but we haven't fixed it yet." Suggest:

// TODO: Refactor to use shutdown_packet_trace_non_blocking once we verify that
// blocking is not necessary.

_src/test_c2s_srv.c, line 198 [r3] (raw file):_

 * non-blocking. This should not affect any code because any shutdown messages
 * sent should be the last usage of the pipe anyway.
 * @param mon_pipe the pipe to send on which to send shutdown messages.

extra "to send"


_src/test_c2s_srv.c, line 200 [r3] (raw file):_

 * @param mon_pipe the pipe to send on which to send shutdown messages.
 */
void packet_trace_emergency_shutdown(int *mon_pipe) {

(optional) I have no idea what the mon in mon_pipe means. French developer? We have an opportunity to give it a clearer name here, but at the same time I see the value in using the same variable name throughout.


_src/test_c2s_srv.c, line 200 [r3] (raw file):_

 * @param mon_pipe the pipe to send on which to send shutdown messages.
 */
void packet_trace_emergency_shutdown(int *mon_pipe) {

Suggest shutdown_packet_trace_non_blocking. Emergency seems to assume intent rather than describe functionality. A caller might want to shut down packet tracing and just not block (emergency or not).


_src/test_c2s_srv.c, line 215 [r3] (raw file):_

/** Waits up to one second for the passed-in fd to become readable.
 * @param fd the file descriptor to wait for
 * @returns true if the fd is readable, false otherwise

(optional) Do we want to say "true" and "false"? It seems like the rest of the comments are in terms of 1 and 0.


_src/test_c2s_srv.c, line 223 [r3] (raw file):_

  FD_SET(fd, &rfd);
  memset(&sel_tv, 0, sizeof(sel_tv));
  sel_tv.tv_sec = 1;  // Wait for up to 1 second

(optional) I feel like we've talked about this at some point, and I don't recall if there's a deliberate reason not to do this in NDT, but the previous six lines could be reduced to:

fd_set rfd = {};
struct timeval sel_tv = {};
FD_SET(fd, &rfd);
sel_tv.tv_sec = 1;  // Wait for up to 1 second

Comments from Reviewable

pboothe commented 8 years ago

Review status: 0 of 1 files reviewed at latest revision, 6 unresolved discussions.


_src/test_c2s_srv.c, line 766 [r1] (raw file):_

Previously, mtlynch (Michael Lynch) wrote… > I feel like TODOs are good notices of "we know this is weird, but we haven't fixed it yet." Suggest: > > `````` c > // TODO: Refactor to use shutdown_packet_trace_non_blocking once we verify that > // blocking is not necessary. > ```

Done.


_src/test_c2s_srv.c, line 198 [r3] (raw file):_

Previously, mtlynch (Michael Lynch) wrote… > extra "to send"

Done.


_src/test_c2s_srv.c, line 200 [r3] (raw file):_

Previously, mtlynch (Michael Lynch) wrote… > (optional) I have no idea what the `mon` in `mon_pipe` means. French developer? We have an opportunity to give it a clearer name here, but at the same time I see the value in using the same variable name throughout.

monitoring pipe, I have always assumed. Keeping the name because the function is single-use.


_src/test_c2s_srv.c, line 200 [r3] (raw file):_

Previously, mtlynch (Michael Lynch) wrote… > Suggest `shutdown_packet_trace_non_blocking`. Emergency seems to assume intent rather than describe functionality. A caller might want to shut down packet tracing and just not block (emergency or not).

I like "emergency" because it allows the error path to be log_println(0, ...) instead of having to write a reasonable error path like one would do for a function that was not meant to be called as a last resort. If this function were meant to be called in the normal course of events, I would write it differently and it would have a different signature.


_src/test_c2s_srv.c, line 215 [r3] (raw file):_

Previously, mtlynch (Michael Lynch) wrote… > (optional) Do we want to say "true" and "false"? It seems like the rest of the comments are in terms of 1 and 0.

I did, because I return the result of a boolean statement, so I return a truth value, which is really "non-zero" and "zero" according to the C spec and I just don't want to even get into it so I said "true" and "false" to avoid the discussion of whether boolean operators are guaranteed to always return 1 for true.


_src/test_c2s_srv.c, line 223 [r3] (raw file):_

Previously, mtlynch (Michael Lynch) wrote… > (optional) I feel like we've talked about this at some point, and I don't recall if there's a deliberate reason not to do this in NDT, but the previous six lines could be reduced to: > > `````` c > fd_set rfd = {}; > struct timeval sel_tv = {}; > FD_SET(fd, &rfd); > sel_tv.tv_sec = 1; // Wait for up to 1 second > ```

Pedantry note:

fd_set rfd = {0};
struct timeval sel_tv = {0};
FD_SET(fd, &rfd);
sel_tv.tv_sec = 1;  // Wait for up to 1 second

Also: good call. Done.


Comments from Reviewable