nodejs / CTC

Node.js Core Technical Committee & Collaborators
80 stars 27 forks source link

Postmortem of fs.realpath() changes leading to userland breakage #9

Closed rvagg closed 8 years ago

rvagg commented 8 years ago

@nodejs/collaborators

I'm opening this issue so we can look critically at our processes and figure out how we might improve things to avoid things like this in the future.

I'm not totally around the issue but here's the high-level view that I'm seeing

  1. https://github.com/nodejs/node-v0.x-archive/issues/7902 -> https://github.com/nodejs/node/issues/2680 fs.realpath() is identified as being significantly slower than realpath(3)
  2. https://github.com/nodejs/node/pull/3594 introduces a change that cuts the JS implementation and replaces it with a new libuv implementation that directly uses realpath(3) this is deemed to be a breaking change but only because it replaces the cache argument with an options argument (both are Object)
  3. https://github.com/nodejs/node/pull/3594#issuecomment-210342608 change is finally landed on the 15th of April
  4. https://github.com/nodejs/node/pull/3594#issuecomment-213060070 citgm picks up glob failure on the 22nd of April
  5. Discussion and problem dissection takes place, an attempt to fix glob is made @ https://github.com/isaacs/node-glob/pull/259 but is ultimately deemed by @isaacs to lead to too big a breaking change and is not accepted, although that happened well after v6 went out
  6. v6.0.0 is released containing the change on the 27th if April
  7. https://github.com/nodejs/citgm/pull/126 citgm is updated and glob is added as flaky, although the PR doesn't mention glob explicitly
  8. Minor concerns are raised post-v6 in the original PR but no further movement is made and discussion ends
  9. We get a string of Windows errors that @addaleax is identifying as being rooted in this issue, see https://github.com/nodejs/node/issues/7175#issuecomment-227210966
  10. https://github.com/nodejs/node/issues/7175 @isaacs opens an issue with a detailed rundown of the problems its causing on the 6th of June, discussion ensues, decisive action is yet to be taken (may change once we discuss at CTC meeting today)

We didn't intend to break anything more than the cache option (and even then, the breakage was not your code won't run breakage, it just won't run quite how you expect unless you're doing something funky with the cache). We discovered that it broke glob's tests. The breakage of glob was marked as ignorable (flaky) and we proceeded with v6. Even though we've had a full postmortem of the problems by now and we've also been able to identify other breakages coming from this, we've still not acted.

Here's some questions to get us going, and let's try to leave discussion about the specific actions on this one to https://github.com/nodejs/node/issues/7175 and focus on process here and see if we can figure out:

  1. Clarify the steps that occurred in the chain of actions—is there anything to change about what I've listed above?
  2. Do we collectively think anything is broken in our processes
  3. If we have process breakage, what can we do to improve?

My personal judgement on (2) is that we have handled this poorly and that something is broken and needs to be identified and fixed. This has delivered a poor experience for Node.js users and we should consider this a black-eye and something to avoid in the future, i.e. a mistake to learn from. I see the breakdown purely as process rather than about any action by individuals. I'm also concerned by the lack of decisiveness on taking action here, we've had v6 out for a couple of months now and we still haven't done anything on this.

The stdio issues are bear some similarity I think. In certain areas we're having difficulty acting decisively to address real user experience issues. One theme that I'm seeing, although not overt and and certainly not by everyone, is a preference for correctness, performance and purity over other concerns. I don't know if this is a real problem because it comes out of the diversity of our collaborator group, but it's possible that having this in the discussion mix is partly responsible for our difficulty in making decisive headway in dealing with problems like this. Maybe we need to more clearly build priorities into the culture we have around core.

When critiquing our actions we should put things in perspective, because overall I think we've done an amazing job since v4 to lift standards and earned an impressive amount of trust and respect from our users. Let's just strive to always do better.

bnoordhuis commented 8 years ago

The stdio issues are bear some similarity I think.

I don't think it does. The realpath thing is a clear v5->v6 regression with a fairly straightforward solution (revert.)

The stdio woes are different: there is no single good solution, plus it's a problem that goes back a long way (v0.6 and possibly even before that.)

rvagg commented 8 years ago

wrt stdio, I'm also including the non-flushed output on exit which was only introduced soon after v1 but we've let that persist even after having reports about it and it being an obvious problem for first-line postmortem debugging.

kzc commented 8 years ago

https://github.com/nodejs/node/pull/6773 fixes flushing stdio upon process.exit() on UNIX, be it tty or piped. Some may deem it to be inelegant or not in spirit with the philosophy of libuv but it's simple and does work as shown by this known-fail test being converted to a normal passing test. Windows does not appear to have this flushing stdio upon process.exit() issue - for piped stdio output at least.

Other approaches to resolve this longstanding bug are welcome, I just put forward that solution to show that it was indeed possible.

https://github.com/nodejs/node/pull/6773 doesn't address truncating stdio upon uncaught exceptions, but it's essentially the same issue and would require a similar fix.

isaacs commented 8 years ago

@bnoordhuis Not to speak for @rvagg, but I think the similarity is not that they're identical issues exactly, but they were similar in the sense of "We thought we understood the scope of a breaking change, but it turns out we did not, and this caused some community unrest which has been challenging to address to everyone's satisfaction."

Clearly, you're right, the technical root cause and the simplicity of the solution are very different in these two cases, so they may call for different fixes. But there's a similar process issue that should probably get addressed, apart from the specific technical details.

Rod, I appreciate you raising this issue and treating it as an opportunity to improve the CTC's processes moving forward. Thank you for this :)

ChALkeR commented 8 years ago

@bnoordhuis

The realpath thing is a clear v5->v6 regression with a fairly straightforward solution (revert.)

Simply reverting nodejs/node#3594 would be a semver-major change, as it will break the encoding option, so the fix (if any) would have to take that into an account.

kzc commented 8 years ago

"We thought we understood the scope of a breaking change, but it turns out we did not, and this caused some community unrest which has been challenging to address to everyone's satisfaction."

process.exit() truncating stdio piped output was a known bug before the node 6.0.0 release with an expected fail test for it:

https://github.com/nodejs/node/blame/master/test/known_issues/test-stdout-buffer-flush-on-exit.js

bnoordhuis commented 8 years ago

nodejs/node#6773 fixes flushing stdio upon process.exit() on UNIX

@kzc With the downside that node blocks indefinitely if the read end of the pipe isn't reading. We may deem that to be an acceptable trade-off but it will certainly trip up some users.

That's what I mean with no single good solution, all the proposed fixes have some drawbacks.

Clearly, you're right, the technical root cause and the simplicity of the solution are very different in these two cases, so they may call for different fixes. But there's a similar process issue that should probably get addressed, apart from the specific technical details.

@isaacs Why I don't think you can compare the two:

  1. The realpath change introduced a clear but unintentional regression. There is a straightforward fix but we haven't applied it for $reasons.
  2. The issues with stdio are well understood but exist in this uncomfortable twilight zone where one man's bug fix is another's regression. That's been going on for years, it's not new.

It's apples and oranges, IMO. If we're going to do some soul searching, it should be around issues where there is a clear institutional failure, not ones where we're stuck between a rock and a hard place.

kzc commented 8 years ago

With the downside that node blocks indefinitely if the read end of the pipe isn't reading. We may deem that to be an acceptable trade-off but it will certainly trip up some users.

@bnoordhuis How is that different than any other UNIX process with blocking stdout piped to a non-responsive process? When weighed against the present behavior of always truncating stdio output after 64KB upon process.exit() it's certainly acceptable behavior. I don't see any downside.

bnoordhuis commented 8 years ago

You're making a judgment call. You may think it's acceptable behavior but there will most certainly be people who disagree.

kzc commented 8 years ago

My judgement call is based on how most UNIX utilities function. It's expected behavior.

isaacs commented 8 years ago

@bnoordhuis Those are good points.

Would it be possible to redirect the discussion about the similarity to the reaction to fs.realpath changes and stdio changes to a separate thread? Regardless of the difference or similarity in the two cases, I think that @rvagg is bringing up a valid point that is in danger of being derailed.

I think if there's a conversation to be had about what can be learned from the stdio impact, that should maybe be a second discussion. Perhaps it will lead to some overlapping conclusions to this one, perhaps not, but they're probably both useful.

eljefedelrodeodeljefe commented 8 years ago

@kzc imho, that even is very opinionated. UNIX doesn't lend itself well for that kind of determinism. Also your points, imo, are not in scope of this thread.

kzc commented 8 years ago

@eljefedelrodeodeljefe I was just pointing out common UNIX practise. We're discussing why there is resistance to fix a fundamental stdio bug. No other computer language that I'm aware of truncates stdio output upon exit or uncaught exception. It goes against the principle of least astonishment.

eljefedelrodeodeljefe commented 8 years ago

@kzc again, stdio issues are not the main topic of this thread. It could have been left out in the description, but it hasn't, to point out a more managerial issue. The realpath issue is big enough, let's give it some room here.

I trust @bnoordhuis, @isaacs et al to generally be very practical (note this as opposed to philosophy) about this one issue and then move to the next one.

Thx.

kzc commented 8 years ago

This is not a matter of philosophy, it's a practical issue affecting real world programs including my own. It's one thing if there's a user-land stdio flush workaround, but in this case there is no alternative other than to maintain a fork of node.

I'll end my contribution to this thread by agreeing with @rvagg's statement: "One theme that I'm seeing, although not overt and and certainly not by everyone, is a preference for correctness, performance and purity over other concerns."

othiym23 commented 8 years ago

@rvagg, thanks for starting this discussion and for the clear layout of the sequence of events. I think there's a real opportunity here to improve both how the project deals with semver-major releases (and maybe LTS).

Here are things I would like to see:

Tl;dr: QA is good, more eyes on risky things is good, being deliberate is good. I feel like adding in even a few of the above would improve the quality of new major versions significantly, which is cool, because at least one of them is already happening.

MylesBorins commented 8 years ago

I really like the points brought up by @othiym23, so much so I'm going to respond in line.

We saw this exact situation happen with the require regressions. We changed the semantics of how symbolic links were treated in the internal module cache which broke a bunch of workflow and alternative package managers. We spent quite a bit of time trying to find a solution that allowed us to keep both behaviors but eventually reverted and added the new behavior behind a flag. This change involving symbolic links was completely tangential to the realpath discussion in this thread.

I'm not 100% what the solution is here, but perhaps there is a way we can lower the stakes of a revert, especially for pre lts releases.

this has been brought up quite a number of times... 1000% this. We just need to agree and document the set amount of time. Does one month sound reasonable?

Qard commented 8 years ago

Why don't we just revert, bump to 7.x and make 7.x the LTS candidate? Not sure why we feel we need to tie major version numbers to time or v8 releases...

isaacs commented 8 years ago

I really like the idea of having domain experts on (or even not on) the CTC who can be brought in to help get some perspective on the scope of a breaking change before it ships.

In this particular issue with fs.realpath, as @bnoordhuis pointed out, there's a pretty clear set of options: (1) accept the regression and do nothing, (2) revert and go back to a JS impl, (3) use the C impl and fall back to JS impl in known edge cases, (4) put the C impl in a different API or some other thing.

In more complicated examples, like the module or stdio change, it's very hard to have that discussion after the fact, because the breaking change is a fait accompli at that point. As soon as it's released someone probably relies on it, and the best we can do is put it behind a flag, which doesn't feel great. Also, at that point it's hard to dive back in and figure out what problem it was supposed to solve, and come up with alternative approaches.

It'd be good if this discussion could happen earlier, so that stuff like the proposals in https://github.com/isaacs/node6-module-system-change could be considered when they might be useful. Then instead of figuring out who gets marginalized, we're figuring out a solution that doesn't marginalize anyone.

benjamingr commented 8 years ago

@isaacs I was under the impression that experts reviewing problems in their domain was already a part of the review process.

I think that the problem is there is no baseline for how much review a pull request gets. Maybe we need to have a gated process where working groups review all pull requests under their responsibility.

trevnorris commented 8 years ago

Treat unexpected consequences of changes as regressions & quickly revert them.

I think this should be within reason. I can see "OMG SOMETHING BROKE, REVERT!" reactions messing with the code base. I have introduced regressions in the past, but have been able to land a fix in < 48 hours. There should be some sort of time limit on whether the issue can be fixed or not.

isaacs commented 8 years ago

@trevnorris I 100% agree. Maybe the word "revert" here is delving into implementation details prematurely? Maybe this is more good?

Treat unexpected consequences of changes as regressions & quickly address them, by reverting the commit if a straightforward fix is not available.

In cases where it's going to take 2 weeks to get a fix, a "revert and try again" is probably best. If it's a very tiny and obvious thing, then a move-forward fix is probably better than a revert, but that option should always be on the table, and we shouldn't see it as a "failure". (Except insofar as it may be an opportunity to make the error-catching process more robust.)

Trott commented 8 years ago

I'm going to close this. Feel free to re-open or comment if you believe there is more to say or more to do on this at this time.

Fishrock123 commented 8 years ago

I will be doing some of this at my upcoming talk at NodeConf.eu... of course, I'm more familiar with the stdio issues I will also be talking about so I may miss some context that I wasn't involved in.