Postmortem of fs.realpath() changes leading to userland breakage

rvagg commented 8 years ago

@nodejs/collaborators

I'm opening this issue so we can look critically at our processes and figure out how we might improve things to avoid things like this in the future.

I'm not totally around the issue but here's the high-level view that I'm seeing

https://github.com/nodejs/node-v0.x-archive/issues/7902 -> https://github.com/nodejs/node/issues/2680 fs.realpath() is identified as being significantly slower than realpath(3)
https://github.com/nodejs/node/pull/3594 introduces a change that cuts the JS implementation and replaces it with a new libuv implementation that directly uses realpath(3) this is deemed to be a breaking change but only because it replaces the cache argument with an options argument (both are Object)
https://github.com/nodejs/node/pull/3594#issuecomment-210342608 change is finally landed on the 15th of April
https://github.com/nodejs/node/pull/3594#issuecomment-213060070 citgm picks up glob failure on the 22nd of April
Discussion and problem dissection takes place, an attempt to fix glob is made @ https://github.com/isaacs/node-glob/pull/259 but is ultimately deemed by @isaacs to lead to too big a breaking change and is not accepted, although that happened well after v6 went out
v6.0.0 is released containing the change on the 27th if April
https://github.com/nodejs/citgm/pull/126 citgm is updated and glob is added as flaky, although the PR doesn't mention glob explicitly
Minor concerns are raised post-v6 in the original PR but no further movement is made and discussion ends
We get a string of Windows errors that @addaleax is identifying as being rooted in this issue, see https://github.com/nodejs/node/issues/7175#issuecomment-227210966
https://github.com/nodejs/node/issues/7175 @isaacs opens an issue with a detailed rundown of the problems its causing on the 6th of June, discussion ensues, decisive action is yet to be taken (may change once we discuss at CTC meeting today)

We didn't intend to break anything more than the cache option (and even then, the breakage was not your code won't run breakage, it just won't run quite how you expect unless you're doing something funky with the cache). We discovered that it broke glob's tests. The breakage of glob was marked as ignorable (flaky) and we proceeded with v6. Even though we've had a full postmortem of the problems by now and we've also been able to identify other breakages coming from this, we've still not acted.

Here's some questions to get us going, and let's try to leave discussion about the specific actions on this one to https://github.com/nodejs/node/issues/7175 and focus on process here and see if we can figure out:

Clarify the steps that occurred in the chain of actions—is there anything to change about what I've listed above?
Do we collectively think anything is broken in our processes
If we have process breakage, what can we do to improve?

My personal judgement on (2) is that we have handled this poorly and that something is broken and needs to be identified and fixed. This has delivered a poor experience for Node.js users and we should consider this a black-eye and something to avoid in the future, i.e. a mistake to learn from. I see the breakdown purely as process rather than about any action by individuals. I'm also concerned by the lack of decisiveness on taking action here, we've had v6 out for a couple of months now and we still haven't done anything on this.

The stdio issues are bear some similarity I think. In certain areas we're having difficulty acting decisively to address real user experience issues. One theme that I'm seeing, although not overt and and certainly not by everyone, is a preference for correctness, performance and purity over other concerns. I don't know if this is a real problem because it comes out of the diversity of our collaborator group, but it's possible that having this in the discussion mix is partly responsible for our difficulty in making decisive headway in dealing with problems like this. Maybe we need to more clearly build priorities into the culture we have around core.

When critiquing our actions we should put things in perspective, because overall I think we've done an amazing job since v4 to lift standards and earned an impressive amount of trust and respect from our users. Let's just strive to always do better.

bnoordhuis commented 8 years ago

The stdio issues are bear some similarity I think.

I don't think it does. The realpath thing is a clear v5->v6 regression with a fairly straightforward solution (revert.)

The stdio woes are different: there is no single good solution, plus it's a problem that goes back a long way (v0.6 and possibly even before that.)

rvagg commented 8 years ago

wrt stdio, I'm also including the non-flushed output on exit which was only introduced soon after v1 but we've let that persist even after having reports about it and it being an obvious problem for first-line postmortem debugging.

kzc commented 8 years ago

https://github.com/nodejs/node/pull/6773 fixes flushing stdio upon process.exit() on UNIX, be it tty or piped. Some may deem it to be inelegant or not in spirit with the philosophy of libuv but it's simple and does work as shown by this known-fail test being converted to a normal passing test. Windows does not appear to have this flushing stdio upon process.exit() issue - for piped stdio output at least.

Other approaches to resolve this longstanding bug are welcome, I just put forward that solution to show that it was indeed possible.

https://github.com/nodejs/node/pull/6773 doesn't address truncating stdio upon uncaught exceptions, but it's essentially the same issue and would require a similar fix.

isaacs commented 8 years ago

@bnoordhuis Not to speak for @rvagg, but I think the similarity is not that they're identical issues exactly, but they were similar in the sense of "We thought we understood the scope of a breaking change, but it turns out we did not, and this caused some community unrest which has been challenging to address to everyone's satisfaction."

Clearly, you're right, the technical root cause and the simplicity of the solution are very different in these two cases, so they may call for different fixes. But there's a similar process issue that should probably get addressed, apart from the specific technical details.

Rod, I appreciate you raising this issue and treating it as an opportunity to improve the CTC's processes moving forward. Thank you for this :)

ChALkeR commented 8 years ago

@bnoordhuis

The realpath thing is a clear v5->v6 regression with a fairly straightforward solution (revert.)

Simply reverting nodejs/node#3594 would be a semver-major change, as it will break the encoding option, so the fix (if any) would have to take that into an account.

kzc commented 8 years ago

"We thought we understood the scope of a breaking change, but it turns out we did not, and this caused some community unrest which has been challenging to address to everyone's satisfaction."

process.exit() truncating stdio piped output was a known bug before the node 6.0.0 release with an expected fail test for it:

https://github.com/nodejs/node/blame/master/test/known_issues/test-stdout-buffer-flush-on-exit.js

bnoordhuis commented 8 years ago

nodejs/node#6773 fixes flushing stdio upon process.exit() on UNIX

@kzc With the downside that node blocks indefinitely if the read end of the pipe isn't reading. We may deem that to be an acceptable trade-off but it will certainly trip up some users.

That's what I mean with no single good solution, all the proposed fixes have some drawbacks.

Clearly, you're right, the technical root cause and the simplicity of the solution are very different in these two cases, so they may call for different fixes. But there's a similar process issue that should probably get addressed, apart from the specific technical details.

@isaacs Why I don't think you can compare the two:

The realpath change introduced a clear but unintentional regression. There is a straightforward fix but we haven't applied it for $reasons.
The issues with stdio are well understood but exist in this uncomfortable twilight zone where one man's bug fix is another's regression. That's been going on for years, it's not new.

It's apples and oranges, IMO. If we're going to do some soul searching, it should be around issues where there is a clear institutional failure, not ones where we're stuck between a rock and a hard place.

kzc commented 8 years ago

With the downside that node blocks indefinitely if the read end of the pipe isn't reading. We may deem that to be an acceptable trade-off but it will certainly trip up some users.

@bnoordhuis How is that different than any other UNIX process with blocking stdout piped to a non-responsive process? When weighed against the present behavior of always truncating stdio output after 64KB upon process.exit() it's certainly acceptable behavior. I don't see any downside.

bnoordhuis commented 8 years ago

You're making a judgment call. You may think it's acceptable behavior but there will most certainly be people who disagree.

kzc commented 8 years ago

My judgement call is based on how most UNIX utilities function. It's expected behavior.

isaacs commented 8 years ago

@bnoordhuis Those are good points.

Would it be possible to redirect the discussion about the similarity to the reaction to fs.realpath changes and stdio changes to a separate thread? Regardless of the difference or similarity in the two cases, I think that @rvagg is bringing up a valid point that is in danger of being derailed.

I think if there's a conversation to be had about what can be learned from the stdio impact, that should maybe be a second discussion. Perhaps it will lead to some overlapping conclusions to this one, perhaps not, but they're probably both useful.

eljefedelrodeodeljefe commented 8 years ago

@kzc imho, that even is very opinionated. UNIX doesn't lend itself well for that kind of determinism. Also your points, imo, are not in scope of this thread.

kzc commented 8 years ago

@eljefedelrodeodeljefe I was just pointing out common UNIX practise. We're discussing why there is resistance to fix a fundamental stdio bug. No other computer language that I'm aware of truncates stdio output upon exit or uncaught exception. It goes against the principle of least astonishment.

eljefedelrodeodeljefe commented 8 years ago

@kzc again, stdio issues are not the main topic of this thread. It could have been left out in the description, but it hasn't, to point out a more managerial issue. The realpath issue is big enough, let's give it some room here.

I trust @bnoordhuis, @isaacs et al to generally be very practical (note this as opposed to philosophy) about this one issue and then move to the next one.

Thx.

kzc commented 8 years ago

This is not a matter of philosophy, it's a practical issue affecting real world programs including my own. It's one thing if there's a user-land stdio flush workaround, but in this case there is no alternative other than to maintain a fork of node.

I'll end my contribution to this thread by agreeing with @rvagg's statement: "One theme that I'm seeing, although not overt and and certainly not by everyone, is a preference for correctness, performance and purity over other concerns."

othiym23 commented 8 years ago

@rvagg, thanks for starting this discussion and for the clear layout of the sequence of events. I think there's a real opportunity here to improve both how the project deals with semver-major releases (and maybe LTS).

Here are things I would like to see:

Continue to improve citgm. citgm is still new, but it's already become the most effective technical tool to reassure everyone that large changes can be made to Node without grave consequences. It's still in need of improvement, though – it would have been extremely useful to know ahead of release that the realpath changes were going to so badly exacerbate npm/npm#5082, which created a bit of a crisis for the npm CLI team.
Integrate citgm into the release process as a release gate. citgm is only useful if you listen to what it's telling you. As @rvagg's writeup implies, we did ourselves no favors by deciding to mark the glob test failures as flaky and ship Node.js 6.0.0 with that change anyway. Building it up to the point where it makes sense to use as part of the go / no-go checklist for a release, and is treated as a semi-normative conformance suite for new releases (where changes to the tests are vetted in the same way that PRs are vetted for Node itself) would be a big step forward in maturity for Node's QA process.
Identify domain experts on the CTC and make sure they carefully review breaking changes. More expert eyes on both the realpath change and the module loader change might have caught that there were likely to be significant consequences for those changes (and in the case of the module loader change, might have identified the historical reasons for why symlink resolution of linked modules worked as it did).
Treat unexpected consequences of changes as regressions & quickly revert them. Many of the breaking changes in Node 6.0.0 brought unexpected consequences with them. In many of the discussions, the fact that real-world users were being gravely affected by these unexpected consequences got lost in the arguments over what the technically correct behavior should be. Node 6.0.0 effectively broke important pieces of ied, pnpm, and npm-install, and it took far too long (IMO) for everyone to realize that it was important to revert that and then try to come up with a solution to the problem that would work for everyone.
Institute a freeze on landing breaking changes to a semver-major release series for a set period of time before the release. Debian does something like this with its testing package pool in the run up to a new release, so the QA team has a chance to do some more thorough integration testing of the distribution before release. As a result, their major releases rarely have serious issues, and they have an impressively low number of hotfixes to ship. It's true that the Debian release process is also very sloooow, but I think that there's a practical middle ground, where anything that's known to be a breaking change is given some extra time to settle before an x.0.0 release is put out. I get that it's super tempting to try to slide these changes in under the wire, especially given the biannual semver-release deadlines, but one of the goals of the LTS process, and adopting semver for Node itself, was to prevent new major releases from being a messy process of jamming out a bunch of hotfix releases in short order. Wasn't it?
Bias semver-major versions that will become LTS releases towards backwards-compatibiliity. The process by which a version graduates from being a regular release series to the LTS release is poorly understood outside of Node core (or, at least, it was poorly understood within npm, where there are a fair number of reasonably proficient Node developers). I personally think it's totally OK to turn the crank on deprecations in an LTS release, and I also don't think it's reasonable to say that the project can only ship big changes once a year given the current pace of the project, but I do think giving extra scrutiny to breaking changes landed on a release series that's going to be LTS is a good idea. As this thread demonstrates, remedying the consequences of a breaking change gone bad is pretty difficult to begin with, and it doesn't help when there's the added complication of being obligated to support that change for several years in the mix.

Tl;dr: QA is good, more eyes on risky things is good, being deliberate is good. I feel like adding in even a few of the above would improve the quality of new major versions significantly, which is cool, because at least one of them is already happening.

MylesBorins commented 8 years ago

I really like the points brought up by @othiym23, so much so I'm going to respond in line.

Integrate citgm into the release process as a release gate. For the most part we currently treat it as such. Exceptions are made at times, such as with node-glob. This was something that was discussed between multiple collaborators before the release. While we could have been more conservative here I am not convinced this part of our process is broken.
Identify domain experts on the CTC and make sure they carefully review breaking changes. This is somewhere where we could have definitely done a better job. The problem is that many of the individuals with that domain knowledge are not as actively involved in the project (this is not a bad thing, just life).
Treat unexpected consequences of changes as regressions & quickly revert them. 1000x this. I think that we should be way less conservative with reverting things that are causing pain in userland. The biggest reason I can think of for not just reverting something would be semver. We can treat the unexpected behavior as a regression, but now we will be waiting another major release to get a specific change in. If we take that v6 will be LTS into account we end up in a situation where we want to avoid removing a potentially useful feature as long as possible before reverting to keep it in LTS.

We saw this exact situation happen with the require regressions. We changed the semantics of how symbolic links were treated in the internal module cache which broke a bunch of workflow and alternative package managers. We spent quite a bit of time trying to find a solution that allowed us to keep both behaviors but eventually reverted and added the new behavior behind a flag. This change involving symbolic links was completely tangential to the realpath discussion in this thread.

I'm not 100% what the solution is here, but perhaps there is a way we can lower the stakes of a revert, especially for pre lts releases.

Institute a freeze on landing breaking changes to a semver-major release series for a set period of time before the release.

this has been brought up quite a number of times... 1000% this. We just need to agree and document the set amount of time. Does one month sound reasonable?

Qard commented 8 years ago

Why don't we just revert, bump to 7.x and make 7.x the LTS candidate? Not sure why we feel we need to tie major version numbers to time or v8 releases...

isaacs commented 8 years ago

I really like the idea of having domain experts on (or even not on) the CTC who can be brought in to help get some perspective on the scope of a breaking change before it ships.

In this particular issue with fs.realpath, as @bnoordhuis pointed out, there's a pretty clear set of options: (1) accept the regression and do nothing, (2) revert and go back to a JS impl, (3) use the C impl and fall back to JS impl in known edge cases, (4) put the C impl in a different API or some other thing.

In more complicated examples, like the module or stdio change, it's very hard to have that discussion after the fact, because the breaking change is a fait accompli at that point. As soon as it's released someone probably relies on it, and the best we can do is put it behind a flag, which doesn't feel great. Also, at that point it's hard to dive back in and figure out what problem it was supposed to solve, and come up with alternative approaches.

It'd be good if this discussion could happen earlier, so that stuff like the proposals in https://github.com/isaacs/node6-module-system-change could be considered when they might be useful. Then instead of figuring out who gets marginalized, we're figuring out a solution that doesn't marginalize anyone.

benjamingr commented 8 years ago

@isaacs I was under the impression that experts reviewing problems in their domain was already a part of the review process.

I think that the problem is there is no baseline for how much review a pull request gets. Maybe we need to have a gated process where working groups review all pull requests under their responsibility.

trevnorris commented 8 years ago

Treat unexpected consequences of changes as regressions & quickly revert them.

I think this should be within reason. I can see "OMG SOMETHING BROKE, REVERT!" reactions messing with the code base. I have introduced regressions in the past, but have been able to land a fix in < 48 hours. There should be some sort of time limit on whether the issue can be fixed or not.

isaacs commented 8 years ago

@trevnorris I 100% agree. Maybe the word "revert" here is delving into implementation details prematurely? Maybe this is more good?

Treat unexpected consequences of changes as regressions & quickly address them, by reverting the commit if a straightforward fix is not available.

In cases where it's going to take 2 weeks to get a fix, a "revert and try again" is probably best. If it's a very tiny and obvious thing, then a move-forward fix is probably better than a revert, but that option should always be on the table, and we shouldn't see it as a "failure". (Except insofar as it may be an opportunity to make the error-catching process more robust.)

Trott commented 8 years ago

I'm going to close this. Feel free to re-open or comment if you believe there is more to say or more to do on this at this time.

Fishrock123 commented 8 years ago

I will be doing some of this at my upcoming talk at NodeConf.eu... of course, I'm more familiar with the stdio issues I will also be talking about so I may miss some context that I wasn't involved in.

nodejs / CTC

Postmortem of fs.realpath() changes leading to userland breakage #9