Closed rvagg closed 8 years ago
The stdio issues are bear some similarity I think.
I don't think it does. The realpath thing is a clear v5->v6 regression with a fairly straightforward solution (revert.)
The stdio woes are different: there is no single good solution, plus it's a problem that goes back a long way (v0.6 and possibly even before that.)
wrt stdio, I'm also including the non-flushed output on exit which was only introduced soon after v1 but we've let that persist even after having reports about it and it being an obvious problem for first-line postmortem debugging.
https://github.com/nodejs/node/pull/6773 fixes flushing stdio upon process.exit() on UNIX, be it tty or piped. Some may deem it to be inelegant or not in spirit with the philosophy of libuv but it's simple and does work as shown by this known-fail test being converted to a normal passing test. Windows does not appear to have this flushing stdio upon process.exit() issue - for piped stdio output at least.
Other approaches to resolve this longstanding bug are welcome, I just put forward that solution to show that it was indeed possible.
https://github.com/nodejs/node/pull/6773 doesn't address truncating stdio upon uncaught exceptions, but it's essentially the same issue and would require a similar fix.
@bnoordhuis Not to speak for @rvagg, but I think the similarity is not that they're identical issues exactly, but they were similar in the sense of "We thought we understood the scope of a breaking change, but it turns out we did not, and this caused some community unrest which has been challenging to address to everyone's satisfaction."
Clearly, you're right, the technical root cause and the simplicity of the solution are very different in these two cases, so they may call for different fixes. But there's a similar process issue that should probably get addressed, apart from the specific technical details.
Rod, I appreciate you raising this issue and treating it as an opportunity to improve the CTC's processes moving forward. Thank you for this :)
@bnoordhuis
The realpath thing is a clear v5->v6 regression with a fairly straightforward solution (revert.)
Simply reverting nodejs/node#3594 would be a semver-major change, as it will break the encoding option, so the fix (if any) would have to take that into an account.
"We thought we understood the scope of a breaking change, but it turns out we did not, and this caused some community unrest which has been challenging to address to everyone's satisfaction."
process.exit()
truncating stdio piped output was a known bug before the node 6.0.0 release with an expected fail test for it:
https://github.com/nodejs/node/blame/master/test/known_issues/test-stdout-buffer-flush-on-exit.js
nodejs/node#6773 fixes flushing stdio upon process.exit() on UNIX
@kzc With the downside that node blocks indefinitely if the read end of the pipe isn't reading. We may deem that to be an acceptable trade-off but it will certainly trip up some users.
That's what I mean with no single good solution, all the proposed fixes have some drawbacks.
Clearly, you're right, the technical root cause and the simplicity of the solution are very different in these two cases, so they may call for different fixes. But there's a similar process issue that should probably get addressed, apart from the specific technical details.
@isaacs Why I don't think you can compare the two:
It's apples and oranges, IMO. If we're going to do some soul searching, it should be around issues where there is a clear institutional failure, not ones where we're stuck between a rock and a hard place.
With the downside that node blocks indefinitely if the read end of the pipe isn't reading. We may deem that to be an acceptable trade-off but it will certainly trip up some users.
@bnoordhuis How is that different than any other UNIX process with blocking stdout piped to a non-responsive process? When weighed against the present behavior of always truncating stdio output after 64KB upon process.exit()
it's certainly acceptable behavior. I don't see any downside.
You're making a judgment call. You may think it's acceptable behavior but there will most certainly be people who disagree.
My judgement call is based on how most UNIX utilities function. It's expected behavior.
@bnoordhuis Those are good points.
Would it be possible to redirect the discussion about the similarity to the reaction to fs.realpath changes and stdio changes to a separate thread? Regardless of the difference or similarity in the two cases, I think that @rvagg is bringing up a valid point that is in danger of being derailed.
I think if there's a conversation to be had about what can be learned from the stdio impact, that should maybe be a second discussion. Perhaps it will lead to some overlapping conclusions to this one, perhaps not, but they're probably both useful.
@kzc imho, that even is very opinionated. UNIX doesn't lend itself well for that kind of determinism. Also your points, imo, are not in scope of this thread.
@eljefedelrodeodeljefe I was just pointing out common UNIX practise. We're discussing why there is resistance to fix a fundamental stdio bug. No other computer language that I'm aware of truncates stdio output upon exit or uncaught exception. It goes against the principle of least astonishment.
@kzc again, stdio issues are not the main topic of this thread. It could have been left out in the description, but it hasn't, to point out a more managerial issue. The realpath issue is big enough, let's give it some room here.
I trust @bnoordhuis, @isaacs et al to generally be very practical (note this as opposed to philosophy) about this one issue and then move to the next one.
Thx.
This is not a matter of philosophy, it's a practical issue affecting real world programs including my own. It's one thing if there's a user-land stdio flush workaround, but in this case there is no alternative other than to maintain a fork of node.
I'll end my contribution to this thread by agreeing with @rvagg's statement: "One theme that I'm seeing, although not overt and and certainly not by everyone, is a preference for correctness, performance and purity over other concerns."
@rvagg, thanks for starting this discussion and for the clear layout of the sequence of events. I think there's a real opportunity here to improve both how the project deals with semver-major releases (and maybe LTS).
Here are things I would like to see:
citgm
. citgm
is still new, but it's already become the most effective technical tool to reassure everyone that large changes can be made to Node without grave consequences. It's still in need of improvement, though – it would have been extremely useful to know ahead of release that the realpath
changes were going to so badly exacerbate npm/npm#5082, which created a bit of a crisis for the npm CLI team.citgm
into the release process as a release gate. citgm
is only useful if you listen to what it's telling you. As @rvagg's writeup implies, we did ourselves no favors by deciding to mark the glob
test failures as flaky and ship Node.js 6.0.0 with that change anyway. Building it up to the point where it makes sense to use as part of the go / no-go checklist for a release, and is treated as a semi-normative conformance suite for new releases (where changes to the tests are vetted in the same way that PRs are vetted for Node itself) would be a big step forward in maturity for Node's QA process.realpath
change and the module loader change might have caught that there were likely to be significant consequences for those changes (and in the case of the module loader change, might have identified the historical reasons for why symlink resolution of linked modules worked as it did).ied
, pnpm
, and npm-install
, and it took far too long (IMO) for everyone to realize that it was important to revert that and then try to come up with a solution to the problem that would work for everyone.testing
package pool in the run up to a new release, so the QA team has a chance to do some more thorough integration testing of the distribution before release. As a result, their major releases rarely have serious issues, and they have an impressively low number of hotfixes to ship. It's true that the Debian release process is also very sloooow, but I think that there's a practical middle ground, where anything that's known to be a breaking change is given some extra time to settle before an x.0.0 release is put out. I get that it's super tempting to try to slide these changes in under the wire, especially given the biannual semver-release deadlines, but one of the goals of the LTS process, and adopting semver for Node itself, was to prevent new major releases from being a messy process of jamming out a bunch of hotfix releases in short order. Wasn't it?Tl;dr: QA is good, more eyes on risky things is good, being deliberate is good. I feel like adding in even a few of the above would improve the quality of new major versions significantly, which is cool, because at least one of them is already happening.
I really like the points brought up by @othiym23, so much so I'm going to respond in line.
citgm
into the release process as a release gate.
For the most part we currently treat it as such. Exceptions are made at times, such as with node-glob. This was something that was discussed between multiple collaborators before the release. While we could have been more conservative here I am not convinced this part of our process is broken. We saw this exact situation happen with the require
regressions. We changed the semantics of how symbolic links were treated in the internal module cache which broke a bunch of workflow and alternative package managers. We spent quite a bit of time trying to find a solution that allowed us to keep both behaviors but eventually reverted and added the new behavior behind a flag. This change involving symbolic links was completely tangential to the realpath discussion in this thread.
I'm not 100% what the solution is here, but perhaps there is a way we can lower the stakes of a revert, especially for pre lts releases.
this has been brought up quite a number of times... 1000% this. We just need to agree and document the set amount of time. Does one month sound reasonable?
Why don't we just revert, bump to 7.x and make 7.x the LTS candidate? Not sure why we feel we need to tie major version numbers to time or v8 releases...
I really like the idea of having domain experts on (or even not on) the CTC who can be brought in to help get some perspective on the scope of a breaking change before it ships.
In this particular issue with fs.realpath, as @bnoordhuis pointed out, there's a pretty clear set of options: (1) accept the regression and do nothing, (2) revert and go back to a JS impl, (3) use the C impl and fall back to JS impl in known edge cases, (4) put the C impl in a different API or some other thing.
In more complicated examples, like the module or stdio change, it's very hard to have that discussion after the fact, because the breaking change is a fait accompli at that point. As soon as it's released someone probably relies on it, and the best we can do is put it behind a flag, which doesn't feel great. Also, at that point it's hard to dive back in and figure out what problem it was supposed to solve, and come up with alternative approaches.
It'd be good if this discussion could happen earlier, so that stuff like the proposals in https://github.com/isaacs/node6-module-system-change could be considered when they might be useful. Then instead of figuring out who gets marginalized, we're figuring out a solution that doesn't marginalize anyone.
@isaacs I was under the impression that experts reviewing problems in their domain was already a part of the review process.
I think that the problem is there is no baseline for how much review a pull request gets. Maybe we need to have a gated process where working groups review all pull requests under their responsibility.
Treat unexpected consequences of changes as regressions & quickly revert them.
I think this should be within reason. I can see "OMG SOMETHING BROKE, REVERT!" reactions messing with the code base. I have introduced regressions in the past, but have been able to land a fix in < 48 hours. There should be some sort of time limit on whether the issue can be fixed or not.
@trevnorris I 100% agree. Maybe the word "revert" here is delving into implementation details prematurely? Maybe this is more good?
Treat unexpected consequences of changes as regressions & quickly address them, by reverting the commit if a straightforward fix is not available.
In cases where it's going to take 2 weeks to get a fix, a "revert and try again" is probably best. If it's a very tiny and obvious thing, then a move-forward fix is probably better than a revert, but that option should always be on the table, and we shouldn't see it as a "failure". (Except insofar as it may be an opportunity to make the error-catching process more robust.)
I'm going to close this. Feel free to re-open or comment if you believe there is more to say or more to do on this at this time.
I will be doing some of this at my upcoming talk at NodeConf.eu... of course, I'm more familiar with the stdio issues I will also be talking about so I may miss some context that I wasn't involved in.
@nodejs/collaborators
I'm opening this issue so we can look critically at our processes and figure out how we might improve things to avoid things like this in the future.
I'm not totally around the issue but here's the high-level view that I'm seeing
fs.realpath()
is identified as being significantly slower thanrealpath(3)
realpath(3)
this is deemed to be a breaking change but only because it replaces thecache
argument with anoptions
argument (both areObject
)We didn't intend to break anything more than the
cache
option (and even then, the breakage was not your code won't run breakage, it just won't run quite how you expect unless you're doing something funky with the cache). We discovered that it broke glob's tests. The breakage of glob was marked as ignorable (flaky) and we proceeded with v6. Even though we've had a full postmortem of the problems by now and we've also been able to identify other breakages coming from this, we've still not acted.Here's some questions to get us going, and let's try to leave discussion about the specific actions on this one to https://github.com/nodejs/node/issues/7175 and focus on process here and see if we can figure out:
My personal judgement on (2) is that we have handled this poorly and that something is broken and needs to be identified and fixed. This has delivered a poor experience for Node.js users and we should consider this a black-eye and something to avoid in the future, i.e. a mistake to learn from. I see the breakdown purely as process rather than about any action by individuals. I'm also concerned by the lack of decisiveness on taking action here, we've had v6 out for a couple of months now and we still haven't done anything on this.
The stdio issues are bear some similarity I think. In certain areas we're having difficulty acting decisively to address real user experience issues. One theme that I'm seeing, although not overt and and certainly not by everyone, is a preference for correctness, performance and purity over other concerns. I don't know if this is a real problem because it comes out of the diversity of our collaborator group, but it's possible that having this in the discussion mix is partly responsible for our difficulty in making decisive headway in dealing with problems like this. Maybe we need to more clearly build priorities into the culture we have around core.
When critiquing our actions we should put things in perspective, because overall I think we've done an amazing job since v4 to lift standards and earned an impressive amount of trust and respect from our users. Let's just strive to always do better.