nodejs / TSC

The Node.js Technical Steering Committee
595 stars 133 forks source link

How to handle "experimental" status when adoption is significant? #1397

Open Qard opened 1 year ago

Qard commented 1 year ago

I'm posting this here as I feel this needs TSC deliberation on the stance of the project.

We've had many things over the years stuck in "experimental" status limbo for years as things are still not ready or sometimes because they might never be ready. This is the case with async_hooks which no one wants to risk breaking because it will break the majority of users. (It's my personal goal to get abstractions or other APIs to a state where we can mark async_hooks itself as internal.) The lingering "experimental" status of some things is part of the issue I want to discuss, but moreso the disjoint between adoption and stability.

In Node.js 20 ESM loaders were moved off-thread with no flagging and no deprecation cycle around the ways existing users were using it. This broke ts-node and import-in-the-middle which is the only way for APM products to support ESM. As ESM loaders are an “experimental” feature this is technically allowed, however Node.js has historically generally approached significant changes to experimental features with broader adoption with much more care. Experimental status stages were introduced awhile ago to somewhat reflect differing stages of "experimental" status and I would have expected loaders to be marked as 1.2 at this point as it has been largely unchanged from a user perspective since v16. It also reads a bit weird to me though that 1.2 implies very nearly stable status but then the top-level description implies anything could be changed or removed at any time. Very confusing. 🤔

Anyway, I was rather surprised to see the off-thread loaders change not reverted immediately when it was discovered that it broke ts-node and import-in-the-middle. The change would have been fine if it spent some time behind a flag so users could try it out and make the necessary changes for their code to support it but this substantial refactor was released unflagged and due to being deemed ready on the day of code freeze for Node.js 20, it left zero time for the community to react. I'm sure some quality work went into the design, which I've seen on occasion when I have had the time to poke my head into the loaders code. But I feel there should have been more consideration of user impact with some effort to land the change in a way which would have been less disruptive.

As a widely adopted runtime that has in recent years been pushing hard on stability and performance to make Node.js more appealing and accommodating of enterprise users, this felt like a major step backward to me. This is the level of user consideration I would expect of the pre-1.0 days, not v20. I don't want to shame the contributors involved as I feel their work is very valuable and I would like to see loaders done, but I feel the TSC should probably be more involved in guiding larger efforts like this to ensure transitions are smoother. It's much better to take our time and focus on stable progress as it's much easier to add code to the project than it is to remove it after it's already been released. It's also not a good look when you announce a shiny new major release, especially on a line that will become LTS, and people go to try it out only to find several of the most popular tools are broken. 😅

Anyway, I hope we can better define how to handle future changes to still in-development APIs which already have user adoption so we can be a bit less disruptive in the future. As someone that writes code deployed in many, many enterprises, they get pretty cranky when I break their apps. 🙈

mcollina commented 1 year ago

I have a few sparse thoughts on this.

Loaders were very noisy and continuously emitted a runtime warning. Users silenced it anyway to support their features and therefore "acknowledged" this would have broken things in the future. Ultimately if users want stability, they should not have used any of the loaders' functionality and, therefore, not migrated to ESM if this feature was needed. The message from the Node.js project is unambiguous, so I'm unsure what needs to be clarified.

I have personally been vocal with a "hold" message on the ESM migration because I think it's not ready yet; however, others had different agendas. If loaders were a critical functionality for them, they should not have migrated to ESM.

Node.js v20 is not LTS yet, so there is no urgency for the community to update, and there are quite a few months ahead before October. I don't see any sense of urgency here, as it's stated everywhere that most users should use LTS releases - at the time of this writing, these lines as v16.x and v18.x. For example, Fastify has yet to be able to upgrade, with the change hopefully landing soon.

If we consider ts-node and import-in-the-middle critical, they should be added to CIGTM to prevent them from breaking between releases or communicating to the maintainers that they were due for an update.

Qard commented 1 year ago

I feel loaders was a bit of a weird situation because while loaders are experimental, ESM itself is specifically marked as stable so many users started using ESM but then upon being ready to deploy reached for an APM to support it and became reliant on this experimental feature as there's no alternative.

As an APM vendor dev, we have actively discouraged users from using ESM and asked them to transpile to CJS but many are refusing and we can't just turn away paying customers so we need to do whatever we can to support them. This meant having a clearly marked beta support for ESM, which OpenTelemetry also did, and that worked okay for awhile. But then that broke with Node.js v20. We were fortunately able to come up with a fix relatively quickly, but that added a bunch of unplanned work to our quarter to react to the breakage.

The problem here is that many users are just going to ignore our warnings and use things anyway, or in this particular case could even see the "stable" status on the ESM docs page and think ESM is adequately supported for typical application needs when it's actually not. We've had several ESM-using customers ask why our support for it is not GA yet, so they clearly are not understanding that this is not a stable capability in Node.js core.

If it's the stance of the TSC that people should not be using this that's fine, but then I feel there should be a task to ensure our communication around what is and is not "stable" about ESM needs to be improved.

GeoffreyBooth commented 1 year ago

while loaders are experimental, ESM itself is specifically marked as stable

CommonJS is stable too but the monkey-patching that APM vendors do to make instrumentation work there is completely undocumented and unsupported, not even experimental. It just doesn’t change very often, so vendors have gotten away with depending on it without a significant resource allocation to respond to breaking changes. The Loaders API is the first attempt at a blessed API for both module systems to provide customization hooks for module lifecycle events, that instrumentation vendors can use with confidence once it’s marked as stable, which I expect to happen in the coming months. So things will get better soon.

I understand that Node’s relationship with APM vendors is symbiotic, that we wouldn’t succeed without them and vice versa, but ultimately what this request is asking for is for APMs to be able to do their jobs more cheaply. Change Node a little slower, so that vendors have longer timelines for adaptation. Or change experimental APIs much faster, so that they become stable before becoming widely adopted. I sympathize with those sentiments, but ultimately Node is developed by volunteers and features ship according to the timetable of when those volunteers have time to put in the effort, which means that sometimes things become popular before they’re finished. It’s unfortunate, but that’s all that Node can do with the staffing that it has. APM vendors can influence the pace of progress by contributing resources to Node’s development, but that would require investment on those vendors’ part.

And to echo Matteo’s point, Node 20 is a semver-major release. That makes it the best time to ship a breaking change, as semver-major releases is where breaking changes go, as opposed to 20.1.0 or some other minor release. It’s impractical and unfair to have asked our contributors to delay releasing their work until Node 21 in October. It’s also unnecessary to add a new flag to gate an experimental feature that is already flagged and with printed warnings and with extra-explicit warnings in its documentation “This API is currently being redesigned and will still change.” Sure, could we have done even more? Perhaps, but at some point one has to consider what’s reasonable to ask of Node and its contributors.

It sounds like the best case scenario from the APM vendors’ perspective is for the Loaders API to get to stable as soon as possible. There are only three TODO items left to get it to stability, and two of those have open PRs (one that’s ready to land, and another that’s in progress). I would suggest that the best use of vendors’ resources is to help us get these last few PRs over the finish line and the overall API moved into the “baking”/”watch for bugs” final phase before it is classified as stable. Then vendors can migrate their libraries to the new API, both for ESM and for CommonJS, with the confidence that comes from stability and semver guarantees.

Qard commented 1 year ago

If we consider ts-node and import-in-the-middle critical, they should be added to CIGTM to prevent them from breaking between releases or communicating to the maintainers that they were due for an update.

Yep, this was brought up in response to this change and I agree it's a step we should take. Though do we want things using experimental status features like that in CIGTM? I don't think these things breaking should block release, I just want awareness and communication around it earlier than us finding out a major change is landing literally the week of a major release.

CommonJS is stable too but the monkey-patching that APM vendors do to make instrumentation work there is completely undocumented and unsupported, not even experimental.

APMs have been moving away from monkey-patching for several years now. Datadog has been fully built on diagnostics_channel for awhile now, but we also depend on ecosystem adoption so it's not as simple as just building some new feature, opening a PR, and calling it done. Migrating an ecosystem takes a long time. Even without monkey-patching though we still need to know when modules are loaded so we know when to load the corresponding instrumentation.

ultimately what this request is asking for is for APMs to be able to do their jobs more cheaply.

That's not what this is about. APMs are big companies with money to work on the things we need to work on. The issue is not how much we have to spend, it's the things we have to depend on in Node.js core which can break suddenly. We're worried about customer applications breaking and them blaming us for what is actually just Node.js shipping breaking changes before we've had time to react.

I would 100% agree that the best solution to that is that we should just be involved in developing and maintaining those things, but convincing execs to support open source dev time is challenging. I'm working on it, believe me I'm very eager to get us more deeply involved in Node.js core, but OSS work is very hard to quantify in business numbers impact to justify investment in it. We're not shipping a feature that a customer is going to pay for, we're improving a system that will have some much more nebulous benefit to our dev teams. Unfortunately, as it stands currently, some of the most essential tools for production application maintenance are in conflict with the pace of development churn of the features we need to rely on in Node.js core and this is causing problems for everyone.

I'm not saying to stop or even to slow down, but I am saying to please try to be proactive about making sure your early adopters know the what and when of incoming code changes so we can be prepared. When developing new features we should be talking to our early adopters to gather feedback so we know what we're building is going in the right direction and will serve their needs.

I would say probably what we need here is some better communication channels for interested parties to track what's coming without needing to sift through the entire firehose of data that watching github repos provides, and especially in a way that is consistent across the project. Perhaps there could be regular project-wide user engagement meetings to discuss things in active development, gather feedback from early adopters on direction, and take any suggestions to help guide further development?

In my experience, having been a Node.js collaborator for many years now, our biggest weakness has always been in communication, especially with those outside core. If I as another collaborator was not expecting this change so suddenly, those outside of core definitely got caught off-guard here. I think we can do a lot better about communicating about what we're working on.

aduh95 commented 1 year ago

The move of ESM loaders to their own thread was a very long time coming (Bradley opened https://github.com/nodejs/node/pull/31229 back in 2021), in hindsight it seems obvious that this could have been communicate better. As someone who worked on that change, I must say I did not expect it to be so breaking, and there was some use cases that weren't on my mind at all because I had no idea this was how the ecosystem used that feature. I guess what I'm trying to say is that communication could be improved both ways. Hopefully, Node.js 20.x breaking your tools is another argument you can bring to your boss why you should be more involved in the dev process, and allocating resources to work on open-source is actually a good business decision 😇

Perhaps there could be regular project-wide user engagement meetings to discuss things in active development, gather feedback from early adopters on direction, and take any suggestions to help guide further development?

That seems related to https://github.com/nodejs/TSC/issues/1333.

Though do we want things using experimental status features like that in CIGTM? I don't think these things breaking should block release, I just want awareness and communication around it earlier than us finding out a major change is landing literally the week of a major release.

I'm not part of the Release WG, so I might be wrong here, but I think the process is to run CITGM on the candidate release, ping the package owner of the failing packages, but that wouldn't automatically block the release.

Qard commented 1 year ago

Yes, we're definitely trying to present this issue as a case for us to be more involved. A bit unfortunate that it takes things breaking to make that happen though...

And yes, I was aware this was a thing that people had been talking about for a long time, but I think the long time from idea to release was also part of the problem. It seemed like a thing that would maybe happen someday until suddenly that someday became now without really any noticeable communication about how close it actually was to that. I've been a passive observer of loaders stuff, I poke my head in when I have the time and sometimes engaged a bit, but I've been busy with other things lately so was not following it so closely for the last month and then suddenly it became very immediately relevant.

And yeah, that sounds a bit related, though I think a two way channel is a bit more what we need to be sure we're actually reaching the community and not just tossing content over the wall and assuming it gets read. There's a ton of value to be had by engaging more directly with the bigger Node.js-using orgs out there and discussing how we can help each other. That also makes the case for getting more corporate-backed involvement much easier if we can extract not just ideas from these people but also effort when it would benefit their org.

GeoffreyBooth commented 1 year ago

The loaders roadmap is here: https://github.com/nodejs/loaders. You can watch that repo for changes. That would give you many weeks or months heads-up before something lands. Whenever a PR gets opened for a TODO item on that list, there’s a PR to the README to add a link to that PR to the readme; so you could look at the list for links to PRs to see which ones are getting close.

You can also join @nodejs/loaders if you aren’t already, to get pinged for all loaders-related issues and PRs. We have representatives from Yarn and ts-node on the team; they’ve been involved in defining the roadmap and reviewing the designs, and testing the PRs. If there are particular capabilities you’d like to see in the Loaders API, or issues you’re having with it that are more general than isolated bugs, you can open issues on that repo.

bengl commented 1 year ago

Hopefully, Node.js 20.x breaking your tools is another argument you can bring to your boss why you should be more involved in the dev process, and allocating resources to work on open-source is actually a good business decision 😇

Oh hey that's me 😄.

I just want to point out to all the folks saying "Node.js 20 is not LTS, and folks should not use non-LTS in prod": Yep. I agree 100%. However, many folks run the latest-and-greatest version number available in prod, despite all advice to the contrary. This is a reality that APM vendors need to work with. I'm not suggesting altering any policy based on this. I'm just asking that that be taken into account in these discussions.

Anyway, I just opened a PR on CITGM that adds import-in-the-middle. If this is accepted, then maybe that ought to come close to resolving this thread, since I've put myself and @Qard there as maintainers? At least a step in the right direction, I'd hope.

Qard commented 1 year ago

The loaders roadmap is here: https://github.com/nodejs/loaders. You can watch that repo for changes. That would give you many weeks or months heads-up before something lands. Whenever a PR gets opened for a TODO item on that list, there’s a PR to the README to add a link to that PR to the readme; so you could look at the list for links to PRs to see which ones are getting close.

You can also join @nodejs/loaders if you aren’t already, to get pinged for all loaders-related issues and PRs. We have representatives from Yarn and ts-node on the team; they’ve been involved in defining the roadmap and reviewing the designs, and testing the PRs. If there are particular capabilities you’d like to see in the Loaders API, or issues you’re having with it that are more general than isolated bugs, you can open issues on that repo.

That's a lot of effort to follow activity on a single feature. There should be a simpler and more generalized approach for the ecosystem to keep up and get involved with what's coming without needing awareness of specific repos like that or tracking the firehose of PRs. Joining a team is also not really a thing that a random outsider can do. I'm not asking what I as a collaborator can do, I'm asking what any random early adopter in the ecosystem can do to keep up with progress.

mhdawson commented 1 year ago

I personally think the experimental part of the discussion is not the key aspect of what caused people grief in this case. We can change/break things that are in experimental but that does not mean we don't try to smooth things when we do that on purpose through communication in advance etc. In fact in a major(which covers the case being discussed) we can break things were are not experimental and when we do that we also work to communicate in advance to make things go a smoothly as possible.

In my mind the real challenge is that we try to rush stuff in just before a release, and this sometimes results in us breaking things without enough time in the main branch for them to be discovered. I'm not saying that we would have caught the issue if there was more time, but there would have been a better chance.

Qard commented 1 year ago

I think there's partly an issue of rushing stuff in late, but I think it's more a general communication issue that people outside the project don't really have a good way to know when changes are coming. Landing things as "experimental" implies an expectation that users start experimenting with it, and to do that adequately they need the data to know what the state is and what direction it's going. If we want to discourage people using something altogether why would we land it at all rather than waiting until the rest is done too? If something is truly not ready for experimentation then it could still land but should not be exposed without at least a build flag in front of it.

The main thing I think we should take from this thread is that we're not good at communicating the state and direction of experimental efforts. We should think on how we can improve that, and I'm not talking about contributor-facing visibility, I'm talking general users seeing the experimental-marked feature and wanting to try it out. We put these features in our docs so people are going to start seeing them without any exposure to the core code itself so we should be making sure those docs include enough context to understand why it's still experimental and what to expect from the direction of it.

mhdawson commented 3 weeks ago

@Qard do you think there is more discussion/concrete actions to take in order to move this forward. I don't see any comments in the last year and I'm wondering if it makes sense to close the issue.

Qard commented 3 weeks ago

Probably worth a discussion at the collab summit. I don't feel any concrete action or decision was made here, and the underlying communication problem remains.

I would really like to have at least some rules for clearer communication on the state of experimental things. For example, we could require that any experimental feature link to a context issue on GitHub with an up-to-date roadmap on development of the feature and links to related issues so someone considering trying this experimental feature has a clear place to look for status updates. It's a bit of extra effort to provide that paper trail during experimental status, but I feel like that's probably actually a good thing and would help discourage things being held in perpetual experimental limbo.

mhdawson commented 2 weeks ago

@Qard I agree experimental features would be a good topic for the summit. Not sure if there are still spaces for additional sessions but wondering if you might want to submit/lead one?