raineorshine / npm-check-updates

Find newer versions of package dependencies than what your package.json allows
Other
9.18k stars 319 forks source link

Batch npm fetch #961

Open make-github-pseudonymous-again opened 2 years ago

make-github-pseudonymous-again commented 2 years ago

Steps to Reproduce

Run ncu with >100 dependencies and a slow internet connection.

Current Behavior

It's slow.

Expected Behavior

It's fast.

Suggested solution

Ask npm to expose a route where one can query for more than one package at a time. It could even explicitly be an updates available route that only replies with a list of updates given current version, not with a list of last available versions. They will have to put limits on this route like maximum payload (ncu resolves wildcards locally), but on average I think it would lighten the load on their side and make it faster for us.

raineorshine commented 2 years ago

Thanks for the suggestion! Have you checked if the npm registry already supports fetching multiple package manifests in a single request? I presume not, but it would be worth confirming. If they don't support them, it would be a great feature request for npm (or npm-registry-fetch).

make-github-pseudonymous-again commented 2 years ago

Hi @raineorshine, I haven't searched for it.

make-github-pseudonymous-again commented 2 years ago

Hi @zkat, @isaacs, and @wraithgar! Any idea who to contact to discuss a potential addition to npm's API?

wraithgar commented 2 years ago

Changes to the npm registry itself would have to go through support. The npm cli interacts with that and many other registries (i.e. github packages registry, private registries, azure artifacts). If that route were added to any of those registries the cli could then be updated to use it.

make-github-pseudonymous-again commented 2 years ago

Thanks @wraithgar.

@raineorshine Shall I open a ticket?

make-github-pseudonymous-again commented 2 years ago

Hi @raineorshine! I would like to open a npm support ticket to request this API addition we would need. Do you give me permission to explicitly mention in this ticket the reason for this request is to improve the running time of npm-check-updates?

raineorshine commented 2 years ago

Yes, go for it!

make-github-pseudonymous-again commented 2 years ago

Here is what I sent:

New API route for batched version update information fetching

Dear support employee,

A tool I use to keep my npm packages (direct) dependencies up to date is relying on many "parallel" API queries to fetch relevant information. Currently this process running time scales quite badly with the number of dependencies. There are several ways this could be improved, and some involve having a better way to query the npm registry for information. For instance:

  • The tool currently makes one query per (direct) dependency. I think that being able to query the registry for a reasonable amount of packages in a single batch would be significantly more efficient.
  • The tool currently uses pacote.packument which does not seem to allow to query for only a subset of the fields. I think that being able to specify which fields are needed in the response would help further reduce the resource consumption of these requests.
  • The tool currently does not make use of last-modified headers, and I am not sure how these could be used with pacote.packument.
  • The tool currently does not exploit the fact that our queries are very specific: we simply want to know whether there is a more recent version available than the one configured in the package.json while respecting optional semver constraints (^, etc).

The tool's name is npm-check-updates and a relevant issue was opened at https://github.com/raineorshine/npm-check-updates/issues/961.

For the last two points: the tool is currently unable to exploit pacote's caching because the caching strategy used has no usefulness in checking for up-to-date information: see https://github.com/raineorshine/npm-check-updates/pull/578

Kind Regards, x

Regarding the third point: could this be improved now without an API change?

make-github-pseudonymous-again commented 2 years ago

Hi @raineorshine. So far I did not get any reply from NPM. However, I believe some improvement may done independently of their reply. Regarding the following remarks (copied from the message I sent to NPM):

  • The tool currently does not make use of last-modified headers, and I am not sure how these could be used with pacote.packument.

and further

For the last two points: the tool is currently unable to exploit pacote's caching because the caching strategy used has no usefulness in checking for up-to-date information: see https://github.com/raineorshine/npm-check-updates/pull/578

Two questions:

raineorshine commented 2 years ago
  • Can pacote caching strategy be mended to honor last-modified headers (or does it already honor them)?

That's a great question. I don't know the answer off-hand, but I'd be curious to know.

  • If not? Can we use something else than pacote?

Currently we only use pacote.packument, so there is a small surface area. We could use a lower level npm fetcher directly. However, pacote uses the npm config for things like proxies and registry authentication, which seems daunting to recreate.


I should mention that npm-check-updates actually does batch fetch packages. I don't know where my head was at when I neglected to say that at the beginning 🥴. The default concurrency is 8, and can be changed with the concurrency cli option.

https://github.com/raineorshine/npm-check-updates/blob/d5f3778f7447e66f0f7048661d481d742e7802b2/src/lib/queryVersions.ts#L111

However, as you know, it still does them in separate HTTP requests. Just thought it would be useful to note here.

wraithgar commented 2 years ago

So ... this issue made me do a little digging into pacote / npm-registry-fetch / make-fetch-happen and it turns out that the npm cli is not leveraging http agents, even a little. make-fetch-happen creates one but it only uses it once per call, and would only end up reusing a connection on things like socket timeouts and connections refused ... the one time it won't do any good.

I've brought this up internally and we'll probably look into it for the cli. Unfortunately for you even though pacote accepts an agent arg, make-fetch-happen ignores and clobbers it https://github.com/npm/make-fetch-happen/blob/main/lib/remote.js#L44

This is definitely a solution here, allowing folks to pass in their own agent to actually utilize node's http agents properly. I predict that we'll also make the cli or pacote do this by default too, allowing for folks to still pass their own if they want.

wraithgar commented 2 years ago

I was mistaken! agent is obeyed by make-fetch-happen!!! https://github.com/npm/make-fetch-happen/blob/main/lib/agent.js#L44-L46

You should be able to generate your own keepalive agent and pass that in as agent to pacote opts.

wraithgar commented 2 years ago

After a little more digging, this time at an actual computer, it looks like make-fetch-happen does reuse agents, storing them in the lru cache

https://github.com/npm/make-fetch-happen/blob/main/lib/agent.js#L57-L59

The agents are keyed by the opts passed in so if they're not changing, you're getting the same agent.

TLDR keepalive is already being utilized for both http and https calls with pacote.

wraithgar commented 2 years ago
* Can `pacote` caching strategy be mended to honor last-modified headers (or does it already honor them)?

The caching in the npm toolset was overhauled last year and does honor this header. It's one of the headers that is explicitly stored in the cache: https://github.com/npm/make-fetch-happen/blob/dadba9a2d5fc718c9179d9fe13c436ac4d69303d/lib/cache/entry.js#L42

The caching semantics themselves are handled by https://github.com/kornelski/http-cache-semantics, which does properly use this header.

However, it seems that because your module does something extremely specialized, it can use a route that the npm cli can't for fetching packet versions. The npm cli has to fetch the full packument so that it knows ALL the versions, which is needed when calculating a full node_modules tree to satisfy semver ranges of all the packages in that tree. Because this module only ever cares about the latest version, you can use the npm registry route that does only that. It's much much less data to request across the network.

$ npm dist-tag ls npm-check-updates
latest: 12.4.0
next: 12.2.2-1

$ curl https://registry.npmjs.org/-/package/npm-check-updates/dist-tags
{"latest":"12.4.0","next":"12.2.2-1"}⏎

Caching on that route is less optimized, because it's returning metadata from the packument. However, the npm registry only ever tells caches to store packuments for 5 minutes anyways, because they are the data that changes when new versions are published. For all intents and purposes they are inherently out of date. The 5 minute value is mostly to prevent a firehose situation on the registry. In fact the npm cli informs the cache to fetch it from the registry if at all possible: https://github.com/npm/cli/blob/2db3eff44d7bb86b956207cc63d279806fd14ed0/lib/commands/dist-tag.js#L188-L190

It's worth looking into given the nature of this module.

wraithgar commented 2 years ago

Here's a gist showing how to do far fewer network requests to find the newest version of a given package is available from the npm registry. I limited the pagination to 10 just to show pagination working. Be careful making that query too large because there is a limit on url length, though I don't know offhand what that limit is

https://gist.github.com/wraithgar/d5c974ee8e4a7e08ded89742c15e96ce

This one does semver comparison to NOT do breaking changes. You could just not pass in the existing manifest semver to just get the max. This does NOT take into consideration the latest tag of a package. It's purely semver comparison.

ETA: the dependencies endpoint is NOT tag aware, so you would have to pass in * instead of ${encodeURI(manifest.dependencies[d])} on line 13 in order to have the registry return all versions, not just ones bounded by the current semver of the package.json

wraithgar commented 2 years ago

I updated the gist to give examples of "highest version possible without changing the package.json" which is effectively what "npm update --save" will do (dunno if that shipped yet), and "highest semver version possible based on everything in the registry"