pnpm / supi

Fast, disk space efficient installation engine. Used by pnpm
MIT License
24 stars 5 forks source link

feat: allow to ignore files in packages #22

Closed zkochan closed 6 years ago

zkochan commented 6 years ago

The opts.ignoreFile() function allows to ignore files in packages. Ignoring means that files will neither be unpacked nor linked into node_modules

Ref https://github.com/pnpm/pnpm/issues/804

zkochan commented 6 years ago

@davej this is what we were talking about. This will allow to not unpack (ignore) some of the files of packages. pnpm will have some options to control it. Maybe, ignore-irrelevant-files=[false|safe|unsafe]

zkochan commented 6 years ago

Also @andreineculau, @vjpr please have a look. I appreciate your input

vjpr commented 6 years ago

I can’t see myself using such a feature. Seems risky. A package could potentially depend on something that is ignored and this would be very hard to track down.

Say you are working offline on a plane, you might want to see docs/ or readme.md.

Maybe if the performance improvement is big it might be useful.

zkochan commented 6 years ago

This won't be enabled by default right away. It will be a breaking change.

I see your point. However, the tarball with all the files will be on the disk anyway. So there will be a way to get the readme. I don't know how yet. Maybe a new command pnpm inspect foo or re-installation w/o ignoring files.

andreineculau commented 6 years ago

Had a question which was hard to track. Is the function taking dirname/basename or just basename? Supi -> package-store -> unpack-stream tells me that it is dirname/basename. Correct?

Now to the really hard questions: why? Especially why would pnpm be concerned with this, since pnpm is space effective?

Not only i am with @vjpr on this, but the problem that this solves is 1. not concerning a package manager 2. complex.

The problem afaics is distribution of your artifact where you may have an interest to minimize the size and skip "junk". You do not have an interest in not having those files locally when developing, and similarly you may want to have a dev/debug distro with all files included.

Now to the complex part - the definition of junk, files to ignore. Afaik including code without LICENSE or NOTICE or AUTHORS or similar may make you liable, depending on the license. Depending on the license, you may not even know which files are important for the legal part. Similarly some files look like junk but they are not depending on the context - typescript definitions, source maps, etc. Going further, some may decide to write literate code and distribute it as such - so markdown files may not be just text but code (not defending the authors just giving an example). This means that per node modules the rules are different, which the current ignoreFile doesn't account for. And then you reach the situation of blacklist and whitelist - you want to ignore all md files except this one, or except those in this repo... In one word it's complex (and complicated actually).

Fwiw nobody sane would want to remove files from a dep and distribute it, for the simple reason that you do not want to open black boxes. You may be forced to open a few because of a bug, maybe fork them, but not all. The fight should go into better practices like having the package managers notifying you that you seem to include junk, or in providing options to the installation process e.g. install with as little junk as possible (say aws-lambda could be configured with a list of patterns to ignore when decompressing) or in full. Note that the latter is still subject to then problems mentioned above, such as legal.

PS: a changeset regarding the quiet flag is included. Intended?

zkochan commented 6 years ago

@andreineculau, the reason to do this is speed. All fs operations are expensive. Ignoring one file saves: 1 disk write, 1 or more hard link creations (sometimes one package is hard linked several times to a project) and 1 integrity calculation.

Had a question which was hard to track. Is the function taking dirname/basename or just basename? Supi -> package-store -> unpack-stream tells me that it is dirname/basename. Correct?

It takes the relative names of files, so like lib/index.js.

Now to the complex part - the definition of junk, files to ignore. Afaik including code without LICENSE or NOTICE or AUTHORS or similar may make you liable, depending on the license.

This won't make us liable, it may make the users of pnpm liable. But I understand the issue. That's why ignoring of the LICENSE file won't be by default. It will be an "unsafe" option. Ignoring the LICENSE file will be possible with functionality like Yarn's yarn licenses generate-disclaimer.

PS: a changeset regarding the quiet flag is included. Intended?

quiet was removed from this package long time ago. I just noticed that it was not removed from readme. I can do it as separate commit.

This means that per node modules the rules are different, which the current ignoreFile doesn't account for. And then you reach the situation of blacklist and whitelist - you want to ignore all md files except this one, or except those in this repo... In one word it's complex (and complicated actually).

yes, I was thinking about maybe adding a second parameter to the ignoreFile function: (filename, manifest) => boolean. manifest won't be always available though

zkochan commented 6 years ago

@andreineculau. Also, look at this: https://github.com/tj/node-prune

People want it. There are certainly safe to delete files, Like rc files and CI configuration files. We can discuss that. Also, we can do it only for packages that have some permissive opensource license like MIT.

andreineculau commented 6 years ago

@zkochan yes, i know node-prune... first time I got down-voted on HN actually 🙀 https://news.ycombinator.com/item?id=15732845 Unfortunately it was misread as an attack on "why Go?" 🤦‍♂️ https://twitter.com/tjholowaychuk/status/932825437289320448

People want it.

I don't know how you asses that, but even if that's true, it's not an argument, not a strong one. People want lots of things, doesn't make it right, doesn't make it needed, and especially doesn't mean everyone has to do it (as in both package-managers and post-package-manager tools). If we talk details though, node-prune is a tool for those deploying many things (aws lambdas) very often. This pnpm feature may tackle that implicitly, but it's a superset - it will optimize installation time too, at the expense of other "optimizations" (i.e. one can argue that having all files may be an optimization versus the state of having only must-have files).

This won't make us liable, it may make the users of pnpm liable.

Correct. Didn't imply pnpm would have an issue.

Ignoring the LICENSE file will be possible with functionality like Yarn's yarn licenses generate-disclaimer.

That is not the same as having the legal left in place. That functionality simply makes it easier to grasp the licenses of the dependencies. PS: I have been through legal-analysis processes twice. It's a murky territory that you just don't want to touch.

All fs operations are expensive. Ignoring one file saves: 1 disk write, 1 or more hard link creations (sometimes one package is hard linked several times to a project) and 1 integrity calculation.

It would be really valuable to run pnpm before this change on a really dependency tree with lots of files (I know that some tweets to tj mentioned going from 600 MB to 150 MB), and compare to a pnpm version that ignores files (say the same files as node-prune?) and see how much time it actually saves. FWIW my team's node_modules is pretty large (with lots of duplication because we want the level-0 dependencies to be standalone; still on npm, not pnpm unfortunately) and it's still below 600 MB : 511 MB with 76461 files. I regard that situation to be in the top 10% percentile, if not even higher i.e. we are not the common Joe installation yet we still manage just fine. If I am to make a bet, the gain we would get from this change is nowhere near the gain that we get by using a cache + tmpfs, yet the complexity would increase considerably.


I noticed now something in what you said that makes this bad in a binary way. You're saying that you will have a hard-coded list somewhere (since you may leave LICENSE files in place, or even do it just for MIT licensed packages), and that it will be on by default in pnpm ? So users like @vjpr and me will suddenly need to add a flag to install everything in full ? For me, this is a total opt-in with a hook (leaving the decision of what to ignore solely to the pnpm user) and/or a total opt-in with a hard-coded list. Definitely not a opt-out.

zkochan commented 6 years ago

Adding this to the API doesn't mean it will be used by pnpm. I know "people want" is not an argument. The argument is speed. I will run the benchmarks and check whether it improves speed.

davej commented 6 years ago

Really interested in seeing benchmarks, both install time and disk space savings. My feeling is that I'd be very wary of making this a default and would just keep it as an option (or a 'mode' in pnpm). Would be interesting to see benchmarks on both a mechanical and SSD if possible.

vjpr commented 6 years ago

Just remembered I've seen some packages have post-install steps that run tests...

For development, I probably just want everything (and even more...sometimes i want the original source files that have been stripped using the package.json files key).

For production, I guess I would like a leaner deploy for a docker image or something. But I would not use it because small differences between dev/prod are the toughest to track down.

andreineculau commented 6 years ago

@zkochan since i have the tmpfs setup in addition to @davej 's setup, would it be possible to set up a bash script that would run the comparison? We would then transform ourselves from requesters of data to data contributors

zkochan commented 6 years ago

I use this project to measure pnpm's performance: https://github.com/pnpm/node-package-manager-benchmark

I can publish this changes and then either add a config to pnpm or make it configurable via pnpmfile.js.

zkochan commented 6 years ago

I ran the benchmarks locally with "safe" ignoring. Here are the results: https://github.com/pnpm/node-package-manager-benchmark/commit/cc747f63a4330340661c48ab9c04232a1fe06bae

Seems like it gives approximately 5% speed boost on projects that have around 600-1000 packages in node_modules

vjpr commented 6 years ago

Maybe size difference and file count difference would be useful too.

zkochan commented 6 years ago

@vjpr good idea. I'll do those measurements later

Here's the follow-up CR in pnpm: https://github.com/pnpm/pnpm/pull/947

vjpr commented 6 years ago

Just leaving a note here about recently released Turbo that we should examine for inspiration: https://medium.com/@ericsimons/introducing-turbo-5x-faster-than-yarn-npm-and-runs-natively-in-browser-cc2c39715403

vjpr commented 6 years ago

Re: Turbo. Any file that is accessed that wasn't unpacked, will be downloaded on access using Microsoft gvfs. So it is always "safe" no matter what.

I still think this feature is too risky for pnpm.

vjpr commented 6 years ago

Here is a list of files that node-prune is using https://github.com/tj/node-prune/blob/master/prune.go