Suggestion: Provide standards around integrity between source code and published package

shaunwarman commented 5 years ago

Right now, it seems that most maintainers may publish their packages from their local environment. There should be a way to verify what is published against the public source code or specific git sha to maintain transparency of what is being published. Not only will this mitigate out of sync issues or accidents, but will provide greater confidence that additions aren't added as they are published (potentially malicious).

Not sure if this is the best place for this, but after reading through other issues and recent resources I thought I better put this down somewhere. And it brings up the discussion of maintainers permissions to not only package registry, but SCM as well.

sompylasar commented 5 years ago

That does not guarantee anything if the package is not built with a lockfile in place. The output will certainly differs.

Yes, that's why I'm talking about being able to mark as verified only the packages that conform to a package build process and maintenance convention (there can be more than one supported).

ljharb commented 5 years ago

I would be strongly opposed to any certification that required packages to have lockfiles.

sompylasar commented 5 years ago

@ljharb

I would be strongly opposed to any certification that required packages to have lockfiles.

Would you be opposed to reproducible builds from source at a commit?

I understand that package lockfiles do not lock the surrounding environment and you may produce a different build having a different version of Node.js for example, but they at least pin down the most dynamic part of it, the node_modules dependencies.

ljharb commented 5 years ago

Since it would take a lockfile to achieve that, yes.

sompylasar commented 5 years ago

@ljharb

Since it would take a lockfile to achieve that, yes.

This is a very strong opinion background of which isn't clear to me. Please elaborate.

What are the other ways you can think of to achieve the reproducible builds from source if not lockfiles?

Which ways of verifying package integrity and lack of injected malicious code do you think of?

ljharb commented 5 years ago

Reproducible builds are critically important for apps; I don't find any value in that for packages. The artifact (the tarball) is the reproducible part; if i want to make another build it'll be a new version, so there's no reason it needs to precisely match the previous one.

skonves commented 5 years ago

My understanding is that there are a few opposing schools of thought concerning committing lock files.

Perspective 1 The presence of a lock file in a package's repository doesn't influence the dependency tree once the package is installed.

If the package developer (let's call her Grace) uses a lock file, then Grace will see a certain dependency tree. When another developer (let's call him Martin) installs the package, the installation process does not respect the lock file, meaning that he will see a potentially different dependency tree. (This is known good behavior and I won't jump into the rationale here.) Stated differently, the lock file only locks dependencies on the developers (Grace's) machine, but not for consumers (Martin, et al).

From this perspective, the lock file doesn't provide value to consumers and could potentially even create confusion around what will be installed. This is a good argument for not including the lock file.

Examples: https://github.com/expressjs/express

Perspective 2 Lock files control the specific dev dependencies that are used during the build process and thus control the artifacts within the package.

For projects that require a build script, if the package developer (Grace) doesn't use a lock file, then build output may vary between installs. With a lock file, running npm ci prior to running the build script guarantees that the output of the build is deterministic for a given set of source code. When installed by a consumer (Martin), the build output is now just static files contained within the package, dev dependencies are not installed, that the latest allowed production dependencies are installed.

From this perspective, the lock file provides consistency to the build process. Grace or a CI system (let's call him Travis 😄) can run npm ci then npm run build and get deterministic results. In this case, the lock file does provide value and thus should be committed, even though it doesn't affect the dependency tree in Martin's application.

Example: https://github.com/reduxjs/redux

My current opinion Based on these two perspectives, I don't think every repo needs a lock file because based on the nature of the repo it may or may not provide value. That being said, the "missing-ness" of a lock file shouldn't cast shade on the legitimacy of a package.

Right now, the TBV tool makes its first attempt at packing the code without installing deps. For packages like express that works great :+1:. For packages like redux, pack can't run because the prepare script needs to run babel's rollup. This is where it runs npm ci to ensure that it has the same dev dependencies as when originally published. This is also where TBV currently fails with react because that repo doesn't have a lock file.

What I hadn't previously considered was to just run npm i in the absence of a lock file (instead of ci) in an attempt to recreate the build artifacts with the latest allowed dev dependencies.

...if the package developer doesn't use a lock file, then build output may vary between installs.

I hadn't considered the case where it doesn't.

justinmchase commented 5 years ago

@ljharb @sompylasar

Perhaps we should establish the Code Law of Identity; If you build the same code twice it should have the same end result.

You don't have to publish your lock file but you should probably have it in your repo so that from the perspective of any given packages codebase it produces identical builds from identical source.

Perhaps if npm just accepted an optional git commit hash during publication then 3rd parties (including npm possibly) could verify that the package respects the law of identity by checking out that commit and re-building the code. You could then provide a guarantee to downstream consumers that the code they are pulling is precisely what is committed in github.

From there an optional flag or command could be added to npm that would produce errors or warnings if packages are pulled that aren't flagged by npm to have been verified, or perhaps they are not pulled at all until they are flagged as being identical. Some wouldn't care about this but some might.

This would allow us to have some assurances that the code we are pulling from npm is both publicly visible and matches what was published.

There's no known way to prove code isn't malicious in an automated way (yet) but given a foundation of opennes and identity, we could then possibly establish a level of confidence based on some other facts about the committed code, such as if it was reviewed first, if it triggered CI, if there were any issues associated with it, independent audit, etc.

No matter how we handle it we'll probably have to always be diligent and rely on community but that is part of the principles of open source. What we can automate and prove, we should try to do when possible. It will help increase confidence and quality.

sompylasar commented 5 years ago

@skonves Thank you for describing the two perspectives!

@justinmchase Yes, thank you, we're on the same page here.

sompylasar commented 5 years ago

@ljharb

Reproducible builds are critically important for apps; I don't find any value in that for packages. The artifact (the tarball) is the reproducible part; if i want to make another build it'll be a new version, so there's no reason it needs to precisely match the previous one.

In fact another build from the same code (which without semantic-release automation contains a version, and the commit being published also most likely has a version tag) is another build of the same version, but not another version. In semver there's sometimes a separate fourth field called build number which describes that (see e.g. Chrome version).

sompylasar commented 5 years ago

Inbefore, I understand that npm prohibits publishing the same version twice, if that's what you were referring as a new version @ljharb

dominykas commented 5 years ago

The lock file does not have to live in a commit, nor does it have to live inside the tar. It is trivial to automate publishing it as part of GH release artifacts. This means that you have it, in case you need it, but you don't force it onto your contributors or users (and yes, I have little love for lockfiles myself, but we do keep them around for when we need to patch production, in a similar way as I described here, ie not affecting npm installs, unless explicitly needed).

Is that something a lot of maintainers will do? Probably not. Is that something the folks in this fine maintenance WG can help with? Definitely. Are reproducible builds something that businesses would be interested in? I think so, if that can be provided without overheads. So I think it's in scope and moves towards goals of this WG?

I think there's a lot of ways to solve the original problem stated in the topic of this thread. I'm very happy to see userland attempts. Maybe we need a break here, as part of this WG, and let these attempts move forward, maybe invent some competing ideas, and bring them back to build a standard and a recommendation.

It is very clear that there is interest in this, and there is value in this in at least some contexts, so I'm honestly surprised at the opposition. What are the downsides of establishing such a standard?

mcollina commented 5 years ago

Myself and a lot of other maintainers with a huge numbers of packages think lockfiles are a bad idea for modules. I hope this clarifies our position: https://github.com/sindresorhus/ama/issues/479#issuecomment-310661514.

mcollina commented 5 years ago

lockfiles are a very good idea for applications, which is the high majority of users. However, they are not really the target of this initiative.

dominykas commented 5 years ago

@mcollina this is not about lockfiles. I don't think they make sense as part of the package or part of the repo. They do make sense as a snapshot in time to achieve reproducible builds.

Once again - they should not be published on git or npm - only kept around as a convenient metadata format.

ljharb commented 5 years ago

@dominykas package-lock is already not published on npm; it'd be fine to keep a lockfile around if it didn't constrain npm install, but the only way to do that would be to rename it and commit it to git.

That might be reasonable, however - a sort of package-release.json that you'd update in-place in git, and would effectively be npm install --package-lock-only --package-lock && mv package-{lock,release}.json right before publishing a release.

dominykas commented 5 years ago

@ljharb that is also an option, but it depends on the toolchain. I've literally spent this week setting up semantic-release, because we have locked masters and therefore can't have version commits. This also means that we need to push release meta data outside of git repo. We also have a pattern where we push shrinkwraps into dangling tags (i.e. a shrinkwrap-X.Y.Z for every vX.Y.Z). Basically anything but master (or npm, for various reasons) :)

I wish npm had a way to store this, and some other, release meta data, but not necessarily distribute it with the package.

styfle commented 5 years ago

A possible solution to this problem (integrity between source code and published package) is to provide a "builder" service that take the source code, builds it, then publishes the output.

ZEIT Now is probably 80% of the way there because builds are immutable, every build has a unique URL, the source code is captured during the build process, and you don't need a repo to publish. But if you do have a repo, you can connect it to the build system so that pushing to master (or any branch) automatically publishes.

There's a couple changes that would need to be implemented in npm:

The first missing piece is some way to link the published npm package back to the builder source so that npm (and users) can verify integrity (probably add a field in the generated package.json).
The second missing piece is some sort of allow-list that npm uses to trust certain builders (if anyone could host their own builder service, they could circumvent this integrity).
The third piece is an optional flag that users can use to opt-in to this type of integrity check. Something like npm install --builder-integrity that will stop with an error if any package is missing this new builder field in the package.json or the builder field is not in the allow-list.

Bene-Graham commented 5 years ago

A possible solution to this problem (integrity between source code and published package) is to provide a "builder" service that take the source code, builds it, then publishes the output.

ZEIT Now is probably 80% of the way there because builds are immutable, every build has a unique URL, the source code is captured during the build process, and you don't need a repo to publish. But if you do have a repo, you can connect it to the build system so that pushing to master (or any branch) automatically publishes.

There's a couple changes that would need to be implemented in npm:
1. The first missing piece is some way to link the [published npm package](https://react-tsx.now.sh) back to the [builder source](https://react-tsx.now.sh/_src) so that npm (and users) can verify integrity (probably add a field in the generated `package.json`).

2. The second missing piece is some sort of _allow-list_ that npm uses to trust certain builders (if anyone could host their own builder service, they could circumvent this integrity).

3. The third piece is an optional flag that users can use to opt-in to this type of integrity check. Something like `npm install --builder-integrity` that will stop with an error if any package is missing this new builder field in the `package.json` or the builder field is not in the _allow-list_.

I was initially thinking the same thing.

But what stops someone from having a build task that inject malicious code into the "built package"?

Wouldn't a per module sandbox / permission system to protect against certain actions (Network, Disk Access, etc) help?

freewil commented 5 years ago

@styfle

A possible solution to this problem (integrity between source code and published package) is to provide a "builder" service that take the source code, builds it, then publishes the output.

I was also thinking this. Basically the idea is to have a trusted third-party do the packaging from source.

Docker Hub might be a good place to look for inspiration. With Docker, it can be an even bigger issue, as docker images are built into binary files and pushed up to a public repository. Although, Dockerfiles/images are generally much more simple.

https://blog.docker.com/2013/11/introducing-trusted-builds/

sompylasar commented 5 years ago

But what stops someone from having a build task that inject malicious code into the "built package"?

There shouldn't be any user-provided build task code while using one of the trusted builders, only a build pipeline configuration, for example, like https://github.com/pikapkg/pack does.

Wouldn't a per module sandbox / permission system to protect against certain actions (Network, Disk Access, etc) help?

Yes, that, too, may increase confidence, and complexity. That is needed, too, but is orthogonal to this particular discussion about package build-from-source integrity.

ljharb commented 5 years ago

The build process itself needs to be able to contain user-written transforms, if it's babel, or a bundler, to support common and valid use cases.

sompylasar commented 5 years ago

The build process itself needs to be able to contain user-written transforms, if it's babel, or a bundler, to support common and valid use cases.

It doesn't if all the common and valid use cases like babel and bundler are standardized into trusted builders. I guess there should be either no way, or a hard and impractical way, to tamper with the executable content (e.g. JavaScript code) of a trusted build by providing a malicious build step.

But again, I'm not a security expert. I pinged the Node.js SecurityWG to take a look at this discussion as I think it is rather opitionated, not engineered as a document with clear sets of requirements, tradeoffs and solutions.

travi commented 5 years ago

It doesn't if all the common and valid use cases like babel and bundler are standardized into trusted builders

i think you missed a very important detail in the point you quoted:

contain user-written transforms

part of the valid use of babel is to apply custom transformations beyond those provided by babel itself or even official plugins. these are often included as devDependencies. locking down approved builders that didnt allow the inclusion of those custom transformations would eliminate a significant portion of the value of tools like this, but allowing them does allow for changes that would be difficult to ensure are not malicious.

sompylasar commented 5 years ago

part of the valid use of babel is to apply custom transformations beyond those provided by babel itself or even official plugins. these are often included as devDependencies. locking down approved builders that didnt allow the inclusion of those custom transformations would eliminate a significant portion of the value of tools like this, but allowing them does allow for changes that would be difficult to ensure are not malicious.

I'd like to see some data about how popular these are, and whether they can be standardized. They can be published as a trusted package that itself does not use them, thus building a chain of trust. I hear you, but there's always a tradeoff between security and convenience.

ljharb commented 5 years ago

@sompylasar packages are all already implicitly trusted; the attack vector i thought you're trying to address here is when a trusted package is hijacked, for which the same risks exist for any package, including ones in the build system.

sompylasar commented 5 years ago

@sompylasar packages are all already implicitly trusted; the attack vector i thought you're trying to address here is when a trusted package is hijacked, for which the same risks exist for any package, including ones in the build system.

Yeah I guess you're right. Well that's what I mentioned above. We all would benefit from a document (like an RFC) with the list of potential attack vectors, and how they are addressed or can be.

skonves commented 5 years ago

I have actually already put together a draft of a graph of the NPM ecosystem for just that purpose. Each edge and vertex definitely has its own unique set of challenges. For the past few evenings, I have been working on a writeup, but I wouldn't mind opening it up for community feedback/contribution.

Here is the draft of the graph I have so far: ecosystem draft

I have a draft on Medium right now, but I will move that to a more public place in a heartbeat if it means more eyes on it. Thoughts?

FWIW, my work with TBV/verifynpm and @sompylasar's similar work with npmverified seems reasonably confined to the "publishes to" edge near the top; however, I strongly agree with Ivan that understanding the full picture helps inform any proposed solution (particularly what is in and out of scope).

JamesMGreene commented 5 years ago

Wouldn't a per module sandbox / permission system to protect against certain actions (Network, Disk Access, etc) help?

I saw an interesting post about this idea from the NPM perspective not too long ago: https://hackernoon.com/npm-package-permissions-an-idea-441a02902d9b

CC @davidgilbertson

tlhunter commented 5 years ago

FWIW, we built Package Diff to gain visibility into npm package updates by directly examining the package instead of looking at the potentially-misleading source code repository: https://diff.intrinsic.com

sompylasar commented 5 years ago

FWIW, I built package diff for the purpose of comparing next prepared version with the latest published one for the review and version bump purposes https://github.com/sompylasar/zmey-gorynych/blob/9029972/bin/cli.js#L512

justinmchase commented 5 years ago

@tlhunter very nice.

I had one thought, one way to get around it. What if you were using node-pre-gyp which fetches pre-compiled binaries rather than builds them from the source in the package and someone malicious had uploaded a pre-compiled binary from something other than the source that is published, how could we verify that they match?

tlhunter commented 5 years ago

@tlhunter very nice.

I had one thought, one way to get around it. What if you were using node-pre-gyp which fetches pre-compiled binaries rather than builds them from the source in the package and someone malicious had uploaded a pre-compiled binary from something other than the source that is published, how could we verify that they match?

For the system to be perfect we'd need to throw away the concept of npm install scripts. One could perform an npm install of every package ever made, on a set schedule, and constantly diff the artifacts left behind, but even that wouldn't be perfect. Like, the fetching of pre-compiled binaries might include information about the host machine. The event-stream incident, for example, would only inject malicious code if certain criteria were met.

I'd like to see some sort of package manager service which does the following:

Distributes both a tarball AND is the git repo
1. The user wouldn't be able to generate tarballs
2. The server would guarantee correlation between git tags and tarballs
Disable all installation scripts

justinmchase commented 5 years ago

Yeah, I agree that sounds like the only way. I could imagine npm flagging releases that were published that way and you could then opt to only accept verified dependencies if you wanted.

It would be tricky for pre-compiled binary modules as they could have a large build matrix for platforms and versions but it would be enormously helpful for confidence and security.

shaunwarman commented 5 years ago

Interesting and related proposal in the golang ecosystem: https://go.googlesource.com/proposal/+/master/design/25530-notary.md

nodejs / package-maintenance

Suggestion: Provide standards around integrity between source code and published package #77