Create an affected function format specification for npm (JS) advisories

taladrane commented 1 year ago

Hi OpenJS friends 👋 we'll be joining your Security WG meeting this upcoming Monday, October 2, to discuss the below idea, but please feel free to comment with feedback at any point! We're hoping to be able to finalize the discussion by October 16th 🙇‍♀

Background

GitHub released a beta feature last year that allows Dependabot alerts to surface calls to vulnerable functions to end users. The goal of this feature is to make prioritization and response to vulnerability alerts easier by sharing with users both if they’re actually using the snippet of affected code in their project and, if so, where they’re calling the vulnerable code. This beta feature is currently released for the pip/Python ecosystem, and we’re exploring adding support for the npm/JavaScript ecosystem. We’re engaging the community for feedback on this proposal to ensure what we’re doing is sound as it will ideally end up impacting this community positively.

One of the things we’ve discovered along the way is the need for a standardized format for the affected function names we use to refer to symbols that appear in npm (JS) advisory source code. The names given are included in metadata for the published advisory and reflected to the end user in their Dependabot alert. For our purposes, we are only considering exported functions in an npm package, so we’re hoping to avoid the complexity of defining a universal specification for all of JavaScript. The function name should be intuitive, easily parsable, and the elements of the name will be valid objects in a JS program.

Proposed affected function name requirements:

A name represents the distinct hierarchic call graph or access path relative to its package.
Names can be deterministically parsed
Affected functions are named relative to their vulnerable package, with no consideration to how they are imported downstream
An affected function name is stable across unrelated changes
An affected function name should not be location (in a file) based (e.g. row, source, column)
A human can read the name and have some intuition about what it represents

Proposal

(npm_package).foo.bar.biz.baz

where foo.bar.biz.baz is a walk of a public path from the root of the package. Having (npm package) called out explicitly in the function name itself rather than being inferred from affected product data gives the benefit of capturing a root level anonymous function, which are quite common in the npm world. Encapsulating the package name in ( ) is done to aid machine parsability. This also happens to match what a user would write to use a single function package and how it would get laid out in documentation for users.

ljharb commented 1 year ago

Be aware that in a package without the exports field, every file is accessible, and so you may indeed need to account for most of the language.

ljharb commented 1 year ago

Also, there is not one single “root” - packages have 0, 1, or N entry points, and all must be considered.

lsto commented 1 year ago

Be aware that in a package without the exports field, every file is accessible, and so you may indeed need to account for most of the language.

Do you have a link to such an example package?

ljharb commented 1 year ago

https://npmjs.com/es-abstract has thousands, as an example of one of my own packages.

mhdawson commented 1 year ago

Many packages would be like this one with 'main' being used instead of 'exports' - https://github.com/nodejs/node-addon-api/blob/main/package.json.

In terms of the proposed function format, it seems reasonable, but I think planning to have multiple paths to the same function to which the advisory applies (assuming multiple exports or assuming all files are accessible) would be necessary.

I assume that foo is the name of the affect function in foo.bar.biz.baz. In that sense foo almost seems to be out of order in terms of the walk but I'm not 100% sure on the order of the rest of the functions from the description. I'm wondering if something like (npm_package).bar.biz.baz.foo would make sense with bar being the top extneral function , and baz the functions that calls foo ?

darcyclarke commented 1 year ago

@taladrane / @darakian thanks for sharing your time & this spec with the group in today's call. I'll try to reiterate some of the feedback I had which is atop of +1-ing @ljharb & @mhdawson's comments:

1. On Requirements

1.1. 🤔 "A name represents the distinct hierarchic call graph or access path relative to its package."
- how could a "call graph" be determined if you don't consider downstream consumption (ie. 1.3)?
1.2. ✅ "Names can be deterministically parsed"
- this makes sense IF 1.1 can be clearly explained/understood
1.3. 🤔 "Affected functions are named relative to their vulnerable package, with no consideration to how they are imported downstream"
- this seems incompatible with your goal 1.1 of representing a "call graph"/"access path"
- conceptually, you'll be representing a reachable path which would be indistinguishable from how a downstream consumer would/could interact with the package
1.4 ✅ "An affected function name is stable across unrelated changes"
- this makes some sense to me although...
1.5. 🤔 "An affected function name should not be location (in a file) based (e.g. row, source, column)"
- why?
- having an explicit selector &/or set of selectors that map vulnerable function calls (similar to a stack trace) seems like a perfectly fine/valid method of identification
- this is a strategy other tools take (ex. SemGrep -> ./program.js:1)
1.6. 🤔 "A human can read the name and have some intuition about what it represents"
- the path to a vulnerable function call should be, above all else, accurate
- making a selector that is virtual in nature (ie. respecting goal 1.5 & not providing path information) is at odds with clearly defining the mental model a developer will need to understand it

2. On the Proposed Grammar

(npm_package).foo.bar.biz.baz

I'm not sure what npm_package will represent. As others already noted, the "default" export of a package could be anything. Package names can also be duplicative across different JavaScript package registries &, notably, within projects/packages you'll often find forked/patched/aliased versions of packages which should not be incorrectly attributed to package's of the same name distributed in other locations. It's already a problem today that npm is a catch-all term used for the JavaScript package ecosystem when there's actually alternative package managers whom interpret the package names/specifiers differently.

Example project

# github.com/<owner>/lodash/
├── program.js (A) -> `const _ = require('lodash'); _.at(); // vulnerable call...`
├── package.json (B) -> { "name": "lodash" ... }
└── node_modules
    └── lodash
        └── package.json (C) -> { "name": "lodash" ... }

# registry.npmjs.com/lodash/<version> (D) -> { "name": "lodash" ... }

In the example above, the specifier lodash.at is ambiguous when you compare it to the various usages of the package name "lodash". It could be incorrectly inferred that the selector is for the root/project-level package or the nested/local package inside node_modules &/or even the public registry package (which is most often the case you intend).

I think its probably best to remove the package_name altogether & ensure you pass around the explicit "call graph" as referenced in 1.1. Any other package information/metadata can be linked out to/referenced aside from the vulnerable path to the function call (ex. a vulnerable function call from the npm package lodash could be "_.at()" in my project's program.js & then link out to an explicit CVE)

If you really need to have the name in the schema, it's probably best to be explicit & reference either an integrity hash (similar to how commits are treated & referenced with git - ex. (<package name>:<hash>)<dot delimited, prototype chain>) OR a well-defined protocol prefixed to the package with the version information suffixed (similar to how the existing package specification works for the package manager - ref. https://docs.npmjs.com/cli/v10/using-npm/package-spec).

Example alternative grammars

For a vulnerable function call of _.at() inside the npm package "lodash" you could express that like...

_.at OR (lodash:679591c564c3bffaae8454cf0b3df370c3d6911c).at OR (npm:lodash@4.17.21).at

3. On Purpose/Goal

" For our purposes, we are only considering exported functions in an npm package, so we’re hoping to avoid the complexity of defining a universal specification for all of JavaScript"

To my knowledge, there is no such thing as "exported functions in an npm package". An "npm package" is, a poorly defined container, often used to distribute code. Because npmjs.com (aka. the Public Registry) does not enforce any kind of validation, the contents of a package can contain anything & only fails to publish when the package does not have a package.json. There is no validation of package.json beyond it's existence. This means there is no standard way of exporting a package's API & it is up to the consumer of the package to interpret its use.

This is a long way of saying, you cannot avoid "defining a universal specification for all of ~~JavaScript~~ the potential languages & runtimes a package might contain" as they are all relevant & potential entry points to potentially exploit. Vulnerabilities filed against an "npm package" may not be for/in JavaScript - so the "call stack" resolution (1.1) may also differ.

I'm sure there's other considerations to this work (ex. looking into how stack graphs are made & seeing the current state of tree-sitter/tree-sitter-javascript makes me a bit nervous about the accuracy of the call stack) but I'll leave it at this.

wesleytodd commented 1 year ago

I'm not sure what npm_package will represent. As others already noted, the "default" export of a package could be anything.

I think all the issues here boil down to: relying on the advisory reporting side to build an exhaustive list of the many paths at which a function can be imported from an npm package is a near impossible to solve task.

I think we need three things to reliably identify where an advisory points:

the package coordinate (as @darcyclarke said, what does npm_package represent)
the canonical source location where the method is exported
the token(s) which it is exported as

For number 1, we need to account for alternative registries, aliasing, and a few other complexities.

For number 2, I just can't think of any way we can get a canonical definition of where it is exported in a maintainable way without the filepath. But if we think we need that, then we need to enumerate all the the import locations, which I think is so similar to just using the filepath relative to the package root that I fail to see the difference.

Number 3 should be fairly straight forward, but if we want to account for things like dynamic exports with CJS then it might get rather complicated.

darakian commented 1 year ago

what does npm_package represent

To clarify on that point I had been (implicitly 🤦) using it to refer to the namespace owned on npmjs.com eg. lodash -> https://www.npmjs.com/package/lodash @types/lodash -> https://www.npmjs.com/package/@types/lodash

But you both bring up a a good point about supporting other package registries as well. A simple extension to that could be something like <registry>(npm_package).path.to.thing, but maybe that's starting to get ugly. Another consideration is that each registry might have its own pathing syntax. I'm not sure if other registries are commonly used in ways which npm is not or vica versa.

@darcyclarke I'm not sure I agree on this point.

I think its probably best to remove the package_name altogether & ensure you pass around the explicit "call graph" as referenced in 1.1.

While that would be an improvement in technical accuracy we need to ensure that humans can provide this data and asking humans to understand and edit call graphs is a tall ask. I will be one of these humans I do not feel like I'm well equipped to handle that task. Perhaps a call graph for a specific version is some companion data which could be constructed and attached to the advisory payload, but we need a human viable route to manage this data.

re:

1.5. 🤔 "An affected function name should not be location (in a file) based (e.g. row, source, column)" why?

Our advisory data covers ranges of vulnerable packages and row, source, column are generally not as stable as APIs across version ranges. For what it's worth we do tend to also provide fix commits when we can find them which would allow someone to derive the file based location data.

Would it make sense to re-frame the question from what is a canonical way to reference an object in an npm package to what are some ways by which we can reference an object in an npm package? For the purposes of bootstrapping data gathering we don't necessarily need a single way to refer to everything.

ljharb commented 1 year ago

For a package identifier, a PURL probably makes sense.

darakian commented 1 year ago

For a package identifier, a PURL probably makes sense.

You mean for handling multiple package registries? If so, purl only has an npm purl type (for js stuff) at the moment, so that would need to be expanded on a per-registry basis as desired.

Stepping back from package identification though; function data is subordinate data on an advisory which would identify the package itself.

BekaValentine commented 12 months ago

Any reference to call graphs is probably erroneous. Call graphs don't relate to the names of the callable items. Access paths certainly are relevant, of course.

darakian commented 11 months ago

Hey all, popping back in to ask another question or two 😅

Within the context of npm modules do you find Foo.prototype.bar as an intuitive name for instance methods of a class? On the same thread could a static method then be defined as Foo.bar? This would reserve the first level in the dot separated namespace of a class for static methods. Happy to hear alternative syntaxes if you got them 😄

Minimal example

class Foo {
  static bar() {console.log("test1")}
  bar() {console.log("test2")}
  prototype() {console.log("test3")}
}

Foo.bar points to the static bar doing a console.log("test1") Foo.prototype.bar points at the non-static bar doing a console.log("test2") Foo.prototype.prototype points at the non-static prototype method doing a console.log("test3")

If Foo were a top level class of an npm module baz we would have paths (baz).Foo.bar, (baz).Foo.prototype.bar etc...

wesleytodd commented 11 months ago

do you find Foo.prototype.bar as an intuitive name for instance methods of a class?

Yes.

If Foo were a top level class of an npm module baz we would have paths (baz).Foo.bar, (baz).Foo.prototype.bar

We still need to establish that baz is not a canonical definition for an npm package. I know that was not today's question, but since it was referenced again I wanted to make sure the points from @darcyclarke above are addressed

darakian commented 11 months ago

We still need to establish that baz is not a canonical definition for an npm package.

Not sure I follow you on this point. I attempted to clarify what I meant by npm package here. Could you expand on how you disagree with that package naming?

wesleytodd commented 11 months ago

Sorry, maybe we were all talking past eachother a bit. @darcyclarke addressed that in his comment above:

It's already a problem today that npm is a catch-all term used for the JavaScript package ecosystem when there's actually alternative package managers whom interpret the package names/specifiers differently.

to which you responded:

A simple extension to that could be something like (npm_package).path.to.thing

I believe that the registry url probably has to be included, but I would go revisit the things said in Darcy's comment under the "proposed grammar" section.

darakian commented 11 months ago

Ah I see what you're getting at. So, I disagree with the registry url being required for our use case where our advisory payload will include an affected product baz in our npm ecosystem which would define the location of the package. https://github.com/github/advisory-database/#supported-ecosystems npm for us means exclusively npmjs.com

The function syntax I'm trying to get feedback on would be subordinate data to an affected product and the (npm_package) part of that is necessary to capture the case where the package root is itself the vulnerable component and to differentiate from the case where there are no functions (or no known functions at least).

So to expand on my comment here The function syntax should (I think) be subordinate to a package location syntax and the location syntax can be swapped out as desired.

wesleytodd commented 11 months ago

So what happens when a project hosted on Github is loading packages form a different registry? Or has a git based package spec? Or one of the other things not covered by the import identifier? Do y'all just not report on it?

EDIT: yeah I guess that is the case. Maybe we were all confused on your scope.

ljharb commented 11 months ago

Is github packages not considered part of the npm ecosystem by github?

darakian commented 11 months ago

Correct. Only the packages on npmjs.com are considered as part of the npm ecosystem. GitHub packages is an ecosystem we've discussed but so far nothing there. I'd love to chat about that one with you out of band if it's something you'd like to see though.

darakian commented 10 months ago

Ok, since things have died down a bit here can we get some feedback on the following? We've come up with what we think is a workable format to move forward with.

The general syntax for function names is described as:

(entry_specifier).path.to.exported_function

Broadly speaking, the name consists of two parts. The part in parentheses identifies a particular JavaScript file. The part after the parentheses identifies a definition within that file.

For our part, we are only concerned about definitions that are exported from a published npm package. Therefore, the entry_specifier will always identify an npm package, and the path.to.exported_function will identify one of the definitions exported from the package’s main file:

(@package_scope?/package_name).path.to.exported_function

The basic Extended Backus–Naur (EBNF) form can be represented as:

::= "(" ("@" "/")? ")" ("." )*

It is very common in JavaScript packages for functions to be defined in “nested” or “inner” files, instead of being defined directly in the package’s main file. The package then uses a series of “re-exports” to make those definitions available to downstream users of the package. Definitions that follow this pattern could conceivably have more than one name: the names that are used internal to the package to identify the actual file that holds the definition, and whatever name it has within that file; and also the name that is actually exposed to downstream users in the package’s main file. For GitHub's part, we are only concerned with the final exported names, since that is how downstream users will have access to the definition and these are what will live on our advisories.

Token	Description	Required?
The npm package scope	No, not all packages are required to be scoped.	[a-z0-9-\~][a-z0-9-._\~]*
The npm package name without scope	Yes	[a-z0-9-\~][a-z0-9-._\~]*
The hierarchical set of 0-n members or functions that represent the exported public function path.	No	Function identifiers are specified according to ECMAScript standard names

In order to differentiate static methods from instance methods we also propose a dual-level namespace with .[static-method] and also, separately prototype.[any-instance-methods] so that if a class has both a static and instance method with the same name they may be differentiated.

A few examples: Using the (@package_scope?/package_name) we can capture single anonymous function exports such as in the package minimist eg. (minimist).

For deeper calls into the package will append the function path rooted at the entry point. For example the package tar has a number of top level exports. One of the exports, Unpack, would be represented as (tar).Unpack. The package protobufjs has a top level export util which has in it the function setProperty which would be represented as (protobufjs).util.setProperty.

ljharb commented 10 months ago

Again, an npm package has N - read, infinite - file paths that can lead into it. In other words, foo@1.2.3 is a package/version specifier, but someone could be importing/requiring foo, or foo/a, or foo/b, or any of a billion other paths. Additionally, while CJS may only ever export one value (where a JS object navigation path would make sense), an ESM file could export a default, or any of an infinite number of names, each of which could be any JS value.

In other words, your format is profoundly unworkable and inapplicable to the npm JS ecosystem, and there's tons of feedback on this very thread explaining why that's the case.

wesleytodd commented 10 months ago

I am wondering if it would be helpful if we flipped this around? Is there a format we could propose which would uniquely identify it? I would need to re-read the whole thread (which I don't have time for right now) but did we go over the pros/cons of using a relative file path from the package root? Wouldn't that solve this in that we would then just resolve the module with node's module resolution algo to see if the import did in fact map to the impacted file?

darakian commented 10 months ago

The primary pro from our point of view is that when reading someone's source code there are import statements and paths from the import that can be read. We think that producing paths of the sort proposed here could aid scanning tools which are trying to answer the question "is this codebase affected by vuln X".

but someone could be importing/requiring foo, or foo/a, or foo/b, or any of a billion other paths.

I think in my mind each of those foo/a, foo/bs would be a new entry point. Maybe there can be billions of them, but I think those do fit the format.

In other words, your format is profoundly unworkable and inapplicable to the npm JS ecosystem, and there's tons of feedback on this very thread explaining why that's the case.

I have tried to reply to the feedback as best I can.

ljharb commented 10 months ago

@darakian yes but the version is tied to the foo, not the foo/a. certainly you might have (${packageName}@${version}${optionalPath}).path.to.value, and that would work for CJS (which is most of npm), but that would not work for ESM, since there's nothing there to specify whether the import is default or a name. Additionally, the condition name in the exports field may apply (for both CJS and ESM) and so it needs to be included as well for any package that has exports (and please do note, the exports field is recursive so you may have any number of conditions or array indexes that are traversed before settling on one).

darakian commented 10 months ago

Would it suffice to allow the entry point to contain a path? Eg. (@some_scope/some_pkg/some/path/from/pkg/root)? Call paths off that entry point would still be of the form .some.call.path, but any file in the package could be used as a root so that we can capture paths not exposed from the default root of the package.

The version information would be contained in the advisory payload to which this function information would belong.

ljharb commented 10 months ago

As mentioned, that still wouldn’t work for ESM, which has 0 to N export names that a value could be available under.

darakian commented 10 months ago

Sorry for the late reply. For the moment 0 export names is out of scope for us. Maybe that's something we can address in a second version of the spec. For the 1 to N export names I think the spec we're proposing can capture them. We're not proposing a system to de-duplicate reexports but only to name exports. In the context of a security advisory if a vulnerable piece of code is accessible via two or more paths we would list all relevant paths.

openjs-foundation / security-collab-space