oliverchang commented 1 year ago

Go, GitHub, and Rust all define their own way of specifying affected functions:

We should define a unified field for expressing these in a way that works for all languages and their nuances.

oliverchang commented 1 year ago

@darakian FYI! filing this issue as promised for starting discussion.

darakian commented 1 year ago

Many thanks for starting the thread! If we want this to be a single unified field then I think a first step would be to define that field as typed with possible type values of whatever language is supported. Two reasons for that in my mind

Each language is going to have different rules ex. function overloading
Packages in the different ecosystems can contain more than one language ex. Tensorflow on pypi and commonmarker on rubygems

A second point to make is that ideally we should aim to capture more than functions. I can dig up an example if you like but I've come across a number of redos vulns which affect some variable which exists outside a function body and the fix tends to be altering that variable. That said, if you'd like to keep this issue scoped to functions only then I'm happy to do so 😄

darakian commented 1 year ago

To pop this back onto the stack would it make sense to start with a minimal, functional format where we root the affected functions/constants/code points at the entry to the package. eg. a package foo with a function bar would be mapped as foo.bar or foo::bar I think this works for go as well, but please correct me if not.

The separator syntax is something I'm unsure if we can normalize or not, but perhaps for v1 we accept whatever the "norm" is for the ecosystem is. For language I think we can either define a default language per ecosystem or create a small list which covers the default for each. I have a slight preference for the latter since we will eventually need to type the languages anyway.

so, the proposal for a functions v1 would introduce values that look like python foo.bar.biz.baz rust blah::fu::fun:faz where the language type would define the path separator. This doesn't account for language versions, but I don't know of any examples of a language version change causing a change in code paths so I think we're good on that front.

Known gaps

Does not account for anonymous functions
Does not account for function over loading

Thoughts?

oliverchang commented 1 year ago

Sorry for the delay on the response! Wanted to give a proper one.

I think for some ecosystems, e.g. Go, there are additional considerations.

e.g. with Go, the function/symbol needs to be qualified with the package within the module (https://go.dev/security/vuln/database), and there's additional qualifiers like goos, goarch.

I think we should use whatever is canonical for the ecosystem, rather than try to normalize separators and introduce potential confusion. The question here though, is how we can generalize this in a way that works for all, while giving enough flexibility to express additional ecosystem_specific details.

Looking at the current formats out there today, there's Go, GHSA's affected_functions, and Rust Advisory DB:

Go

    "ecosystem_specific": {
        "imports": [
          {
            "path": "cmd/link",
            "goos": [
              "js"
            ],
            "goarch": [
              "wasm"
            ],
            "symbols": [
              "Link.address"
            ]
          }
        ]
      }

      "ecosystem_specific": {
        "imports": [
          {
            "path": "golang.org/x/text/encoding/unicode",
            "symbols": [
              "bomOverride.Transform",
              "utf16Decoder.Transform"
            ]
          },
          {
            "path": "golang.org/x/text/transform",
            "symbols": [
              "String"
            ]
          }
        ]
      }

Rust

      "ecosystem_specific": {
        "affects": {
          "arch": [],
          "os": [],
          "functions": [
            "lzf::compress",
            "lzf::decompress"
          ]
        }
      },

GHSA

      "ecosystem_specific": {
        "affected_functions": [
          "tokio::task::JoinHandle::abort"
        ]
      },

As a first attempt to standardize this: perhaps we can define something like:

"scope": [ 
  {
    "symbol": string, # language specific function name
    "os": string # language specific OS qualifier  
    "arch": string # language specific arch 
    "additional_qualifiers": {
      <anything goes here, e.g. Go package path > 
    }
  }
]

But then, this ends up being essentially equivalent to ecosystem_specific but more awkward to use.

Alternatively should we just aim to unify the definitions of all the ecosystem_specific fields in one place in the OSV schema to provide this functionality?

darakian commented 1 year ago

I think we should use whatever is canonical for the ecosystem, rather than try to normalize separators and introduce potential confusion.

100% agree.

The question here though, is how we can generalize this in a way that works for all, while giving enough flexibility to express additional ecosystem_specific details.

Stepping back a moment. We don't necessarily need to generalize a for all path. Perhaps it's better to add language support on a case by case basis to ensure we have quality data and don't start generating bad data which will exist forever™️

With respect to the generalized format you're suggesting; how would os and/or arch specifiers interact with the broader advisory payload if the broader payload disagreed with either? Do we need to iterate all os versions and uarch types? The additional qualifiers field feels like a free form field and I doubt we'd see standardized usage of it unless we nail down what it's for.

For the generalized case my $0.02 is that if we head down that path we start small and add on as need be, so I'd suggest we trim your suggestion down to the symbol part. Indeed it won't capture os and uarch specifics, but uarch and os specific advisories tend to be pretty rare (for us at github at least). Fwiw I think the way we render rust functions is the same as how rustsec does

https://rustsec.org/advisories/RUSTSEC-2021-0072.html https://github.com/github/advisory-database/blob/8ef032ef781e0d854fe1041744b90a8b89e8e792/advisories/github-reviewed/2021/08/GHSA-2grh-hm3w-w7hv/GHSA-2grh-hm3w-w7hv.json#L97

Edit: Thinking on the above a little more. Maybe it does make sense to include the os and arch but to make them optional and to interpret them missing as the generic case where all possible options are affected. I think we still need to deal with enumerating them, but maybe we can start the lists small.

darakian commented 1 year ago

Also just so we have them here are a few references on what some of the languages think a fully qualified function is https://peps.python.org/pep-3155/ https://docs.oracle.com/javase/specs/jls/se11/html/jls-6.html#jls-6.7 https://www.php.net/manual/en/language.namespaces.rules.php https://doc.rust-lang.org/reference/paths.html#qualified-paths

oliverchang commented 1 year ago

Thanks @darakian ! If we minimize the definition to just "symbols", I wonder if that will cause confusion with other similar data included in ecosystem_specific.

Even if we extend this hypothetical new field to be more flexible with ecosystem-specific fields, it becomes pretty much the same as existing ecosystem_specific field. I wonder if the best approach here is to just fully lean on ecosystem_specific to define these, because inevitably any client tooling will have to be customized based on the language/ecosystem.

i.e. we encourage ecosystems to clearly define their ecosystem_specific to capture vulnerable symbols, OS, arch etc similar to how Go already does it in https://go.dev/security/vuln/database. Then the OSV schema would link to these canonical definitions.

Many language ecosystems (e.g. Rust, Python, Haskell, R) already have canonical advisory-database repositories so potentially they can just live there.

darakian commented 1 year ago

If we minimize the definition to just "symbols", I wonder if that will cause confusion with other similar data included in ecosystem_specific.

Certainly if both paths are supported for a given ecosystem at the same time I could see that causing confusion. Perhaps it makes sense to cut over on an OSV schema version bump.

I wonder if the best approach here is to just fully lean on ecosystem_specific to define these, because inevitably any client tooling will have to be customized based on the language/ecosystem.

That's how I feel for sure. I think there's a question about what to do if languages end up being shared across ecosystems (C springs to mind), but generally I tend to want to lean on the expertise of those already involved with the specific domain. For Go specifically I have no problems leaning on their definitions.

oliverchang commented 1 year ago

I believe there's also https://github.com/openjs-foundation/security-collab-space/issues/66 for the JS discussion.

ossf / osv-schema

Unified field for describing affected functions #127

Go

Rust

GHSA