Standardising on a format for describing affected functions.

oliverchang commented 1 year ago

Context: https://github.com/ossf/osv-schema/issues/127 Summary: Many ecosystems/databases are starting to encode more fine grained, "vulnerable symbol" information in advisories.

Go has defined their own ecosystem_specific format for describing affected functions: https://go.dev/security/vuln/database#schema. Most vulns in the Go vuln DB have this.

RustSec has this in their security advisories: https://github.com/rustsec/advisory-db/blob/ee74d43ec2c6203750e38a254f4029f6181c4362/crates/RUSTSEC-2021-0027.json#L30

GHSA also has their own: https://github.com/github/advisory-database/blob/046aafb21a71eadbe8bdd5a72b1491a5f1209fb7/advisories/github-reviewed/2023/07/GHSA-57fc-8q82-gfp3/GHSA-57fc-8q82-gfp3.json#L25

Given that ecosystems have their own nuances for describing these accurately (e.g. OS, arch specifiers), it makes more sense for ecosystems to each define their own ecosystem_specific field. We should define an ecosystem_specific for Python as well as part of this DB.

oliverchang commented 1 year ago

@di @sethmlarson thoughts?

sethmlarson commented 1 year ago

My concern is I'm not sure how we'd get that information automatically, it'd be a manual process for each advisory. Then we'd have to teach pip-audit how to parse ASTs to apply it? cc @woodruffw

oliverchang commented 1 year ago

My concern is I'm not sure how we'd get that information automatically, it'd be a manual process for each advisory. Then we'd have to teach pip-audit how to parse ASTs to apply it? cc @woodruffw

Yes, it'd be manual for now, but I can imagine automation helping out with various pieces here, based on identifying fix commits and extracting the list of changed functions as a rough starting point.

We've been building support for Go (via govulncheck), and Rust in osv-scanner: https://github.com/google/osv-scanner/issues/476, so there is some precendent there in other ecosystems.

woodruffw commented 1 year ago

We should define an ecosystem_specific for Python as well as part of this DB.

Yes!

In particular: in addition to being able to relay vulnerable symbols/functions, one thing that would be valuable for Python specifically is being able to signal the kind of distribution that the vulnerability is relevant for.

This has come up a few times in the past, particularly with PyCA Cryptography: they've released security advisories that pertain only to wheel distributions (since those distributions include a static build of OpenSSL), and it'd be great if non-wheel users weren't unnecessarily included in those advisories.

CC @alex (I could have sworn he opened an issue for this somewhere before, but I can't immediately find it)

alex commented 1 year ago

I don't remember if I filed a bug, or just groused a bunch :-)

Tracking affected functions makes sense to me, though making use of that information may be more challenging than in a statically typed language.

woodruffw commented 1 year ago

Tracking affected functions makes sense to me, though making use of that information may be more challenging than in a statically typed language.

To flag a few (not exhaustive) examples of this:

Python modules and packages can re-export from other modules and packages (including third-party ones)
Decorators do "interesting" things to function and class names if they're done manually rather than via functools (specifically, they result in functions and classes having unexpected <locals> qualnames due to being defined dynamically within a function or other callable)

I wouldn't expect a standard format here to solve either of these perfectly (they're fundamentally too dynamic), but both are worth making considerations around. In particular, for re-exports, it's probably worth nailing down whether all re-exports (potentially many!) will be listed for a vulnerable package, or whether the single original site of export gets listed (which itself might be funky, since it could be a "private"-to-public re-export).

itamarsher commented 1 year ago

Hi everyone, my name is Itamar Sher and this is my first time commenting here. So hello 👋 We’re identifying the affected function for vulnerabilities we’re patching in our public vulnerability patches database ( https://github.com/seal-community/patches). I planned to go via the regular contribution channel but since I noticed this discussion I thought it might be a good opportunity to collaborate on how we can automatically share/feed this information back into the advisory without encumbering the maintainers.

darakian commented 1 year ago

Python has a spec for a fully qualified name (from Python 3.3) https://peps.python.org/pep-3155/ I haven't dug too deep into it but it appears to be a walk down the AST for any given class. eg.

>>> class C:
...   def f(): pass
...   class D:
...     def g(): pass
...
>>> C.__qualname__
'C'
>>> C.f.__qualname__
'C.f'
>>> C.D.__qualname__
'C.D'
>>> C.D.g.__qualname__
'C.D.g'

The one thing I would suggest be added to that form is that packages be rooted with their pep 503 normalized name https://peps.python.org/pep-0503/#normalized-names for the package. eg. packages foo and bar each with a top level function (or class) parse would have that function (or class) referenced as foo.parse and bar.parse

woodruffw commented 1 year ago

Yeah, IMO the full __qualname__ is a reasonable starting point.

The one thing I would suggest be added to that form is that packages be rooted with their pep 503 normalized name https://peps.python.org/pep-0503/#normalized-names for the package

I might be misunderstanding, but I don't think this will be necessary: the OSV finding itself will contain the package name, so including it in the path will be redundant (as well as confusing, since the package name isn't guaranteed to be a valid Python identifier or the same as the top-level module name).

This also leaves open the re-export and decorator/locals issues, but I think a 99% solution is probably better to get started with 🙂

darakian commented 1 year ago

I might be misunderstanding, but I don't think this will be necessary: the OSV finding itself will contain the package name, so including it in the path will be redundant

I think we do need a root of some sort to capture the case where the package is a single object/function. Eg. Some package foo where a user would use foo as

import foo

foo(some input)

package name isn't guaranteed to be a valid Python identifier or the same as the top-level module name

Can you expand on this? I think maybe I've confused package name with top level identifier. I was under the impression that the 503 package name was the top level identifier, but perhaps that's just a commonality and not a rule. If I'm in error there then yes maybe the top level identifier is the correct approach. Is there a pep for this?

woodruffw commented 1 year ago

Can you expand on this? I think maybe I've confused package name with top level identifier. I was under the impression that the 503 package name was the top level identifier, but perhaps that's just a commonality and not a rule. If I'm in error there then yes maybe the top level identifier is the correct approach. Is there a pep for this?

Correct: it's just a convention, not a rule. The two technically don't have to be similar at all, and the normalized package name still isn't guaranteed to be a valid Python identifier.

A good canonical example of this is PIL and Pillow -- Pillow is a maintained fork of PIL, so it still uses import PIL as its top-level module despite being installed via pip install Pillow.

As far as I know, there's no actual PEP stating this anywhere, it's just a quirk of how Python packaging works 🙂

As a result of this, a __qualname__ like PIL.foo.bar.baz may ambiguously refer to either the PIL package or the Pillow one, but the surrounding OSV content (including the package name) provides the disambiguation for us. So in theory just the __qualname__ should be sufficient, since it's understood it's coming from a specific package.

darakian commented 1 year ago

Ah yes, pillow. Good example 👍

I guess the question then is; when reading code how does one determine what the top level module is? I assume that's discoverable and that there's only one top level module.

So in theory just the qualname should be sufficient, since it's understood it's coming from a specific package.

Just to clarify; __qualname__ should include the top level module correct?

woodruffw commented 1 year ago

I guess the question then is; when reading code how does one determine what the top level module is? I assume that's discoverable and that there's only one top level module.

Yeah, the discovery process for matching a vulnerability to an environment (or source code) would roughly be:

Confirm that the top-level package in the OSV finding is present
Iterate through all statically discoverable imports in the target, and normalize them (handling import aliases, from ... import ..., etc.)
Compare each fully normalized import's members against all fully qualified paths in the OSV finding

Just to clarify; qualname should include the top level module correct?

Hmm, I thought it did, but from a quick look __qualname__ only shows the top-level name for a scope, rather than the full module resolution path. Using a random example from my host:

>>> from pip_audit import _cli
>>> _cli.AuditOptions.__qualname__
'AuditOptions'

So we'd need to go a little beyond __qualname__, and also probably use __module__ and similar to build up the full path:

>>> _cli.AuditOptions.__module__
'pip_audit._audit'

(this has the nice effect of also canonicalizing the fully qualified name, since AuditOptions isn't actually defined in _cli, just imported into it).

darakian commented 1 year ago

So would this be a "simple" concatenation of __module__ and __qualname__ or a more complex operation? I'm not following where _audit came from in your example and I'm quite (completely) ignorant of the code inside that pip_audit module.

woodruffw commented 1 year ago

So would this be a "simple" concatenation of module and qualname or a more complex operation?

Yep, it should be "just" that 99% of the time (famous last words). It gets a little bit more complicated when the target is a module or package itself (perhaps it should never be, as a matter of policy?) but even in those cases it should be just different concatenations of __name__, etc.

Sorry for the confusion with my example 😅 -- _audit is another module under pip_audit, so I was trying to demonstrate that a member imported from one module (pip_audit._cli) can actually have a fully qualified name under another module (pip_audit._audit). That was mostly to demonstrate the complexity of resolution, e.g. a user might do this in their code:

from pip_audit import _cli

_cli.AuditOptions(...)

and we would want to catch that, despite the "canonical" path for AuditOptions being pip_audit._audit.AuditOptions not pip_audit._cli.AuditOptions.

darakian commented 1 year ago

Yep, it should be "just" that 99% of the time (famous last words).

So foo.__module__ + foo.__qualname__? Also, I did a quick look for the source of these two functions and couldn't get at them quickly. I'll try to dig them up and link them here when I get some more time to dig. I don't really wanna be in a world where the spec is basically whatever these two functions spit out

even in those cases it should be just different concatenations of name, etc.

We should iterate out those cases 😄

On your example; I guess AuditOptions can be pulled in from _cli in practice then it's available as a re-export or something? eg. _cli pulls it in for whatever reason and a user could get access through that module? I agree that we should stay focused on a canonical path and maybe this example brings up a question of how to capture aliases for a path, but maybe that's a question to resolve in v2.

So, then my question on AuditOptions is; what makes pip_audit._audit.AuditOptions the canonical path and how would I make that determination for a different project?

woodruffw commented 1 year ago

So foo.module + foo.qualname?

Yep, 99% of the time. There are some items in Python that don't have a __module__ or __qualname__ (like module objects themselves); these need to use __name__ instead (and maybe some other things that I'm forgetting?).

It'd be good to check what Python actually says about those attributes, but it's also worth noting that they're fundamentally dynamic anyways: a misbehaving class can do __module__ shenanigans, and can rewrite its own __qualname__. There's very little Python will probably actually guarantee about these 🙂

On your example; I guess AuditOptions can be pulled in from _cli in practice then it's available as a re-export or something? eg. _cli pulls it in for whatever reason and a user could get access through that module?

Yep, exactly. In practice, Python's lack of module visibility or explicit exporting means that any internal use of a package by itself results in re-exports that users can accidentally depend on.

So, then my question on AuditOptions is; what makes pip_audit._audit.AuditOptions the canonical path and how would I make that determination for a different project?

This is, unfortunately, a good question with no good answer 🙂

As far as I know, Python's module system has no actual definition of "canonical" for a particular resolution path. As such, a tool/ecosystem like OSV will need to make its own determinations about how best to canonicalize paths. A few options:

Always favor the site of definition. In other words: whichever module class Foo appears in is the canonical module, even if Foo gets re-exported elsewhere.
- Problems: Lots of packages use the "re-export to public" pattern for public APIs, meaning that we'll end up "canonicalizing" paths that no user should ever actually import (e.g. somepkg._internal.Foo rather than somepkg.public.Foo).
Follow first-party re-exports and __all__.
- Problems: Can cause duplicates; unclear how to tiebreak between them. Also unclear what to do about wildcards.
Don't bother canonicalizing; list every definition path and first-party re-export.
- Problems: Extremely noisy; will include lots of paths that are "valid" import paths but that no user will ever actually import.

(There are also further problems, like packages that have canonical paths for public APIs but intentionally use __getattr__ to force dynamic lookups as part of their deprecation process. Pydantic is a good example of this.)

darakian commented 1 year ago

It'd be good to check what Python actually says about those attributes, but it's also worth noting that they're fundamentally dynamic anyways: a misbehaving class can do __module__ shenanigans, and can rewrite its own __qualname__. There's very little Python will probably actually guarantee about these 🙂

Oh for sure. I'm not worried about modules that are intentionally obtuse for a v1 spec. 99% will get us a lot of value for most users and we can tackle the infinite regression of edge cases for v2 (or 3 maybe 😄)

But yes, lets formalize __module__ + __qualname__ as more than just the output of the functions and if that's straight forward we can probably start moving on that 🤞

This is, unfortunately, a good question with no good answer 🙂

:(

As such, a tool/~ecosystem~ format like OSV will need to make its own determinations about how best to canonicalize paths.

I guess for the moment we could also say that it's a data providers obligation to do a best effort canonicalization. There will be aliases in practice of course, but maybe that's an acceptable choice. I tend to prefer to match whatever a user of a package would use so that scanning tools can be more easily maintained/made to be lower noise so if there's a re-export pattern in some library but also the developer documentation advises users to use that re-export then we prefer that export. That's human guidance I know.

woodruffw commented 1 year ago

But yes, lets formalize module + qualname as more than just the output of the functions and if that's straight forward we can probably start moving on that 🤞

Sounds good! As a starting point: we can say that these names take the form:

module1.module2.module3:attribute

...where the LHS of the : is an absolute importable module path, i.e. a value that could be passed into importlib.import_module(...) without needing a relative package reference to resolve against.

The RHS is a gettable attribute within the importable module, meaning that getattr(module3, attribute) is expected to not raise AttributeError (although this isn't guaranteed).

The reason I suggest : as the attribute delimiter is for consistency with other bits of the ecosystem: it's what setuptools and other packaging tooling use for entry_points syntax, and it also shows up in Pyramid and a few other frameworks. So there's precedent for it, and it maintains separation between the "importable component" and "gettable component."

When an entire module is meant to be flagged in a finding, we can probably just drop the :attribute, e.g.:

module1.module2.module3

...which follows the same importlib.import_module(...) rules.

I guess for the moment we could also say that it's a data providers obligation to do a best effort canonicalization. There will be aliases in practice of course, but maybe that's an acceptable choice. I tend to prefer to match whatever a user of a package would use so that scanning tools can be more easily maintained/made to be lower noise so if there's a re-export pattern in some library but also the developer documentation advises users to use that re-export then we prefer that export.

That seems very reasonable to me. There isn't a perfect answer here, so I think anything (like this) that gets us to the 99% case is perfectly suitable 🙂

darakian commented 1 year ago

Any chance you can give some examples of the : usage? Not sure I follow on what you're suggesting there.

woodruffw commented 1 year ago

Any chance you can give some examples of the : usage? Not sure I follow on what you're suggesting there.

Sure: using the example above:

pip_audit._audit.AuditOptions

would become:

pip_audit._audit:AuditOptions

On the other hand, a report for an entire module would remain the same:

foo.bar.baz

That's really the only change; the main reasons I suggest it are because it's what other tools in the Python packaging ecosystem do, and because it disambiguates different parts of the path well.

darakian commented 1 year ago

Assuming we're going off https://github.com/pypa/pip-audit/blob/main/pip_audit/_audit.py for your example AuditOptions is a class. I guess I'm not following why that would change. One could still import pip_audit._audit.AuditOptions correct?

woodruffw commented 1 year ago

One could still import pip_audit._audit.AuditOptions correct?

Not through the importlib machinery:

>>> import importlib
>>> importlib.import_module("pip_audit._audit.AuditOptions")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/william/.pyenv/versions/3.11.2/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1137, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'pip_audit._audit.AuditOptions'; 'pip_audit._audit' is not a package
>>> importlib.import_module("pip_audit._audit")
<module 'pip_audit._audit' from '/Users/william/devel/pip-audit/pip_audit/_audit.py'>

Similarly with the import keyword:

>>> import pip_audit._audit.AuditOptions
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pip_audit._audit.AuditOptions'; 'pip_audit._audit' is not a package

That's why the disambiguation with : is nice: it makes it clear which parts can actually be imported (which a triaging tool might need to do) versus which parts are attributes.

darakian commented 1 year ago

Ah ok, so the : wouldn't necessarily be a path indicator, but an indicator of how to interact with the object. I'm wondering how I would validate that and I'm not sure I can commit to supporting it reliably. Any thoughts on that end?

woodruffw commented 1 year ago

Ah ok, so the : wouldn't necessarily be a path indicator, but an indicator of how to interact with the object.

Right: it's an indicator that the thing on the RHS isn't a part of the path, but is rather a class, object, function, etc. (i.e. the things that are the actual vulnerable API surface, rather than the path to the vulnerable API surface).

I'm wondering how I would validate that and I'm not sure I can commit to supporting it reliably. Any thoughts on that end?

On one level, validating this is the same as it would be with the foo.bar.Baz syntax -- you just check to see if the input matches \w[\w\d]*(\.\w[\w\d])*(:\w[\w\d]*)? (very roughy).

If you're asking how you'd check that the LHS is a valid path and the RHS is a valid attribute: you can either do it dynamically with importlib + getattr or get 95% of the way there statically. But neither is 100% guaranteed to resolve all valid attributes under valid paths.

(This is true for the foo.bar.Baz syntax as well -- there's no generalized way to validate that a particular Python module/attribute path is "real" for all users. So I don't think this syntax has any advantages in that respect; the main advantages are that the Python community already uses it elsewhere and it's slightly less ambiguous about the RHS.)

darakian commented 1 year ago

If you're asking how you'd check that the LHS is a valid path and the RHS is a valid attribute:

This. Intuitively it seems easier to discover valid foo.bar.Baz paths than to know if the trailing element is an attribute or not. Knowing if something is an attribute or not seems like it would require much more familiarity with any given code base. I do agree it's less ambiguous, but if we can't reliably provide that data would foo.bar.Baz be an acceptable fallback?

woodruffw commented 1 year ago

Intuitively it seems easier to discover valid foo.bar.Baz paths than to know if the trailing element is an attribute or not.

Could you elaborate a bit more on what you mean by "discover valid"? In a strict sense, foo.bar.Baz and foo.bar:Baz are exactly as easy to discover and exactly as hard to validate; the benefit of the : is solely to disambiguate the actual API being identified versus a path to it (and because it's what the rest of the packaging ecosystem uses, as mentioned before).

For example, foo.bar.Baz could refer to any of the following things:

Constant Baz through foo.bar
Class Baz through foo.bar
Module Baz under foo.bar
Any (or multiple) of the above, but impossible to statically derive
Any of the above, as a re-export

And the same is true for foo.bar:Baz.

I think the TL;DR of my position is that Python's dynamic nature makes it almost impossible to actually "validate" anything here, whether you use : or . as the final delimiter -- the value of : for the attribute delimiter doesn't come from additional validation power, but instead from consistency with other Python tools and the intent it communicates 🙂.

darakian commented 1 year ago

Could you elaborate a bit more on what you mean by "discover valid"?

Sure. Part of my job is in populating this data and discovery of paths like foo.bar.Baz is its own complexity, but is something I've been able to do with reasonable consistency. The : qualifier would make the data more useful for sure, but I am not confident I would be able to support that reliably. So, I don't want to be in a position where I'm generating bad data for the rest of the world.

Now, if the : is optional it could certainly be the case that I primarily output foo.bar.Baz and users come and correct me over here. No doubt they'll come correct me for the dot separated strings too, but I guess I'm concerned about the expectation as I expect to fail if : is expected 😅

woodruffw commented 1 year ago

Thanks for explaining!

I understand your position, and I'm also very sympathetic to not adding more work to an already manual process 🙂. I think I might have given you the wrong impression about how difficult this is, by talking about pathological cases: in practice, 99.9% of paths that you discover as foo.bar.Baz can be automatically rewritten as foo.bar:Baz, even when Baz is something like a module itself.

That being said, I recognize that even that may be a dealbreaker for you. Ultimately any amount of path/API information will be beneficial here, so I'll stop hammering on : and just say that this is all good and I'm looking forward to any variant of it coming into existence 🙂

darakian commented 1 year ago

Alright, if you're comfortable with it I'd prefer to keep the spec location based (and slightly easier 😅) for v1 with just a dot separated path to describe the location of a vulnerable object.

I still need to get some time to go find the source for __module__ and __qualname__, but I think we're basically in agreement on what to do 🎉

darakian commented 1 year ago

Ok, so I may have 🎉 'd a bit early 😞

First off for __module__ and __qualname__ I can't actually find hard definitions. Best as I can tell both are PyObject types from the C codebase, but that's not super useful. Both funcobject.h and methodobject.h mention that /* The __module__ attribute, can be anything */ https://github.com/python/cpython/blob/main/Include/cpython/funcobject.h#L42 https://github.com/python/cpython/blob/main/Include/cpython/methodobject.h#L11

That said, In doing some testing to validate what we've been talking about and to ensure we have some good examples I came across a django case of (I think) a re-export which will likely matter for our use case. Using django 4.2.5 for the test.

(vea-testing-venv) jon~/V/vea-testing-venv❯❯❯ python
Python 3.11.6 (main, Oct  3 2023, 02:51:45) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from django.db.models import JSONField
>>> JSONField.__module__
'django.db.models.fields.json'
>>> JSONField.__qualname__
'JSONField'

Which seems to be down to this line in the django/blob/main/django/db/models/__init__.py file https://github.com/django/django/blob/main/django/db/models/__init__.py#L97 A number of other fields are exported in the same way and while the module + qualname does seem like a more canonical representation I'm left wondering how to bridge the gap so that a tool scanning code can connect that to django.db.models import JSONField / django.db.models.JSONField

Sorry for the delay on getting into the weeds here but any thoughts?

woodruffw commented 1 year ago

First off for __module__ and __qualname__ I can't actually find hard definitions. Best as I can tell both are PyObject types from the C codebase, but that's not super useful. Both funcobject.h and methodobject.h mention that /* The __module__ attribute, can be anything */

Yep, this is what I mentioned in https://github.com/pypa/advisory-database/issues/149#issuecomment-1741167885 -- Python will not make any guarantees about the values of these attributes, because (for CPython) they're just PyObjects that can point to anything (including their containing object, leading to reference cycles!)

Sorry for the delay on getting into the weeds here but any thoughts?

Unfortunately, I don't know of a great generalization for handling these kinds of canonical paths 😞 -- there's no particular guarantee that a given attribute has only one "canonical" path, much less that multiple modules don't use the __all__ mechanism for their re-exports.

I know I sound like a broken record at this point, but my recommendation here would be to sidestep the entire "what is a valid __module__/__qualname__" problem and think about this entirely in terms of things that Python does strongly define, like importlib.import_module's semantics. That won't get you the "canonical" path(s) for a given attribute, but it will avoid the shenanigans that occur, in practice, with __module__ and __qualname__.

For getting a potential (non-exhaustive) set of "canonical" paths, one possibility would be to use pkgutil.walk_packages to walk each module, look for an __all__ attribute, and match re-exports from there. But this has significant downsides: you'll need to run arbitrary module code (since it's all dynamic Python), and it's not guaranteed to work at all (since __all__ isn't required for re-exports, and in fact isn't commonly used by libraries that don't expect to be wildcard imported).

TL;DR: Whether you use foo.bar.Baz or foo.bar:Baz, I'd recommend treating the foo.bar part as an abstract "path" that obeys according to importlib.import_module's semantics, since that's (essentially) the only thing that Python will guarantee here. In terms of determining whether foo.bar.Baz is "canonical" in some sense, you can check whether foo.bar.__all__ contains "Baz" (but note the significant constraints above).

darakian commented 1 year ago

Yep, I think you're right and I just got wrapped up in my head 🤦

Ok, so we ignore the idea of canonical paths for the time being and focus instead on import path syntax. I guess first question; is there a name for that?

Going back to the django example; we would then have django.db.models.JSONField as derived from from django.db.models import JSONField which could also be represented as django.db.models.fields.json.JSONFieldderived from from django.db.models.fields.json import JSONField

and the two would be independent and valid paths?

woodruffw commented 1 year ago

Ok, so we ignore the idea of canonical paths for the time being and focus instead on import path syntax. I guess first question; is there a name for that?

I'm not aware of an official name for this; the importlib documentation sometimes calls it a "full path" or similar. I think you can just call it a "path" or "location" or similar.

There may be some additional guarantees documented under __spec__ and ModuleSpec (https://docs.python.org/3/library/importlib.html#importlib.machinery.ModuleSpec), but I don't think they'll directly help here.

and the two would be independent and valid paths?

Yep 🙂

darakian commented 1 year ago

I'm not aware of an official name for this

Hmmmm. Feature request there I guess :) Digging around the importlib code it looks like the there haven't been any big changes in over a decade so, probably pretty stable. I guess for now we can call them import paths or something.

Yep 🙂

Cool 👍

darakian commented 1 year ago

@woodruffw sorry for dropping off on my end. To keep this going, I think we have importlib paths rooted at the top level module of the form django.db.models.JSONField which express a path to an object in a given pypi package (eg. django). Multiple paths may point to the same object but all paths must be valid import paths. The paths may point to any object (class, function, constant, etc...) The collection of all paths for a given package could be considered the API for the package.

If all that makes sense and is agreeable I think the next question is; where does this live? Is this something to make a PR for in this repo or is there a better place?

woodruffw commented 1 year ago

No problem!

I think we have importlib paths rooted at the top level module of the form django.db.models.JSONField which express a path to an object in a given pypi package (eg. django).

To clarify:

django here is the top-level module, not the top level PyPI package. They often share the same name, but there's no particular guarantee that they do (cf. earlier discussion about PIL and Pillow. Python's importlib machinery has no concept of a PyPI package name 🙂
A path like django.db.models.JSONField isn't technically an importlib path -- the django.db.models part is that path, and JSONField is an arbitrary attribute that may be resolvable at the end of the path. If you pass django.db.models.JSONField into importlib, you'll get an error.

Sorry again if this comes across as pedantic, but I want to make sure we're using the same terms here. The rest makes sense to me!

where does this live? Is this something to make a PR for in this repo or is there a better place?

My first thought would be here, or in one of the OSV repositories (maybe they already document similar syntaxes for other ecosystems? CC @oliverchang). @sethmlarson may also have opinions 🙂

woodruffw commented 1 year ago

The collection of all paths for a given package could be considered the API for the package.

In terms of expressing this, I had another thought: rather than creating these pseudo-paths like django.db.models.JSONField, it might be better to devolve this at the format level. In pseudo-JSON:

{
  attribute: "JSONField",
  paths: ["django.db.models", "django.something.else"]
}

This is slightly more verbose upfront, but comes with a few benefits:

The increased size of this representation is amortized over time, versus having to duplicate django.db.models.JSONField, django.something.else.JSONField, etc.
The distinction between the path and the API located at the path is made clear.

sethmlarson commented 1 year ago

The information for affectedness @woodruffw works for me. Is there a reason you're using path as the key instead of module? For the format, Go appears to have the two swapped having multiple potential attributes per module which if we're only encoding the initial definition point having multiple attributes per module seems favorable? ie:

{
  "ecosystem_specific": {
    "affects": [
      {
        "attrs": ["JSONField"],
        "module": "django.db.models"
      }
    ]
  }
}

(example assumes that JSONField is imported into django.something.else and its up to tools to figure that out)

Also, are we expecting to fully enumerate all places a name is re-exported or is that the job of tooling to walk an import ast and determine affectedness (tools will need to do this anyways in order to determine affectedness from source files, so this doesn't feel like a huge additional burden for an easier experience encoding advisories?)

In terms of where do these choices live, we can have that live in a CONTRIBUTING.md file on this repository? Other OSV databases might have better ideas? The PSF Advisory Database will use whatever exact format we decide on too.

woodruffw commented 1 year ago

Is there a reason you're using path as the key instead of module?

Nope, module makes way more sense (and is consistent with what I've been yakking about).

Also, are we expecting to fully enumerate all places a name is re-exported or is that the job of tooling to walk an import ast and determine affectedness (tools will need to do this anyways in order to determine affectedness from source files, so this doesn't feel like a huge additional burden for an easier experience encoding advisories?)

My understanding is the latter, with the idea that the format can be a "best effort" source of initial information (e.g. if we know that a project has a few really popular re-exports, it probably makes sense to list them rather than having end user tooling have to do that lifting.)

sethmlarson commented 1 year ago

I'm ++ on best effort too, doesn't /have/ to be the original definition location if the API is defined a certain way (like requests.request() actually being defined elsewhere).

darakian commented 1 year ago

django here is the top-level module, not the top level PyPI package. They often share the same name, but there's no particular guarantee that they do (cf. earlier discussion about PIL and Pillow. Python's importlib machinery has no concept of a PyPI package name

Gah, sorry you're correct. What I meant by the package part is that the osv path data will be subordinate to an affected product which itself will be a pypi package. That's me conflating things though.

A path like django.db.models.JSONField isn't technically an importlib path -- the django.db.models part is that path, and JSONField is an arbitrary attribute that may be resolvable at the end of the path. If you pass

I do like the idea of attributes, but again I really question my ability to provide those with high fidelity. 😞 Perhaps that's something I simply have to deal with though....

re:

{
  attribute: "JSONField",
  modules: ["django.db.models", "django.something.else"]
}

I do like that as an approach to method to de-dupe paths to an affected code point. I wanna double check that the statement The collection of all paths for a given package could be considered the API for the package. still works for you though. Using the example before this would deconstruct to django.db.models.JSONField && django.something.else.JSONField and the collection of all possible paths to objects could be considered an API for the ~package~ top level module.

oliverchang commented 1 year ago

No problem!

I think we have importlib paths rooted at the top level module of the form django.db.models.JSONField which express a path to an object in a given pypi package (eg. django).

To clarify:

django here is the top-level module, not the top level PyPI package. They often share the same name, but there's no particular guarantee that they do (cf. earlier discussion about PIL and Pillow. Python's importlib machinery has no concept of a PyPI package name 🙂

A path like django.db.models.JSONField isn't technically an importlib path -- the django.db.models part is that path, and JSONField is an arbitrary attribute that may be resolvable at the end of the path. If you pass django.db.models.JSONField into importlib, you'll get an error.

Sorry again if this comes across as pedantic, but I want to make sure we're using the same terms here. The rest makes sense to me!

where does this live? Is this something to make a PR for in this repo or is there a better place?

My first thought would be here, or in one of the OSV repositories (maybe they already document similar syntaxes for other ecosystems? CC @oliverchang). @sethmlarson may also have opinions 🙂

I think it would make sense for this to belong in this DB, and the OSV schema can link to here as the authoritative source of this. e.g. the OSV schema links to https://go.dev/security/vuln/database for the Go-specific bits (including how they identify vulnerable symbols).

woodruffw commented 1 year ago

I do like the idea of attributes, but again I really question my ability to provide those with high fidelity. 😞

Sorry if you've already explained this upthread, but I think this should really just be a mechanical transform on your end. In other words, if you have the ability to present identifiers of the form foo.bar.Baz, then presenting the "module and attribute" form should be no more complicated than:

mod_path, attr_name = "foo.bar.Baz".rsplit(".", 1)
json.dumps({"module": [mod_path], "attribute": attr_name})

In other words: presenting it in this form doesn't imply anything different about the data quality; it's just a way to de-dupe.

I wanna double check that the statement The collection of all paths for a given package could be considered the API for the package. still works for you though. Using the example before this would deconstruct to django.db.models.JSONField && django.something.else.JSONField and the collection of all possible paths to objects could be considered an API for the package top level module.

Yep, that sounds good (and correct) to me!

darakian commented 1 year ago

Sorry if you've already explained this upthread, but I think this should really just be a mechanical transform on your end. In other words, if you have the ability to present identifiers of the form foo.bar.Baz, then presenting the "module and attribute" form should be no more complicated than: ...

It's mostly my own apprehension and fear honestly :) I'm willing to trust though and sign on with the idea that we should annotate attributes on modules 👍

That said I will pester you with a few questions too 😉 So, for the fix commit https://github.com/python-pillow/Pillow/commit/1fe1bb49c452b0318cad12ea9d97c3bef188e9a7 A max length value is added to PIL.ImageFont

>>> import PIL
>>> PIL.ImageFont
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'PIL' has no attribute 'ImageFont'
>>> from PIL import ImageFont
>>> ImageFont
<module 'PIL.ImageFont' from '/Users/jon/venv/lib/python3.11/site-packages/PIL/ImageFont.py'>
>>>

I think PIL.ImageFont is the module in question with MAX_STRING_LENGTH as a new attribute However I would say in english that PIL.ImageFont is the affected component and with the format described above I would store that as

{
  attribute: "",
  modules: ["PIL.ImageFont"]
}

Where I think the lack of attributes can be read as a lack of refinement on the look up. eg. if no attribute is listed then assume any/all. Deeper analysis could refine that further, but I want to be sure that this read of the information is correct since a lot of the discovery work is so manual and hence time constrained.

woodruffw commented 1 year ago

It's mostly my own apprehension and fear honestly :) I'm willing to trust though and sign on with the idea that we should annotate attributes on modules 👍

Very understandable! I think the best way to think about this is as a purely mechanical operation, so there should be no additional context or complexity on the producer side.

I think PIL.ImageFont is the module in question with MAX_STRING_LENGTH as a new attribute However I would say in english that PIL.ImageFont is the affected component

In that case, I would express that as:

{
  attribute: "ImageFont",
  modules: ["PIL"]
}

...since a module (like ImageFont) is an attribute. That should hopefully emphasize what I mean by "purely mechanical": any path like foo.bar.Baz can be transformed into this representation without any additional context or loss of generality/information 🙂

darakian commented 1 year ago

That does help quite a bit. I missed that modules are attributes. That clears up a lot on my end and I agree with the OSV style representation too. Do we want to say also that PIL:ImageFont would be an equivalent representation? Also, since I think we're basically aligned I'd love to get a PR going. Is there a specific file I should open a PR against? CONTRIBUTING.md as suggested above?

woodruffw commented 1 year ago

Do we want to say also that PIL:ImageFont would be an equivalent representation?

Yep! That was what I was sort of going for with the original path:attribute thing, but I'm glad we actually settled on this "many modules, one attribute" JSON object layout instead 🙂

Is there a specific file I should open a PR against? CONTRIBUTING.md as suggested above?

That makes sense to me -- maybe @sethmlarson has opinions here as well.

darakian commented 1 year ago

Yep! That was what I was sort of going for with the original path:attribute thing, but I'm glad we actually settled on this "many modules, one attribute" JSON object layout instead 🙂

Just to be 100% clear I'm suggesting both where the many modules, one attribute would essentially be a short form. I can just see the single line form being useful in a lot of more conversational contexts as well as possible in some tooling settings.

woodruffw commented 1 year ago

Just to be 100% clear I'm suggesting both where the many modules, one attribute would essentially be a short form. I can just see the single line form being useful in a lot of more conversational contexts as well as possible in some tooling settings.

Gotcha, makes sense to me!

pypa / advisory-database

Standardising on a format for describing affected functions. #149