oliverchang opened 1 year ago
@di @sethmlarson thoughts?
My concern is I'm not sure how we'd get that information automatically, it'd be a manual process for each advisory. Then we'd have to teach pip-audit how to parse ASTs to apply it? cc @woodruffw
My concern is I'm not sure how we'd get that information automatically, it'd be a manual process for each advisory. Then we'd have to teach pip-audit how to parse ASTs to apply it? cc @woodruffw
Yes, it'd be manual for now, but I can imagine automation helping out with various pieces here, based on identifying fix commits and extracting the list of changed functions as a rough starting point.
We've been building support for Go (via govulncheck) and Rust in osv-scanner: https://github.com/google/osv-scanner/issues/476, so there is some precedent there in other ecosystems.
We should define an ecosystem_specific for Python as well as part of this DB.
Yes!
In particular: in addition to being able to relay vulnerable symbols/functions, one thing that would be valuable for Python specifically is being able to signal the kind of distribution that the vulnerability is relevant for.
This has come up a few times in the past, particularly with PyCA Cryptography: they've released security advisories that pertain only to wheel distributions (since those distributions include a static build of OpenSSL), and it'd be great if non-wheel users weren't unnecessarily included in those advisories.
CC @alex (I could have sworn he opened an issue for this somewhere before, but I can't immediately find it)
I don't remember if I filed a bug, or just groused a bunch :-)
Tracking affected functions makes sense to me, though making use of that information may be more challenging than in a statically typed language.
Tracking affected functions makes sense to me, though making use of that information may be more challenging than in a statically typed language.
To flag a few (not exhaustive) examples of this:

- Re-exports: a symbol can be importable from many modules besides the one that defines it.
- Decorators and factory functions (e.g. via `functools`): specifically, they result in functions and classes having unexpected `<locals>` qualnames due to being defined dynamically within a function or other callable.

I wouldn't expect a standard format here to solve either of these perfectly (they're fundamentally too dynamic), but both are worth making considerations around. In particular, for re-exports, it's probably worth nailing down whether all re-exports (potentially many!) will be listed for a vulnerable package, or whether the single original site of export gets listed (which itself might be funky, since it could be a "private"-to-public re-export).
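For instance, the `<locals>` case is easy to reproduce: a function defined inside another callable gets a qualified name that no static import path can reach (`make_handler` is a hypothetical name for illustration):

```python
def make_handler():
    # defined dynamically inside another callable, so its qualified name
    # contains a "<locals>" component that no import path can refer to
    def handler():
        return 42
    return handler

print(make_handler().__qualname__)  # make_handler.<locals>.handler
```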
Hi everyone, my name is Itamar Sher and this is my first time commenting here. So hello 👋 We're identifying the affected function for vulnerabilities we're patching in our public vulnerability patches database (https://github.com/seal-community/patches). I planned to go via the regular contribution channel, but since I noticed this discussion I thought it might be a good opportunity to collaborate on how we can automatically share/feed this information back into the advisory without encumbering the maintainers.
Python has a spec for a fully qualified name (from Python 3.3): https://peps.python.org/pep-3155/. I haven't dug too deep into it, but it appears to be a walk down the AST for any given class, e.g.
>>> class C:
... def f(): pass
... class D:
... def g(): pass
...
>>> C.__qualname__
'C'
>>> C.f.__qualname__
'C.f'
>>> C.D.__qualname__
'C.D'
>>> C.D.g.__qualname__
'C.D.g'
The one thing I would suggest be added to that form is that packages be rooted with their PEP 503 normalized name (https://peps.python.org/pep-0503/#normalized-names) for the package.

E.g. packages `foo` and `bar`, each with a top-level function (or class) `parse`, would have that function (or class) referenced as `foo.parse` and `bar.parse` respectively.
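For reference, PEP 503 defines that normalization as a one-liner; a sketch:

```python
import re

def normalize(name: str) -> str:
    """PEP 503 normalization: collapse runs of '-', '_', '.' into a single
    '-' and lowercase the result."""
    return re.sub(r"[-_.]+", "-", name).lower()

print(normalize("Foo.Bar_baz"))  # foo-bar-baz
```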
Yeah, IMO the full `__qualname__` is a reasonable starting point.
The one thing I would suggest be added to that form is that packages be rooted with their pep 503 normalized name https://peps.python.org/pep-0503/#normalized-names for the package
I might be misunderstanding, but I don't think this will be necessary: the OSV finding itself will contain the package name, so including it in the path will be redundant (as well as confusing, since the package name isn't guaranteed to be a valid Python identifier or the same as the top-level module name).
This also leaves open the re-export and decorator/locals issues, but I think a 99% solution is probably better to get started with 🙂
I might be misunderstanding, but I don't think this will be necessary: the OSV finding itself will contain the package name, so including it in the path will be redundant
I think we do need a root of some sort to capture the case where the package is a single object/function.
E.g. some package `foo`, where a user would use `foo` as:

import foo
foo(some_input)
package name isn't guaranteed to be a valid Python identifier or the same as the top-level module name
Can you expand on this? I think maybe I've confused package name with top level identifier. I was under the impression that the 503 package name was the top level identifier, but perhaps that's just a commonality and not a rule. If I'm in error there then yes maybe the top level identifier is the correct approach. Is there a pep for this?
Can you expand on this? I think maybe I've confused package name with top level identifier. I was under the impression that the 503 package name was the top level identifier, but perhaps that's just a commonality and not a rule. If I'm in error there then yes maybe the top level identifier is the correct approach. Is there a pep for this?
Correct: it's just a convention, not a rule. The two technically don't have to be similar at all, and the normalized package name still isn't guaranteed to be a valid Python identifier.
A good canonical example of this is PIL and Pillow -- Pillow is a maintained fork of PIL, so it still uses `import PIL` as its top-level module despite being installed via `pip install Pillow`.

As far as I know, there's no actual PEP stating this anywhere; it's just a quirk of how Python packaging works 🙂

As a result of this, a `__qualname__` like `PIL.foo.bar.baz` may ambiguously refer to either the PIL package or the Pillow one, but the surrounding OSV content (including the package name) provides the disambiguation for us. So in theory just the `__qualname__` should be sufficient, since it's understood it's coming from a specific package.
Ah yes, Pillow. Good example 🙂

I guess the question then is: when reading code, how does one determine what the top-level module is? I assume that's discoverable and that there's only one top-level module.
So in theory just the qualname should be sufficient, since it's understood it's coming from a specific package.
Just to clarify: `__qualname__` should include the top-level module, correct?
I guess the question then is; when reading code how does one determine what the top level module is? I assume that's discoverable and that there's only one top level module.
Yeah, the discovery process for matching a vulnerability to an environment (or source code) would roughly be: collect the `import`s in the target and normalize them (handling import aliases, `from ... import ...`, etc.), then compare the normalized paths against the advisory's affected symbols.

Just to clarify; qualname should include the top level module correct?

Hmm, I thought it did, but from a quick look `__qualname__` only shows the top-level name for a scope, rather than the full module resolution path. Using a random example from my host:
>>> from pip_audit import _cli
>>> _cli.AuditOptions.__qualname__
'AuditOptions'
So we'd need to go a little beyond `__qualname__`, and also probably use `__module__` and similar to build up the full path:
>>> _cli.AuditOptions.__module__
'pip_audit._audit'
(this has the nice effect of also canonicalizing the fully qualified name, since `AuditOptions` isn't actually defined in `_cli`, just imported into it).
So would this be a "simple" concatenation of `__module__` and `__qualname__` or a more complex operation? I'm not following where `_audit` came from in your example, and I'm quite (completely) ignorant of the code inside that `pip_audit` module.
So would this be a "simple" concatenation of module and qualname or a more complex operation?
Yep, it should be "just" that 99% of the time (famous last words). It gets a little bit more complicated when the target is a module or package itself (perhaps it should never be, as a matter of policy?), but even in those cases it should be just different concatenations of `__name__`, etc.

Sorry for the confusion with my example 🙂 -- `_audit` is another module under `pip_audit`, so I was trying to demonstrate that a member imported from one module (`pip_audit._cli`) can actually have a fully qualified name under another module (`pip_audit._audit`). That was mostly to demonstrate the complexity of resolution, e.g. a user might do this in their code:
from pip_audit import _cli
_cli.AuditOptions(...)
and we would want to catch that, despite the "canonical" path for `AuditOptions` being `pip_audit._audit.AuditOptions`, not `pip_audit._cli.AuditOptions`.
Yep, it should be "just" that 99% of the time (famous last words).
So `foo.__module__ + foo.__qualname__`?

Also, I did a quick look for the source of these two functions and couldn't get at them quickly. I'll try to dig them up and link them here when I get some more time to dig. I don't really wanna be in a world where the spec is basically whatever these two functions spit out.
even in those cases it should be just different concatenations of name, etc.
We should iterate out those cases 🙂

On your example: I guess `AuditOptions` can be pulled in from `_cli`, so in practice it's available as a re-export or something? E.g. `_cli` pulls it in for whatever reason, and a user could get access through that module?

I agree that we should stay focused on a canonical path, and maybe this example brings up a question of how to capture aliases for a path, but maybe that's a question to resolve in v2.

So, then my question on `AuditOptions` is: what makes `pip_audit._audit.AuditOptions` the canonical path, and how would I make that determination for a different project?
So foo.module + foo.qualname?
Yep, 99% of the time. There are some items in Python that don't have a `__module__` or `__qualname__` (like module objects themselves); these need to use `__name__` instead (and maybe some other things that I'm forgetting?).

It'd be good to check what Python actually says about those attributes, but it's also worth noting that they're fundamentally dynamic anyways: a misbehaving class can do `__module__` shenanigans, and can rewrite its own `__qualname__`. There's very little Python will probably actually guarantee about these 🙂
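A best-effort sketch of that concatenation rule (`fq_name` is a hypothetical helper name; per the above, it falls back to `__name__` for objects like modules that lack the other two attributes):

```python
import json

def fq_name(obj) -> str:
    """Join __module__ and __qualname__ when both exist; otherwise fall
    back to __name__ (modules have neither __module__ nor __qualname__)."""
    module = getattr(obj, "__module__", None)
    qualname = getattr(obj, "__qualname__", None)
    if module and qualname:
        return f"{module}.{qualname}"
    return getattr(obj, "__name__", repr(obj))

print(fq_name(json.JSONDecoder))  # json.decoder.JSONDecoder
print(fq_name(json))              # json
```

Note how `json.JSONDecoder` resolves to `json.decoder.JSONDecoder` -- the same "canonical path under a different module" effect as the `pip_audit._cli` example above.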
On your example; I guess AuditOptions can be pulled in from _cli in practice then it's available as a re-export or something? eg. _cli pulls it in for whatever reason and a user could get access through that module?
Yep, exactly. In practice, Python's lack of module visibility or explicit exporting means that any internal use of a package by itself results in re-exports that users can accidentally depend on.
So, then my question on AuditOptions is; what makes pip_audit._audit.AuditOptions the canonical path and how would I make that determination for a different project?
This is, unfortunately, a good question with no good answer 🙂

As far as I know, Python's module system has no actual definition of "canonical" for a particular resolution path. As such, a tool/ecosystem like OSV will need to make its own determinations about how best to canonicalize paths. A few options:

- Definition site: the module that `class Foo` appears in is the canonical module, even if `Foo` gets re-exported elsewhere. The downside is that this can surface "private" paths (`somepkg._internal.Foo` rather than `somepkg.public.Foo`).
- Public exports: treat the names a module lists in its `__all__` as canonical.

(There are also further problems, like packages that have canonical paths for public APIs but intentionally use `__getattr__` to force dynamic lookups as part of their deprecation process. Pydantic is a good example of this.)
It'd be good to check what Python actually says about those attributes, but it's also worth noting that they're fundamentally dynamic anyways: a misbehaving class can do `__module__` shenanigans, and can rewrite its own `__qualname__`. There's very little Python will probably actually guarantee about these 🙂

Oh for sure. I'm not worried about modules that are intentionally obtuse for a v1 spec. 99% will get us a lot of value for most users, and we can tackle the infinite regression of edge cases for v2 (or 3 maybe 🙂).

But yes, let's formalize `__module__ + __qualname__` as more than just the output of the functions, and if that's straightforward we can probably start moving on that 🙂
This is, unfortunately, a good question with no good answer 🙂

:(

As such, a tool/~~ecosystem~~ format like OSV will need to make its own determinations about how best to canonicalize paths.

I guess for the moment we could also say that it's a data provider's obligation to do a best-effort canonicalization. There will be aliases in practice, of course, but maybe that's an acceptable choice. I tend to prefer to match whatever a user of a package would use, so that scanning tools can be more easily maintained and made lower-noise: if there's a re-export pattern in some library, but the developer documentation advises users to use that re-export, then we prefer that export. That's human guidance, I know.

But yes, let's formalize `module + qualname` as more than just the output of the functions, and if that's straightforward we can probably start moving on that 🙂
Sounds good! As a starting point: we can say that these names take the form:

module1.module2.module3:attribute

...where the LHS of the `:` is an absolute importable module path, i.e. a value that could be passed into `importlib.import_module(...)` without needing a relative package reference to resolve against.

The RHS is a gettable attribute within the importable module, meaning that `getattr(module3, attribute)` is expected to not raise `AttributeError` (although this isn't guaranteed).

The reason I suggest `:` as the attribute delimiter is for consistency with other bits of the ecosystem: it's what setuptools and other packaging tooling use for `entry_points` syntax, and it also shows up in Pyramid and a few other frameworks. So there's precedent for it, and it maintains separation between the "importable component" and the "gettable component."

When an entire module is meant to be flagged in a finding, we can probably just drop the `:attribute`, e.g.:

module1.module2.module3

...which follows the same `importlib.import_module(...)` rules.
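A sketch of how a tool might resolve such a spec, using only `importlib.import_module` plus `getattr` as described above (`resolve` is a hypothetical helper name):

```python
import importlib

def resolve(spec: str):
    """Resolve a 'pkg.module:attribute' spec; the attribute part is optional.
    The LHS must be importable; the RHS is looked up with getattr."""
    module_path, _, attribute = spec.partition(":")
    module = importlib.import_module(module_path)
    return getattr(module, attribute) if attribute else module

# e.g. resolve("json.decoder:JSONDecoder") returns the JSONDecoder class,
# while resolve("json.decoder") returns the module object itself.
```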
I guess for the moment we could also say that it's a data providers obligation to do a best effort canonicalization. There will be aliases in practice of course, but maybe that's an acceptable choice. I tend to prefer to match whatever a user of a package would use so that scanning tools can be more easily maintained/made to be lower noise so if there's a re-export pattern in some library but also the developer documentation advises users to use that re-export then we prefer that export.
That seems very reasonable to me. There isn't a perfect answer here, so I think anything (like this) that gets us to the 99% case is perfectly suitable 🙂

Any chance you can give some examples of the `:` usage? Not sure I follow on what you're suggesting there.

Any chance you can give some examples of the `:` usage? Not sure I follow on what you're suggesting there.
Sure: using the example above:
pip_audit._audit.AuditOptions
would become:
pip_audit._audit:AuditOptions
On the other hand, a report for an entire module would remain the same:
foo.bar.baz
That's really the only change; the main reasons I suggest it are because it's what other tools in the Python packaging ecosystem do, and because it disambiguates different parts of the path well.
Assuming we're going off https://github.com/pypa/pip-audit/blob/main/pip_audit/_audit.py for your example, `AuditOptions` is a class. I guess I'm not following why that would change. One could still import `pip_audit._audit.AuditOptions`, correct?

One could still import `pip_audit._audit.AuditOptions`, correct?

Not through the `importlib` machinery:
>>> import importlib
>>> importlib.import_module("pip_audit._audit.AuditOptions")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/william/.pyenv/versions/3.11.2/lib/python3.11/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
File "<frozen importlib._bootstrap>", line 1137, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'pip_audit._audit.AuditOptions'; 'pip_audit._audit' is not a package
>>> importlib.import_module("pip_audit._audit")
<module 'pip_audit._audit' from '/Users/william/devel/pip-audit/pip_audit/_audit.py'>
Similarly with the `import` keyword:
>>> import pip_audit._audit.AuditOptions
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pip_audit._audit.AuditOptions'; 'pip_audit._audit' is not a package
That's why the disambiguation with `:` is nice: it makes it clear which parts can actually be imported (which a triaging tool might need to do) versus which parts are attributes.

Ah ok, so the `:` wouldn't necessarily be a path indicator, but an indicator of how to interact with the object. I'm wondering how I would validate that, and I'm not sure I can commit to supporting it reliably. Any thoughts on that end?
Ah ok, so the `:` wouldn't necessarily be a path indicator, but an indicator of how to interact with the object.

Right: it's an indicator that the thing on the RHS isn't a part of the path, but is rather a class, object, function, etc. (i.e. the things that are the actual vulnerable API surface, rather than the path to the vulnerable API surface).

I'm wondering how I would validate that and I'm not sure I can commit to supporting it reliably. Any thoughts on that end?

On one level, validating this is the same as it would be with the `foo.bar.Baz` syntax -- you just check to see if the input matches `\w[\w\d]*(\.\w[\w\d]*)*(:\w[\w\d]*)?` (very roughly).

If you're asking how you'd check that the LHS is a valid path and the RHS is a valid attribute: you can either do it dynamically with `importlib` + `getattr`, or get 95% of the way there statically. But neither is 100% guaranteed to resolve all valid attributes under valid paths.

(This is true for the `foo.bar.Baz` syntax as well -- there's no generalized way to validate that a particular Python module/attribute path is "real" for all users. So I don't think this syntax has any advantages in that respect; the main advantages are that the Python community already uses it elsewhere, and it's slightly less ambiguous about the RHS.)
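A minimal syntactic check along those lines might look like the following (a sketch only, using `\w+` as shorthand since `\w` already covers digits; as noted, passing it says nothing about whether the path actually imports):

```python
import re

# rough shape of the "module path, optional :attribute" syntax
SPEC = re.compile(r"\w+(\.\w+)*(:\w+)?")

def looks_valid(spec: str) -> bool:
    """Purely syntactic validation of a 'pkg.module:attribute' spec."""
    return re.fullmatch(SPEC, spec) is not None

print(looks_valid("pip_audit._audit:AuditOptions"))  # True
print(looks_valid("foo..bar"))                       # False
```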
If you're asking how you'd check that the LHS is a valid path and the RHS is a valid attribute:
This. Intuitively it seems easier to discover valid `foo.bar.Baz` paths than to know if the trailing element is an attribute or not. Knowing if something is an attribute or not seems like it would require much more familiarity with any given code base. I do agree it's less ambiguous, but if we can't reliably provide that data, would `foo.bar.Baz` be an acceptable fallback?

Intuitively it seems easier to discover valid foo.bar.Baz paths than to know if the trailing element is an attribute or not.

Could you elaborate a bit more on what you mean by "discover valid"? In a strict sense, `foo.bar.Baz` and `foo.bar:Baz` are exactly as easy to discover and exactly as hard to validate; the benefit of the `:` is solely to disambiguate the actual API being identified versus a path to it (and because it's what the rest of the packaging ecosystem uses, as mentioned before).
For example, `foo.bar.Baz` could refer to any of the following things:

- a class or function `Baz` defined in `foo.bar`
- an object `Baz` re-exported through `foo.bar`
- a submodule `Baz` under `foo.bar`

And the same is true for `foo.bar:Baz`.

I think the TL;DR of my position is that Python's dynamic nature makes it almost impossible to actually "validate" anything here, whether you use `:` or `.` as the final delimiter -- the value of `:` for the attribute delimiter doesn't come from additional validation power, but instead from consistency with other Python tools and the intent it communicates 🙂.
Could you elaborate a bit more on what you mean by "discover valid"?
Sure. Part of my job is in populating this data, and discovery of paths like `foo.bar.Baz` is its own complexity, but is something I've been able to do with reasonable consistency. The `:` qualifier would make the data more useful for sure, but I am not confident I would be able to support that reliably. So, I don't want to be in a position where I'm generating bad data for the rest of the world.

Now, if the `:` is optional, it could certainly be the case that I primarily output `foo.bar.Baz` and users come and correct me over here. No doubt they'll come correct me for the dot-separated strings too, but I guess I'm concerned about the expectation, as I expect to fail if `:` is expected 🙂
Thanks for explaining!
I understand your position, and I'm also very sympathetic to not adding more work to an already manual process 🙂. I think I might have given you the wrong impression about how difficult this is by talking about pathological cases: in practice, 99.9% of paths that you discover as `foo.bar.Baz` can be automatically rewritten as `foo.bar:Baz`, even when `Baz` is something like a module itself.

That being said, I recognize that even that may be a dealbreaker for you. Ultimately, any amount of path/API information will be beneficial here, so I'll stop hammering on `:` and just say that this is all good and I'm looking forward to any variant of it coming into existence 🙂

Alright, if you're comfortable with it, I'd prefer to keep the spec location-based (and slightly easier 🙂) for v1, with just a dot-separated path to describe the location of a vulnerable object.

I still need to get some time to go find the source for `__module__` and `__qualname__`, but I think we're basically in agreement on what to do 🙂
Ok, so I may have 👍'd a bit early 🙂

First off, for `__module__` and `__qualname__` I can't actually find hard definitions. Best as I can tell, both are `PyObject` types from the C codebase, but that's not super useful. Both `funcobject.h` and `methodobject.h` mention that `/* The __module__ attribute, can be anything */`:

https://github.com/python/cpython/blob/main/Include/cpython/funcobject.h#L42
https://github.com/python/cpython/blob/main/Include/cpython/methodobject.h#L11

That said, in doing some testing to validate what we've been talking about, and to ensure we have some good examples, I came across a Django case of (I think) a re-export which will likely matter for our use case. Using django 4.2.5 for the test.
(vea-testing-venv) jon ~/V/vea-testing-venv ❯❯❯ python
Python 3.11.6 (main, Oct 3 2023, 02:51:45) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from django.db.models import JSONField
>>> JSONField.__module__
'django.db.models.fields.json'
>>> JSONField.__qualname__
'JSONField'
Which seems to be down to this line in Django's `django/db/models/__init__.py` file:

https://github.com/django/django/blob/main/django/db/models/__init__.py#L97

A number of other fields are exported in the same way, and while `module + qualname` does seem like a more canonical representation, I'm left wondering how to bridge the gap so that a tool scanning code can connect that to `from django.db.models import JSONField` / `django.db.models.JSONField`.

Sorry for the delay on getting into the weeds here, but any thoughts?
First off, for `__module__` and `__qualname__` I can't actually find hard definitions. Best as I can tell, both are `PyObject` types from the C codebase, but that's not super useful. Both `funcobject.h` and `methodobject.h` mention that `/* The __module__ attribute, can be anything */`

Yep, this is what I mentioned in https://github.com/pypa/advisory-database/issues/149#issuecomment-1741167885 -- Python will not make any guarantees about the values of these attributes, because (for CPython) they're just `PyObject`s that can point to anything (including their containing object, leading to reference cycles!)
Sorry for the delay on getting into the weeds here but any thoughts?
Unfortunately, I don't know of a great generalization for handling these kinds of canonical paths 🙂 -- there's no particular guarantee that a given attribute has only one "canonical" path, much less that multiple modules don't use the `__all__` mechanism for their re-exports.

I know I sound like a broken record at this point, but my recommendation here would be to sidestep the entire "what is a valid `__module__`/`__qualname__`" problem and think about this entirely in terms of things that Python does strongly define, like `importlib.import_module`'s semantics. That won't get you the "canonical" path(s) for a given attribute, but it will avoid the shenanigans that occur, in practice, with `__module__` and `__qualname__`.

For getting a potential (non-exhaustive) set of "canonical" paths, one possibility would be to use `pkgutil.walk_packages` to walk each module, look for an `__all__` attribute, and match re-exports from there. But this has significant downsides: you'll need to run arbitrary module code (since it's all dynamic Python), and it's not guaranteed to work at all (since `__all__` isn't required for re-exports, and in fact isn't commonly used by libraries that don't expect to be wildcard imported).
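That `pkgutil.walk_packages` + `__all__` approach could be sketched as follows (`modules_exporting` is a hypothetical name; all the caveats above apply, notably that this imports and therefore runs every submodule):

```python
import importlib
import pkgutil

def modules_exporting(package_name, attribute):
    """Best effort: find submodules of a package whose __all__ lists
    `attribute`. Absence of a hit proves nothing, since __all__ is optional."""
    pkg = importlib.import_module(package_name)
    hits = []
    for info in pkgutil.walk_packages(pkg.__path__, prefix=package_name + "."):
        try:
            mod = importlib.import_module(info.name)
        except Exception:
            continue  # broken or side-effectful module; skip it
        if attribute in getattr(mod, "__all__", ()):
            hits.append(info.name)
    return hits

# 'json.decoder' should be among the results, via json.decoder.__all__
print(modules_exporting("json", "JSONDecoder"))
```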
TL;DR: Whether you use `foo.bar.Baz` or `foo.bar:Baz`, I'd recommend treating the `foo.bar` part as an abstract "path" that resolves according to `importlib.import_module`'s semantics, since that's (essentially) the only thing that Python will guarantee here. In terms of determining whether `foo.bar.Baz` is "canonical" in some sense, you can check whether `foo.bar.__all__` contains `"Baz"` (but note the significant constraints above).
Yep, I think you're right and I just got wrapped up in my head 🤦

Ok, so we ignore the idea of canonical paths for the time being and focus instead on import path syntax. I guess first question: is there a name for that?
Going back to the django example: we would then have `django.db.models.JSONField` (as derived from `from django.db.models import JSONField`), which could also be represented as `django.db.models.fields.json.JSONField` (derived from `from django.db.models.fields.json import JSONField`), and the two would be independent and valid paths?
Ok, so we ignore the idea of canonical paths for the time being and focus instead on import path syntax. I guess first question; is there a name for that?
I'm not aware of an official name for this; the `importlib` documentation sometimes calls it a "full path" or similar. I think you can just call it a "path" or "location" or similar.

There may be some additional guarantees documented under `__spec__` and `ModuleSpec` (https://docs.python.org/3/library/importlib.html#importlib.machinery.ModuleSpec), but I don't think they'll directly help here.
and the two would be independent and valid paths?
Yep 🙂

I'm not aware of an official name for this

Hmmmm. Feature request there I guess :)

Digging around the importlib code, it looks like there haven't been any big changes in over a decade, so it's probably pretty stable. I guess for now we can call them "import paths" or something.

Yep 🙂

Cool 🙂
@woodruffw sorry for dropping off on my end. To keep this going, I think we have importlib paths rooted at the top-level module, of the form `django.db.models.JSONField`, which express a path to an object in a given PyPI package (e.g. django).

- Multiple paths may point to the same object, but all paths must be valid import paths.
- The paths may point to any object (class, function, constant, etc...)
- The collection of all paths for a given package could be considered the API for the package.

If all that makes sense and is agreeable, I think the next question is: where does this live? Is this something to make a PR for in this repo, or is there a better place?
No problem!
I think we have importlib paths rooted at the top level module of the form django.db.models.JSONField which express a path to an object in a given pypi package (eg. django).

To clarify:

- `django` here is the top-level module, not the top-level PyPI package. They often share the same name, but there's no particular guarantee that they do (cf. the earlier discussion about `PIL` and `Pillow`). Python's importlib machinery has no concept of a PyPI package name 🙂
- A path like `django.db.models.JSONField` isn't technically an importlib path -- the `django.db.models` part is that path, and `JSONField` is an arbitrary attribute that may be resolvable at the end of the path. If you pass `django.db.models.JSONField` into `importlib`, you'll get an error.

Sorry again if this comes across as pedantic, but I want to make sure we're using the same terms here. The rest makes sense to me!
where does this live? Is this something to make a PR for in this repo or is there a better place?
My first thought would be here, or in one of the OSV repositories (maybe they already document similar syntaxes for other ecosystems? CC @oliverchang). @sethmlarson may also have opinions 🙂
The collection of all paths for a given package could be considered the API for the package.
In terms of expressing this, I had another thought: rather than creating these pseudo-paths like `django.db.models.JSONField`, it might be better to devolve this at the format level. In pseudo-JSON:

{
attribute: "JSONField",
paths: ["django.db.models", "django.something.else"]
}

This is slightly more verbose upfront, but comes with a few benefits: tools no longer need to split apart pseudo-paths like `django.db.models.JSONField`, `django.something.else.JSONField`, etc., and the information for affectedness lives in one place per attribute.

@woodruffw works for me. Is there a reason you're using `path` as the key instead of `module`? For the format, Go appears to have the two swapped, with multiple potential attributes per module; if we're only encoding the initial definition point, having multiple attributes per module seems favorable? i.e.:
{
"ecosystem_specific": {
"affects": [
{
"attrs": ["JSONField"],
"module": "django.db.models"
}
]
}
}
(example assumes that `JSONField` is imported into `django.something.else` and it's up to tools to figure that out)

Also, are we expecting to fully enumerate all places a name is re-exported, or is that the job of tooling to walk an import AST and determine affectedness? (Tools will need to do this anyways in order to determine affectedness from source files, so this doesn't feel like a huge additional burden for an easier experience encoding advisories.)
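For illustration, the kind of import-AST walk such tooling would do might look like this (a best-effort sketch; `maybe_affected` is a hypothetical name, and it ignores aliases, relative imports, and attribute access on imported modules):

```python
import ast

def imported_names(source):
    """Collect (module, name) pairs a source file imports; (module, None)
    means the module itself was imported. Static, best-effort only."""
    pairs = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                pairs.add((alias.name, None))
        elif isinstance(node, ast.ImportFrom) and node.module:
            for alias in node.names:
                pairs.add((node.module, alias.name))
    return pairs

def maybe_affected(source, module, attrs):
    """Flag both `from module import attr` and a bare `import module`."""
    names = imported_names(source)
    return (module, None) in names or any((module, a) in names for a in attrs)

print(maybe_affected("from django.db.models import JSONField",
                     "django.db.models", ["JSONField"]))  # True
```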
In terms of where these choices live: we could have that live in a CONTRIBUTING.md file on this repository? Other OSV databases might have better ideas? The PSF Advisory Database will use whatever exact format we decide on too.

Is there a reason you're using `path` as the key instead of `module`?

Nope, `module` makes way more sense (and is consistent with what I've been yakking about).
Also, are we expecting to fully enumerate all places a name is re-exported, or is that the job of tooling to walk an import AST and determine affectedness? (Tools will need to do this anyways in order to determine affectedness from source files, so this doesn't feel like a huge additional burden for an easier experience encoding advisories.)
My understanding is the latter, with the idea that the format can be a "best effort" source of initial information (e.g. if we know that a project has a few really popular re-exports, it probably makes sense to list them rather than having end user tooling have to do that lifting.)
I'm ++ on best effort too; it doesn't /have/ to be the original definition location if the API is defined a certain way (like `requests.request()` actually being defined elsewhere).
django here is the top-level module, not the top level PyPI package. They often share the same name, but there's no particular guarantee that they do (cf. earlier discussion about PIL and Pillow. Python's importlib machinery has no concept of a PyPI package name
Gah, sorry, you're correct. What I meant by the package part is that the OSV path data will be subordinate to an affected product, which itself will be a PyPI package. That's me conflating things though.
> A path like `django.db.models.JSONField` isn't technically an importlib path -- the `django.db.models` part is that path, and `JSONField` is an arbitrary attribute that may be resolvable at the end of the path. If you pass `django.db.models.JSONField` into `importlib`, you'll get an error.
I do like the idea of attributes, but again I really question my ability to provide those with high fidelity. Perhaps that's something I simply have to deal with, though...
re:

```json
{
  "attribute": "JSONField",
  "modules": ["django.db.models", "django.something.else"]
}
```
I do like that as an approach to de-duping paths to an affected code point. I wanna double check that the statement "The collection of all paths for a given package could be considered the API for the package." still works for you though.
Using the example before, this would deconstruct to `django.db.models.JSONField` && `django.something.else.JSONField`, and the collection of all possible paths to objects could be considered an API for the ~~package~~ top-level module.
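As a rough illustration of that "collection of all paths" idea, here is a hypothetical sketch that enumerates one level of module paths plus their public attributes. It uses the stdlib `json` package so it runs anywhere; a real tool would point at the distribution's actual top-level module(s) and would need to recurse and worry about import side effects:

```python
import importlib
import pkgutil

def api_paths(top_level: str) -> set[str]:
    """Enumerate dotted paths to modules and their public attributes."""
    paths = set()
    mod = importlib.import_module(top_level)
    modules = [top_level]
    if hasattr(mod, "__path__"):  # it's a package; include direct submodules
        modules += [
            f"{top_level}.{info.name}"
            for info in pkgutil.iter_modules(mod.__path__)
        ]
    for name in modules:
        m = importlib.import_module(name)
        paths.add(name)
        paths.update(
            f"{name}.{attr}" for attr in dir(m) if not attr.startswith("_")
        )
    return paths

paths = api_paths("json")
print("json.decoder.JSONDecoder" in paths)  # True
print("json.JSONDecoder" in paths)         # True -- a re-export, so two paths
```

Note how `JSONDecoder` shows up under two paths: that's exactly the re-export situation discussed upthread, where multiple dotted paths lead to the same affected object.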
No problem!
> I think we have importlib paths rooted at the top-level module of the form `django.db.models.JSONField` which express a path to an object in a given PyPI package (e.g. `django`).

To clarify:
- `django` here is the top-level module, not the top-level PyPI package. They often share the same name, but there's no particular guarantee that they do (cf. earlier discussion about `PIL` and `Pillow`). Python's importlib machinery has no concept of a PyPI package name.
- A path like `django.db.models.JSONField` isn't technically an importlib path -- the `django.db.models` part is that path, and `JSONField` is an arbitrary attribute that may be resolvable at the end of the path. If you pass `django.db.models.JSONField` into `importlib`, you'll get an error.

Sorry again if this comes across as pedantic, but I want to make sure we're using the same terms here. The rest makes sense to me!
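The module-vs-attribute distinction can be demonstrated with the stdlib (using `os.path` / `join` in place of `django.db.models` / `JSONField`, since Django may not be installed wherever you try this):

```python
import importlib

# The module part of a dotted path is importable...
importlib.import_module("os.path")

# ...but the trailing attribute is not a module, so importing the full
# dotted path fails.
try:
    importlib.import_module("os.path.join")
except ModuleNotFoundError as exc:
    print(exc)

# Resolving the full dotted path therefore takes two steps: import the
# module part, then getattr() the attribute part.
mod_path, attr_name = "os.path.join".rsplit(".", 1)
obj = getattr(importlib.import_module(mod_path), attr_name)
print(obj is __import__("os").path.join)  # True
```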
Where does this live? Is this something to make a PR for in this repo, or is there a better place?
My first thought would be here, or in one of the OSV repositories (maybe they already document similar syntaxes for other ecosystems? CC @oliverchang). @sethmlarson may also have opinions.
I think it would make sense for this to belong in this DB, and the OSV schema can link to here as the authoritative source of this. e.g. the OSV schema links to https://go.dev/security/vuln/database for the Go-specific bits (including how they identify vulnerable symbols).
> I do like the idea of attributes, but again I really question my ability to provide those with high fidelity.
Sorry if you've already explained this upthread, but I think this should really just be a mechanical transform on your end. In other words, if you have the ability to present identifiers of the form `foo.bar.Baz`, then presenting the "module and attribute" form should be no more complicated than:

```python
import json

mod_path, attr_name = "foo.bar.Baz".rsplit(".", 1)
json.dumps({"module": [mod_path], "attribute": attr_name})
```
In other words: presenting it in this form doesn't imply anything different about the data quality; it's just a way to de-dupe.
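The de-dupe itself might look like the following sketch; `group_paths` is a hypothetical helper name, and the input paths reuse the django example from this thread:

```python
from collections import defaultdict

def group_paths(paths: list[str]) -> list[dict]:
    """Collapse dotted paths sharing an attribute into one record each."""
    grouped = defaultdict(list)
    for path in paths:
        mod_path, attr_name = path.rsplit(".", 1)
        grouped[attr_name].append(mod_path)
    return [
        {"attribute": attr, "modules": mods}
        for attr, mods in grouped.items()
    ]

records = group_paths(
    ["django.db.models.JSONField", "django.something.else.JSONField"]
)
print(records)
# one record: attribute 'JSONField', modules
# ['django.db.models', 'django.something.else']
```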
> I wanna double check that the statement "The collection of all paths for a given package could be considered the API for the package." still works for you though. Using the example before, this would deconstruct to `django.db.models.JSONField` && `django.something.else.JSONField`, and the collection of all possible paths to objects could be considered an API for the ~~package~~ top-level module.
Yep, that sounds good (and correct) to me!
> Sorry if you've already explained this upthread, but I think this should really just be a mechanical transform on your end. In other words, if you have the ability to present identifiers of the form `foo.bar.Baz`, then presenting the "module and attribute" form should be no more complicated than: ...
It's mostly my own apprehension and fear, honestly :) I'm willing to trust, though, and sign on with the idea that we should annotate attributes on modules.

That said, I will pester you with a few questions too.
So, for the fix commit https://github.com/python-pillow/Pillow/commit/1fe1bb49c452b0318cad12ea9d97c3bef188e9a7, a max length value is added to `PIL.ImageFont`:
```python
>>> import PIL
>>> PIL.ImageFont
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'PIL' has no attribute 'ImageFont'
>>> from PIL import ImageFont
>>> ImageFont
<module 'PIL.ImageFont' from '/Users/jon/venv/lib/python3.11/site-packages/PIL/ImageFont.py'>
>>>
```
I think `PIL.ImageFont` is the module in question, with `MAX_STRING_LENGTH` as a new attribute.
However, I would say in English that `PIL.ImageFont` is the affected component, and with the format described above I would store that as

```json
{
  "attribute": "",
  "modules": ["PIL.ImageFont"]
}
```
Where I think the lack of an attribute can be read as a lack of refinement on the lookup, e.g. if no attribute is listed then assume any/all. Deeper analysis could refine that further, but I want to be sure that this read of the information is correct, since a lot of the discovery work is so manual and hence time-constrained.
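That "no attribute means any/all" reading could be encoded in a matcher roughly like this (hypothetical tooling, not part of the proposal itself; the record shape follows this thread):

```python
def path_is_affected(record: dict, mod_path: str, attr_name: str) -> bool:
    """Does a (module, attribute) pair fall under an advisory record?"""
    if mod_path not in record["modules"]:
        return False
    # No attribute listed: read as "the whole module is affected".
    if not record["attribute"]:
        return True
    return record["attribute"] == attr_name

record = {"attribute": "", "modules": ["PIL.ImageFont"]}
print(path_is_affected(record, "PIL.ImageFont", "truetype"))  # True
print(path_is_affected(record, "PIL.Image", "open"))          # False
```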
> It's mostly my own apprehension and fear, honestly :) I'm willing to trust, though, and sign on with the idea that we should annotate attributes on modules.
Very understandable! I think the best way to think about this is as a purely mechanical operation, so there should be no additional context or complexity on the producer side.
> I think `PIL.ImageFont` is the module in question, with `MAX_STRING_LENGTH` as a new attribute. However, I would say in English that `PIL.ImageFont` is the affected component
In that case, I would express that as:

```json
{
  "attribute": "ImageFont",
  "modules": ["PIL"]
}
```
...since a module (like `ImageFont`) is an attribute. That should hopefully emphasize what I mean by "purely mechanical": any path like `foo.bar.Baz` can be transformed into this representation without any additional context or loss of generality/information.
That does help quite a bit. I missed that modules are attributes. That clears up a lot on my end, and I agree with the OSV-style representation too. Do we want to say also that `PIL:ImageFont` would be an equivalent representation? Also, since I think we're basically aligned, I'd love to get a PR going. Is there a specific file I should open a PR against? `CONTRIBUTING.md` as suggested above?
> Do we want to say also that `PIL:ImageFont` would be an equivalent representation?
Yep! That was what I was sort of going for with the original `path:attribute` thing, but I'm glad we actually settled on this "many modules, one attribute" JSON object layout instead.
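For illustration, round-tripping between the colon short form and the object layout is trivial; the exact `module:attribute` spelling is still just a proposal, and the helper names here are hypothetical:

```python
def parse_short_form(spec: str) -> dict:
    """Expand 'PIL:ImageFont' into the 'many modules, one attribute' object."""
    mod_path, _, attr_name = spec.partition(":")
    return {"attribute": attr_name, "modules": [mod_path]}

def to_short_forms(record: dict) -> list[str]:
    """Flatten an object record back into one short form per module."""
    return [f'{mod}:{record["attribute"]}' for mod in record["modules"]]

print(parse_short_form("PIL:ImageFont"))
# {'attribute': 'ImageFont', 'modules': ['PIL']}
print(to_short_forms({"attribute": "JSONField",
                      "modules": ["django.db.models",
                                  "django.something.else"]}))
# ['django.db.models:JSONField', 'django.something.else:JSONField']
```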
> Is there a specific file I should open a PR against? `CONTRIBUTING.md` as suggested above?
That makes sense to me -- maybe @sethmlarson has opinions here as well.
> Yep! That was what I was sort of going for with the original `path:attribute` thing, but I'm glad we actually settled on this "many modules, one attribute" JSON object layout instead.
Just to be 100% clear, I'm suggesting both, where the single-line form would essentially be a short form of the "many modules, one attribute" object. I can just see the single-line form being useful in a lot of more conversational contexts, as well as possibly in some tooling settings.
> Just to be 100% clear, I'm suggesting both, where the single-line form would essentially be a short form of the "many modules, one attribute" object. I can just see the single-line form being useful in a lot of more conversational contexts, as well as possibly in some tooling settings.
Gotcha, makes sense to me!
Context: https://github.com/ossf/osv-schema/issues/127

Summary: Many ecosystems/databases are starting to encode more fine-grained "vulnerable symbol" information in advisories.
- Go has defined their own `ecosystem_specific` format for describing affected functions: https://go.dev/security/vuln/database#schema. Most vulns in the Go vuln DB have this.
- RustSec has this in their security advisories: https://github.com/rustsec/advisory-db/blob/ee74d43ec2c6203750e38a254f4029f6181c4362/crates/RUSTSEC-2021-0027.json#L30
- GHSA also has their own: https://github.com/github/advisory-database/blob/046aafb21a71eadbe8bdd5a72b1491a5f1209fb7/advisories/github-reviewed/2023/07/GHSA-57fc-8q82-gfp3/GHSA-57fc-8q82-gfp3.json#L25
Given that ecosystems have their own nuances for describing these accurately (e.g. OS, arch specifiers), it makes more sense for ecosystems to each define their own `ecosystem_specific` field. We should define an `ecosystem_specific` for Python as well as part of this DB.