Open riwoodward opened 11 months ago
Thanks for the detailed report @riwoodward! We greatly appreciate it.
Your understanding is correct: `pip-audit` currently always needs some form of online access, whether to PyPI or OSV.
I'll let @di opine here as well, but here's my thoughts:
[...] "`pip-audit` knows how to parse the PyPA Advisory DB": doing so might hamper their ability to make breaking changes, and would also put us in the hotseat whenever they do. In other words: we get a lot of compatibility benefit out of only consuming the Advisory DB through the PyPI JSON API, which has standardized how it presents it 🙂
[...] `pip-audit` quite a bit.
So TL;DR: I think we could do this if (1) we can get some kind of stability guarantees about the Advisory DB's format, and (2) doing so doesn't make our ultimate plans to integrate into `pip` any harder. But I'm not sure if we should, given that the "right" (but admittedly annoying) way to do this is to mirror the JSON API locally.
I don't think we should try to integrate against the advisory DB directly, mostly for the reasons @woodruffw mentioned.
I do think we could possibly support some kind of local file cache, but I think I need to understand the use case a bit more before I can say that would be a useful thing to add.
Overall I don't think I really understand how or why this feature will be used... assuming you have a lockfile and a snapshot of an advisory database, nothing is going to change: at the time you're online to get the snapshot, you already know about all the possible vulnerabilities for the subset of dependencies you have. Any offline audit with these two things is always going to produce the same result.
The main point of pip-audit is to find new vulnerabilities in dependencies you're using as they are discovered. The other use case is being aware of existing vulnerabilities when adopting new dependencies, but again, I don't see how you could be introducing new dependencies somewhere offline where you wouldn't be able to also query for vulnerabilities.
Is the general goal just to remove a dependency on an external service? That's why we provide PyPI as a provider, because you already have a dependency on PyPI anyways for everything else you need for installation.
Thanks both for the comments.
As @di suggests, perhaps I can flesh out the real-world use case more clearly.
In high-security environments, CI runners, build boxes etc. are often kept offline. Dependencies that are required can be mirrored to a dedicated local mirror machine on the LAN. Pip-audit sounds ideal for "finding new vulnerabilities in dependencies as they're discovered" as @di suggests, which could be achieved using e.g. scheduled CI pipelines where a job includes the vulnerability check. The problem, however, is: how can one keep the pip-audit information source updated without Internet access? Without an up-to-date vulnerability database, new vulnerabilities in a project won't be discovered and the developer won't be alerted automatically.
Pip-audit is still useful to developers if they run the command manually during development on machines with Internet access, but I believe there is an opportunity to make this even more useful and add "offline" support.
I like the "mirror the PyPA advisory database" approach to some internal LAN mirror since then you're only storing the vulnerability info. Mirroring the whole PyPI JSON API would take significantly more space, bandwidth etc and most of that wouldn't be useful data. If you're aware of a good open-source tool for mirroring the API by the way, please let me know as it seems only partially supported in most packages that do this (e.g. devpi). At present, pip-audit hard-codes the PyPI URL anyway so even a local JSON API mirror would still need work to implement pip-audit support.
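For reference, a rough sketch of what querying a mirrored JSON API could look like if the base URL were configurable. This is not how pip-audit is implemented; it assumes the mirror serves the same `/pypi/<name>/<version>/json` layout as pypi.org and includes the `vulnerabilities` key in the response. The function names and the mirror hostname in the usage note are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def release_url(base_url: str, name: str, version: str) -> str:
    """Build the per-release JSON endpoint under a configurable base URL."""
    return f"{base_url.rstrip('/')}/pypi/{quote(name)}/{quote(version)}/json"


def extract_vulnerabilities(payload: dict) -> list[dict]:
    """Pull the advisory entries out of an already-decoded JSON API response.

    Entries with a truthy `withdrawn` field are skipped (assumption: a
    withdrawn advisory should not be reported).
    """
    results = []
    for vuln in payload.get("vulnerabilities", []):
        if vuln.get("withdrawn"):
            continue
        results.append(
            {
                "id": vuln["id"],
                "aliases": vuln.get("aliases", []),
                "fixed_in": vuln.get("fixed_in", []),
            }
        )
    return results


def query(base_url: str, name: str, version: str, timeout: float = 10.0) -> list[dict]:
    """Fetch one release from the (possibly local) JSON API and return its advisories."""
    with urlopen(release_url(base_url, name, version), timeout=timeout) as response:
        return extract_vulnerabilities(json.load(response))
```

With something like this, `query("http://mirror.lan", "flask", "0.5")` would hit the LAN mirror instead of pypi.org, provided the mirror actually carries the advisory data.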
I understand the concerns @woodruffw raises about using PyPA's repo though. For now, I guess I'll just have to use my hacked approach at the start of this issue, as that does work well for me. But I'll keep an eye on the project in case this (or another offline method) becomes an official feature.
Thanks again for the great tool too!
> In high-security environments, CI runners, build boxes etc. are often kept offline. Dependencies that are required can be mirrored to a dedicated local mirror machine on the LAN. Pip-audit sounds ideal for "finding new vulnerabilities in dependencies as they're discovered" as di suggests, which could be achieved using e.g. scheduled CI pipelines where a job includes the vulnerability check. The problem, however, is: how can one keep the pip-audit information source updated without Internet access? Without an up-to-date vulnerability database, new vulnerabilities in a project won't be discovered and the developer won't be alerted automatically.
I guess what I'm missing is: why does the check need to happen offline? Why can't it happen at the time you would "download" the cache, which you would need to be online for anyways?
> I guess what I'm missing is: why does the check need to happen offline? Why can't it happen at the time you would "download" the cache, which you would need to be online for anyways?
Banks and similar environments have dedicated security teams that have their own idea of safety and convenience. They might agree to establish a very narrow channel to download configuration data, a vulnerability database, etc.
Large organizations also have a policy to "accept risk". This might translate to having a vulnerability database which is a few days or a week old. Cyber security people are held accountable for their decisions, hence they want to stay in control—they define the price.
They want to make sure that an attacker has as small an attack surface as possible. They're not totally wrong: a lot can happen with direct access to the Internet. Supply chain attacks are another enormous danger, followed by social engineering.
Bottom line: air-gapped environments are real. Solutions are needed. See GitLab's security scanning offerings for an example. They didn't invent that themselves; banks and other customers pushed them. I know it first-hand.
I'm sympathetic to the idea that there are contexts/setups where these checks should happen offline, but I think the engineering points in https://github.com/pypa/pip-audit/issues/698#issuecomment-1826992607 are still outstanding: we can't (reasonably) support an offline mode in `pip-audit` until the PyPA advisory DB is guaranteed to not change formats on us. Right now we get that property by indirectly relying on it via the PyPI JSON API, but it's not a guarantee of the DB itself. So upstream coordination is required before we can even think about an improved offline workflow here.
Linking things together: #805 presents a similar need.
Is your feature request related to a problem? Please describe.
Currently two vulnerability services are offered, `pypi` and `osv`, but both rely on pip-audit retrieving information from the Internet (e.g. with URLs specified in the code). This is a problem when operating on offline machines or with limited Internet access. As it stands, I don't think pip-audit can operate without an available Internet connection.
Describe the solution you'd like
I would like an option to execute pip-audit using a locally available copy of the advisory database. For example, I could maintain a local mirror on my network for the PyPA advisory database repo (https://github.com/pypa/advisory-database), then when pip-audit needs to be run on any offline machine on my network, I could simply retrieve from the local mirror and pass the path for this to pip-audit for it to use as a vulnerability service. This would be particularly useful for offline / air-gapped CI systems.
I hacked together a quick implementation for this and it works well. I just modified the `query` function of `pip_audit/_service/pypi.py` to become as below, and passed the path to the local copy of the PyPA advisory database repo as an env var (e.g. `export PIPAUDITDB=~/advisory-database`).
I guess to make this an officially supported feature, the path to the PyPA repo could be specified using the `-s SERVICE, --vulnerability-service SERVICE` arg? Or make another option beyond `osv` and `pypi`, e.g. `pypa-repo`, with another arg for the path to said repo?
I note that offline indexes are already supported using the `--index-url` arg, so this could be complementary?
If interested, I could put together a PR (i.e. taking the above approach and adding error handling, proper use of args, tidy up etc)? I wanted to see what you thought of the method / proposed approach first though.
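As a rough illustration of the idea (not the actual modification described above, and not pip-audit code), a local lookup could be sketched as follows. It assumes the checkout keeps one directory per package under `vulns/`, that each advisory is an OSV-style record with an `id` and an `affected[].versions` list, and that directory names follow PEP 503 normalization. The function names are hypothetical. To stay standard-library-only the default parser is `json.loads`; for the real repo, which stores YAML, one would presumably pass `yaml.safe_load` (PyYAML) and `pattern="*.yaml"`.

```python
import json
import re
from pathlib import Path
from typing import Any, Callable, Iterator


def canonical_name(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def iter_advisories(
    db_root: str,
    package: str,
    parse: Callable[[str], Any] = json.loads,
    pattern: str = "*.json",
) -> Iterator[dict]:
    """Yield parsed advisory records for one package from a local checkout.

    Assumed layout: <db_root>/vulns/<package>/<advisory files>.
    """
    pkg_dir = Path(db_root) / "vulns" / canonical_name(package)
    if not pkg_dir.is_dir():
        return  # no advisories recorded for this package
    for path in sorted(pkg_dir.glob(pattern)):
        yield parse(path.read_text(encoding="utf-8"))


def find_vulnerabilities(db_root: str, package: str, version: str, **kwargs: Any) -> list[str]:
    """Return the IDs of advisories whose enumerated versions include `version`."""
    hits = []
    for record in iter_advisories(db_root, package, **kwargs):
        for affected in record.get("affected", []):
            if version in affected.get("versions", []):
                hits.append(record["id"])
                break  # one match per advisory is enough
    return hits
```

The database path could then come from the environment, e.g. `find_vulnerabilities(os.environ["PIPAUDITDB"], "flask", "0.5")` after `import os`. Matching on the enumerated `versions` list avoids having to interpret version ranges, at the cost of relying on that list being complete.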
Describe alternatives you've considered
Running a local PyPI mirror including the JSON advisory info could work but that would be considerably more effort and resource usage to achieve the same goal.
Additional context
Really great tool otherwise - thanks!