pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

Validate the contents of identity centric metadata #8635

Open dstufft opened 3 years ago

dstufft commented 3 years ago

Currently, if I'm looking at a project on PyPI, it can be difficult to determine if it's "real" or not. I can look and see the usernames of the users publishing the project, as well as certain key pieces of metadata such as the project home page, the source repository, etc.

Unfortunately, there's no way to verify that a project that has, say, https://github.com/pypa/pip as its home page is actually the real pip and isn't a fake imposter pip. The same goes for other URLs, email addresses, etc. Thus it would be useful if there was some way to actually prove ownership of those URLs/emails, and either differentiate them in the UI somehow, or hide them completely unless they've been proven to be owned by one of the publishing users.


Metadata to verify:

jamadden commented 3 years ago

I recall issues like this coming up at least once, if not a few times, over in pypi-support. Someone would fork a repository, change the name in setup.py, and then upload it to PyPI, with or without any other changes. All of the documentation and other links would remain pointed to the originals. This was confusing for users and frustrating to owners of the original package.

So I'm 👍 for some sort of blue verified checkmark or something from that perspective.

With my publisher hat on, though, I would hope this would be completely automated and I wouldn't have to do anything special to earn that blue checkmark.

ewjoachim commented 3 years ago

One idea: we could add a blue checkmark for all links in the sidebar that contain a link back to the project's pypi page or pip install <project-name>. This would force us to load those links on the server, but it would be zero-effort for most packages.

That being said, it wouldn't help if they point to forked versions, but in that case, the github star count might be a tell.

calebbrown commented 1 year ago

👍

Any progress on this issue?

I've been looking at malware from PyPI and it is common for the author_email to be "spoofed" (either pointing to nowhere, or using somebody else's email address).

Some related context is this HN discussion: https://news.ycombinator.com/item?id=33438678 Many commenters are asking about providing this sort of information.

I see some considerations that need discussion:

Some validation is easier than others as well - e.g. email validation is pretty straightforward, but homepage validation would require something like the ACME protocol.

ewjoachim commented 1 year ago

Haha, rereading my 2-year-old comment above about blue checkmarks seems to resonate strangely in today's terms :sweat_smile:

Who would have guessed...

di commented 1 year ago

My general thought here is that for metadata we can 'verify', we should probably elevate that metadata in the UI over 'unverified' metadata.

We can already validate email addresses that correspond to verified emails of maintainers. That won't include the ability to verify mailing-list-style emails, but that could potentially be added to organizations once that feature lands.

With #12465, we'll be able to 'validate' the source repository as well, so any metadata that references the given upstream source repository can be considered verified too.

I agree that domains/urls will need to use the ACME protocol or something similar. I think there's probably a UX question on how these would be done per-project, if we wanted to go that route.

ewjoachim commented 1 year ago

Mastodon has a link verification system; that might be nice.

That's never going to be foolproof though.

miketheman commented 1 year ago

Related: #8462 #10917

jayaddison commented 7 months ago

From attempting to perform identity-assurance checks on packages manually: bidirectional references can be a reassuring indicator.

In context here: when a PyPI package points to a GitHub repository as its source code, that's interpretable as a useful but as-yet-untrusted statement. When up-to-date references are inspected within the contents of the cloned linked repository and they point back to the same original package on PyPI, then confidence in the statement increases.

For reproducible-build-compliant packages the situation improves further: any third party can confirm not only that the source origin and package destination are in concordance, but also whether the published artifact from the destination is bit-for-bit genuine, by comparing it to a from-scratch build of the corresponding raw origin source materials. This can be verified on both a historic and ongoing basis.

So that's two orthogonal identity validation mechanisms: bidirectional references and reproducible builds.

These don't prevent an attacker copying the source in entirety and creating a duplicate under a different name with an internally-consistent reference graph. Given widespread free communication I think it's reasonable to expect that enough of the package consumer population will be (or become) aware of and gravitate towards the authentic package to solve that problem.
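
As an illustration of the bit-for-bit comparison step mentioned above, here is a minimal Python sketch; the function name and file paths are placeholders, and a real reproducible-build check would also have to pin the build environment:

import hashlib


def same_artifact(published_path: str, rebuilt_path: str) -> bool:
    """Compare a downloaded PyPI artifact against a from-scratch rebuild, byte for byte."""

    def sha256_of(path: str) -> str:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    return sha256_of(published_path) == sha256_of(rebuilt_path)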

di commented 6 months ago

Following on from my previous comment, here's a mockup of what I'm imagining for separating the metadata we can verify today (source repository, maintainer email, GitHub statistics, Owner/Maintainers) from the unverifiable metadata:

[mockup screenshot]

Over time we can move things from below the fold to above it, but this should be a big improvement as-is for now.

I pushed the diff for the mockup here; there's some hacky stuff in there just to get the mockup to look good, but it could be a good starting point.

javanlacerda commented 5 months ago

I'm starting to work on this, creating the verified section and adding "Owner"/"Maintainers" to it :)

ewjoachim commented 5 months ago

I wonder if it makes more sense to have verified details and then unverified details, or to have each category with a verified sub-section and an unverified sub-section. It feels weird to break the project links apart from one another. When your eyes have reached the place where the repository link is, it's not very clear that, if the documentation isn't there, you have to look somewhere else entirely for a different link section that might contain the link to the docs.

I'd even argue that in this case, the whole thing would look more readable if the project doesn't use trusted publishers, which is

What about something like this? (Not arguing it's better, just a suggestion for the discussion.)

[alternative mockup screenshot]

(also, this needs a link, or a hover infobox or something to lead people to the documentation that says what this means, what this certifies, and how to get the various parts of their metadata certified)

(Would @nlhkabu have an opinion on the matter?)

di commented 1 month ago

#16205 starts marking URLs as verified; this now just needs to be surfaced in the UI.

facutuesca commented 1 month ago

#16205 starts marking URLs as verified; this now just needs to be surfaced in the UI.

I'm working on the UI part now

di commented 1 month ago

Reopening this: we have solved this for a subset of project URLs that relate to Trusted Publishing, but these two remain:

"Metadata" should only be emails that are included in Author-Email or Maintainer-Email that are also verified user emails for any collaborator on the project "GitHub Statistics" should only be included if "Source" is verified

I think we also want to think about a solution for validating non-trusted publisher project URLs (e.g., ACME).

woodruffw commented 1 month ago

I think we also want to think about a solution for validating non-trusted publisher project URLs (e.g., ACME).

I am a massive fan of this idea 🙂

To spitball a little bit, a given FQDN's (foo.example.com) resources could be considered verified for a project example if:

  1. foo.example.com is a secure origin (HTTPS)
  2. foo.example.com/.well-known/pypi exists and is JSON
  3. The contents of the pypi JSON resource are something like this:
{
  "version": 1,
  "packages": ["example"]
}

(where packages is plural since a FQDN may host links/resources for multiple projects.)
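
A minimal sketch of what checking that hypothetical resource could look like; the endpoint name and JSON shape are just the spitballed ones above, and the requests usage is illustrative rather than a design decision:

import requests
from urllib.parse import urlparse


def fqdn_verifies_project(url: str, project_name: str) -> bool:
    """Check whether the FQDN hosting `url` claims `project_name` via /.well-known/pypi."""
    parsed = urlparse(url)
    if parsed.scheme != "https":  # must be a secure origin
        return False
    well_known = f"https://{parsed.netloc}/.well-known/pypi"
    try:
        resp = requests.get(well_known, timeout=10)
        resp.raise_for_status()
        data = resp.json()
    except (requests.RequestException, ValueError):  # network failure or not JSON
        return False
    # "packages" is plural, since one FQDN may vouch for several projects.
    return data.get("version") == 1 and project_name in data.get("packages", [])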

Another lower-intensity option would be the rel="me" approach that Mastodon and similar services use. This approach has the benefit of being per-resource (meaning that the entire FQDN isn't considered valid, since the user may not control the entire FQDN), at the cost of requiring the user to tweak their HTML slightly. Using example again:

  1. foo.example.com is a secure origin
  2. /some/resource on foo.example.com is (X)HTML
  3. /some/resource contains a <link ...> as follows:
<head>
  <link rel="me" href="https://pypi.org/p/example">
</head>

Like with the .well-known approach, the user could include multiple <link> tags to assert ownership of multiple PyPI projects.

Alternatively, this could use meta instead to prevent implicit resolution of links:

<head>
  <meta rel="me" namespace="pypi.org" package="example">
</head>

...or even multiple in the same tag:

<head>
  <meta rel="me" namespace="pypi.org" package="example another-example">
</head>

Edit: one downside to the rel="me" approach is that PyPI needs to parse (X)HTML.

Ref on Mastodon's link verification: https://docs.joinmastodon.org/user/profile/#verification
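
To sketch the PyPI side of the rel="me" approach, here's a rough pass using the stdlib html.parser (which tolerates partial markup); the accepted tag shapes are only the ones proposed above and would need to be pinned down in a real spec:

from html.parser import HTMLParser


class RelMeCollector(HTMLParser):
    """Collect project names asserted via <link rel="me" href="https://pypi.org/p/NAME">
    or <meta rel="me" namespace="pypi.org" ...> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.claimed: set[str] = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("rel") != "me":
            return
        if tag == "link" and (attrs.get("href") or "").startswith("https://pypi.org/p/"):
            self.claimed.add(attrs["href"].removeprefix("https://pypi.org/p/").strip("/"))
        elif tag == "meta" and attrs.get("namespace") == "pypi.org":
            # Accept either the package="..." or the content="..." variant.
            self.claimed.update((attrs.get("package") or attrs.get("content") or "").split())


def page_claims_project(html: str, project_name: str) -> bool:
    parser = RelMeCollector()
    parser.feed(html)  # html.parser does not raise on truncated input
    return project_name.lower() in {name.lower() for name in parser.claimed}

(A real implementation would also want proper PEP 503 name normalization instead of plain lowercasing.)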

di commented 1 month ago

Using .well-known would be interesting but I think this wouldn't cover enough use cases (I'm thinking things like Read the Docs pages, etc, where the user doesn't control the FQDN).

Using rel="me" makes sense to me. We probably also need to think about verifying these outside the upload loop via a task.

woodruffw commented 1 month ago

Using .well-known would be interesting but I think this wouldn't cover enough use cases (I'm thinking things like Read the Docs pages, etc, where the user doesn't control the FQDN).

Yeah -- my thinking was that it'd be most useful for projects/companies with full-blown domains, e.g. company $foo might want to blanket-verify all of their PyPI project names without having to add them to each docs page/homepage. But that's probably a much more niche case than Read The Docs etc. 🙂

Using rel="me" makes sense to me. We probably also need to think about verifying these outside the upload loop via a task.

Makes sense! That reminded me: do we also want to consider periodic reverification? My initial thought is "no" since URLs are verified on a per-release basis, but I could see an argument for doing it as well (or at least giving project owners the ability to click a "reverify" button once per release or similar).

di commented 1 month ago

I think I would want to stick to the "verified at time of release" model, at least for now (not trying to reinvent Keybase here).

ewjoachim commented 1 month ago

If we specifically plan for this to be used on ReadTheDocs, it makes sense to ensure that whatever format we decide on is easy to use with Sphinx & mkdocs.

I've made a small test with Sphinx:

.. meta::
   :rel=me namespace=pypi.org package=example: pypi

produces

<meta content="pypi" namespace="pypi.org" package="example" rel="me" />

Note: it's going to be much harder to have multiple values in package separated by a space as it would break the meta directive parsing. It's easy to have multiple meta tags though:

.. meta::
   :rel=me namespace=pypi.org package=example: pypi
   :rel=me namespace=pypi.org package=other: pypi

Also, it's impossible not to have a content attribute, as far as I can tell. But then maybe instead of package= we should list the packages in content, which would solve both issues at once:

.. meta::
   :rel=me namespace=pypi.org: example other

produces

<meta content="example other" namespace="pypi.org" rel="me" />

On mkdocs, it looks like one would need to extend the theme, which is a bit cumbersome. But I'm sure someone would make a plugin soon enough to solve this (it's probably the same with Sphinx, realistically).

woodruffw commented 1 month ago

Thanks for looking into that @ewjoachim! Using content for the package(s) makes sense to me, given that 🙂

facutuesca commented 2 weeks ago

I'm working on this. One thing we should be aware of is that implementing this kind of verification, where each URL is accessed and parsed to see if it contains the meta tag specified above, means that warehouse will start making a lot more outgoing requests.

I'm not sure how many releases are uploaded per second (let's call it N), but assuming (conservatively) that each of those releases has 1 URL in its metadata, that means PyPI will make at least N new outgoing requests per second, to arbitrary URLs.

We can have restrictions to reduce network activity and protect against DoS attacks (like what Mastodon does, limiting the response size to 1 MB), but we'll still need to handle all the new outgoing requests.

Since I don't know the number of releases uploaded per second, I'm leaving the question open here to see if PyPI's infrastructure can handle the extra network activity downloading those webpages would cause.

woodruffw commented 2 weeks ago

Yeah, thanks for calling that out! I think there are a few things PyPI could do to keep the degree of uncontrolled outbound traffic to a minimum:

PyPI could do some or all of them; I suspect limiting unique FQDNs might be a little too extreme compared to the others.

ewjoachim commented 2 weeks ago

Also, as far as I can tell, pypi.org's DNS points to Fastly. If it were easy to learn the IPs of the real servers behind Fastly, DDoS attacks could become easier. We need to make sure that the outgoing IP used to connect to the website is not the same as the inbound IP for the website. That's probably the case since the workers run on different machines, but it's worth someone who knows the infrastructure well enough mentally checking this. (Also, ensure that inbound traffic not coming from Fastly is firewalled; it's probably already the case, but if it's not, it's probably worth doing.)

Limit the number of unique FQDNs during verification, regardless of number of metadata URLs.

Also, limit the number of underlying IPs after DNS resolution, and/or domain names, because if a server has a wildcard DNS entry, an attacker can generate infinite FQDNs that all hit the same server.

Oh, and we may want to ensure we put a reasonable timeout on the requests (on our side).

Also, we may want to control (& advertise) the user agent we use to make those requests, and potentially also the outbound IPs if possible. Some large players (RTD, github.io, ...) might find that we make a large number of requests to them that count against their rate limits; they might be inclined to put a bypass in place for us, and it's much easier if we make it clear how to identify our requests.

And maybe keep a log of all the requests we made and which release each is linked to? Could we end up in trouble if someone makes us request illegal content? Could the PyPI IPs end up on some FBI watchlist? (I wonder if using Cloudflare's 1.1.1.3 resolver "for families", which blocks malware & adult content, could mitigate this risk... but I don't know if that's within the terms of use.)

Oh, we also need to protect ourselves from SSRF. Even though in this case we're not displaying what we requested back to the user, in the hopefully nonexistent but possible case that an internal GET request has side effects, this could be catastrophic. E.g. we're on AWS: if the user publishes a package with the URL http://169.254.169.254/latest/meta-data/, then suddenly the page we manipulate contains a usable AWS token from the machine that made the request. (That particular case is not a problem, as we're just going to disregard the response for not containing the meta tag; no tag at all, it's JSON.) We need to make sure that the URL contains a proper domain (not an IP, not localhost).

Oh, and MITM of course. We should only try to validate HTTPS URLs; validating an HTTP URL would only lead to an untrustable result.

Just for completeness: if the page is cut off due to being more than 1 MB but we still want to check the <head>, we'll need an HTML parser that doesn't crash on partial content.

Should we request <domain>/robots.txt and do something in case it disallows us?

I guess this is the kind of question everyone should ask when they implement a webhook or a crawler or anything. There surely is a resource out there from people who have solved these headaches already.
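
To make some of the points above concrete (identifiable user agent, timeout, ~1 MB cap, HTTPS only, no server-controlled redirects), here's one way the fetch itself could look; the user agent string and limits are placeholders, not anything agreed:

import requests

USER_AGENT = "pypi.org-url-verification/0.1 (+https://pypi.org/help/)"  # placeholder value
MAX_BYTES = 1_000_000  # roughly the 1 MB cap Mastodon uses
TIMEOUT = (3.05, 10)   # connect / read timeouts, in seconds


def fetch_page_for_verification(url: str) -> str | None:
    """Fetch at most MAX_BYTES of an HTTPS page, returning (possibly truncated) HTML,
    or None on any failure."""
    if not url.startswith("https://"):
        return None
    try:
        with requests.get(
            url,
            headers={"User-Agent": USER_AGENT},
            timeout=TIMEOUT,
            stream=True,
            allow_redirects=False,  # don't follow server-controlled redirects
        ) as resp:
            if resp.status_code != 200:
                return None
            body = b""
            for chunk in resp.iter_content(chunk_size=8192):
                body += chunk
                if len(body) >= MAX_BYTES:
                    break  # truncated page; the HTML parser must tolerate this
            return body.decode(resp.encoding or "utf-8", errors="replace")
    except requests.RequestException:
        return None

Anything beyond this (retries, robots.txt handling, outbound IP control) would sit around such a function rather than inside it.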

woodruffw commented 2 weeks ago

Also, we may want to control (& advertise) the user agent we use to make those requests.

Yes, this makes a lot of sense to do!

And maybe keep a log of all the requests we made and what release it's linked to ? Could we end up in trouble if someone makes us make a request to illegal content ?

This log would presumably just be the list of URLs listed in the project's JSON API representation/latest release page on PyPI, no? I'm personally wary of PyPI retaining any more data than absolutely necessary, especially since in this case we're not actually storing any data from the URL, only confirming that the URL is serving HTML with a particular <meta> tag at a particular point in time.

Oh, we also need to protect ourselves from SSRF, even though in this case we're not displaying what we requested back to the user, in the hopefully inexistant but possible case that an internal GET request can have side effect, this could be catastrophic.

For SSRF, I think the main thing we'll need to do is prevent server-controlled redirects. In other words: if the URL itself doesn't serve the <meta> tag itself, we won't allow it to redirect us anywhere else. I don't think PyPI should worry about GETs being non-idempotent -- any web service that allows that is simultaneously thoroughly out of spec and would be immediately broken by the first spider to crawl the project on PyPI anyways 🙂

We need to make sure that the URL contains a proper domain (not an IP, not localhost)

I could be convinced that we should add this restriction as a practical matter, but I'm not sure it's that important in terms of security? If the URL has an IP as its host but otherwise matches the secure origin rules (i.e. HTTPS), is there a reason we shouldn't validate it?

Oh and MITM of course. We should only try to validate HTTPS urls, validating an HTTP URL would only lead to an untrustable result.

FWIW, this one at least is covered under "must be a secure origin" in https://github.com/pypi/warehouse/issues/8635#issuecomment-2289617568.

ewjoachim commented 2 weeks ago

For SSRF, I think the main thing we'll need to do is prevent server-controlled redirects. In other words: if the URL itself doesn't serve the tag itself, we won't allow it to redirect us anywhere else. I don't think PyPI should worry about GETs being non-idempotent -- any web service that allows that is simultaneously thoroughly out of spec and would be immediately broken by the first spider to crawl the project on PyPI anyways 🙂

The danger of SSRF is internal URLs. The request will be made from within the PyPI infrastructure and may have access to network-protected endpoints that aren't accessible to random spiders.

FWIW, this one at least is covered under "must be a secure origin" in https://github.com/pypi/warehouse/issues/8635#issuecomment-2289617568.

Ah, you're right, sorry.

If the URL has an IP as its host but otherwise matches the secure origin rules (i.e. HTTPS), is there a reason we shouldn't validate it?

This could make sense indeed.

ewjoachim commented 2 weeks ago

This could make sense indeed.

Hm, thinking again: if someone uses https://10.0.0.1, this means we ARE going to make the request, and if it just so happens that this IP is listening on 443, the request will go through and we will evaluate the result. It's probably not a big attack vector, but I'm not at all comfortable with the PyPI server being able to request any internal HTTPS URL (with a GET request where the path is the only thing controlled by the attacker).

ewjoachim commented 2 weeks ago

Oh, btw, should we make sure the port is not overridden (or force it ourselves to 443)? I don't know if there are protocols out there where we could do nasty things just by opening a TCP connection. I hope not.

woodruffw commented 2 weeks ago

The danger of SSRF is internal urls. The request will be made from within the PyPI infrastucture and may have access to network-protected endpoints that might not be accessible to random spiders

Ah, I see what you mean. Yeah, I think the expectation here would be that we deny anything that resolves to a local/private/reserved address range. IP addresses would be allowed only insofar as they represent public ranges (and serve HTTPS, per above).

Oh, btw, should we make sure the port is not overridden (or force it ourselves to 443) ? I don't know if there are protocols out there where we could do nasty things just by opening a TCP connection. I hope not.

I think this falls under the "PyPI isn't responsible if your thing breaks after you stuff a URL to it in your metadata," but this is another datapoint in favor of making our lives easy and simply not supporting anything other than HTTPS + domain names + port 443, with zero exceptions 🙂
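
A sketch of that "HTTPS + domain names + port 443, public addresses only" policy, checked before any request is made; the exact rules are still being discussed here, so treat this as an assumption-laden outline:

import ipaddress
import socket
from urllib.parse import urlsplit


def url_allowed_for_verification(url: str) -> bool:
    """HTTPS only, a hostname rather than an IP literal, the default port,
    and every address the hostname resolves to must be public."""
    parts = urlsplit(url)
    if parts.scheme != "https" or not parts.hostname:
        return False
    try:
        if parts.port not in (None, 443):  # no non-default ports
            return False
    except ValueError:  # malformed port component
        return False
    try:
        ipaddress.ip_address(parts.hostname)
        return False  # IP literals are rejected outright
    except ValueError:
        pass  # not an IP literal, so it's a hostname
    try:
        infos = socket.getaddrinfo(parts.hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return False
    for *_rest, sockaddr in infos:
        addr = ipaddress.ip_address(sockaddr[0])
        # Reject private, loopback, link-local (e.g. 169.254.169.254) and reserved ranges.
        if not addr.is_global:
            return False
    return bool(infos)

This still leaves a DNS rebinding / time-of-check gap (the name could resolve differently when the request is actually made), so a real implementation would want to pin the resolved address or re-validate at connection time.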

facutuesca commented 2 weeks ago

"Collate" verified URLs, i.e. don't re-perform URL verification if another file in the same release has already verified the URL within a particular time window (~15-60 minutes?)

What if we only perform this kind of verification once per release? As in, during the upload of the first file that creates the release.

The reason why we currently re-verify URLs for each file upload of the same release is because Trusted Publisher verification means that some file uploads might come from the Trusted Publisher URL and some not. So it makes sense to re-verify: the first file upload might not come from a relevant Trusted Publisher, but subsequent ones might.

However, this is not the case for this type of verification, since we're accessing resources independent of the upload process and authentication. So checking the URLs once during release creation might be a simple way of limiting the amount of requests we make.

ewjoachim commented 2 weeks ago

The reason why we currently re-verify URLs for each file upload of the same release is because Trusted Publisher verification means that some file uploads might come from the Trusted Publisher URL and some not. So it makes sense to re-verify: the first file upload might not come from a relevant Trusted Publisher, but subsequent ones might.

But the pages might change. I agree that once a page has been verified, it's probably fair to trust it for some amount of time, but if someone already has a URL set up, and learns about this feature and adds their meta tag and pushes a new version, we should recheck even if we've checked before.

facutuesca commented 2 weeks ago

but if someone already has a URL set up, and learns about this feature and adds their meta tag and pushes a new version, we should recheck even if we've checked before.

Yes, that's what I meant in my comment: we should do this kind of verification (meta tag) once per release:

So checking the URLs once during release creation might be a simple way of limiting the amount of requests we make.

Maybe the confusion is because I'm using "release" to refer to a new version of a package, so I'm saying we should recheck every time the user uploads a new version.

ewjoachim commented 2 weeks ago

Ah no, my bad, you said it right, I misunderstood. As far as I had understood, we had dismissed the idea of re-verifying a link, so once per release was already where I thought we were. So I thought you were suggesting once per project, but that's my bad :)