pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

XMLRPC statistics on "abusive" requests #9136

Closed abitrolly closed 1 month ago

abitrolly commented 3 years ago

What's the problem this feature will solve?

The XMLRPC search outage, which has been ongoing for two months and is reported at https://status.python.org/incidents/grk0k7sz6zkp, can be solved by optimizing or caching popular queries.

Describe the solution you'd like

I'd like to see the volume and contents of the:

Additional context

Depending on the statistics, it will be possible to provision additional index servers to offload API requests, or to provide a way for organizations to incrementally sync the database. Sync can be done either using global event notifications, similar to the Fedora Messaging system, or using the standard P2P Merkle tree lookup mechanism employed by blockchains.

ewdurbin commented 3 years ago

[Image: XMLRPC call rate over time]

Our current attempted call rate for the disabled search endpoint is roughly 100rps (yellow trace). All of these are receiving either a rate limit response (brown trace) or a disabled response (red trace). This call rate has not changed since we implemented rate limiting or disabled search.

The issue isn't solely one of provisioning resources to sustain the search volume; it is that we don't have any viable mechanism to communicate with users of the very expensive XMLRPC API who abuse the endpoint. Architecturally, XMLRPC being based on POST requests, combined with the high cardinality of results (search queries are arbitrary), makes caching this at the CDN edge or otherwise reducing the load imposed on our backends untenable in the long run.
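To illustrate the caching point, here is a minimal sketch using only Python's standard `xmlrpc.client`. It shows that the query lives in the POST body rather than in the URL, so two different searches look identical to a cache keyed on method and URL. The `search` method name and the `{"name": ...}` argument shape are assumptions made for illustration, as is the `xmlrpc_body` helper.

```python
import hashlib
import xmlrpc.client


def xmlrpc_body(method, *params):
    """Serialize an XML-RPC call into the bytes a client would POST."""
    return xmlrpc.client.dumps(params, methodname=method).encode("utf-8")


# Two hypothetical search calls (method name and argument shape are assumptions).
body_a = xmlrpc_body("search", {"name": "requests"})
body_b = xmlrpc_body("search", {"name": "flask"})

# Both bodies would be POSTed to the same URL, so only a key derived from
# the body itself can tell the two requests apart.
key_a = hashlib.sha256(body_a).hexdigest()
key_b = hashlib.sha256(body_b).hexdigest()
```

Because search terms are arbitrary, the number of distinct body-derived keys is effectively unbounded, which is the cardinality problem described above.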

Our current search is based on ElasticSearch, which I'm not familiar enough with to determine if such incremental syncs are viable.

abitrolly commented 3 years ago

@ewdurbin is it possible to publish stats by popularity on these 150rps without doing the actual requests? Without that, we can only state that optimization is impossible in a general sense.

ewdurbin commented 3 years ago

popularity in what sense?

abitrolly commented 3 years ago

Structure of the request: which query, and how popular such queries are. Then it will be possible to determine the overhead of certain query structures, set selective filters to cut expensive requests, and optimize the most popular ones further.
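A rough sketch of what such statistics could look like, assuming the raw request bodies were captured somewhere (the thread does not say that they are). It groups requests by method name and by the set of keys in the first argument, without executing any query. The `query_shape` and `popularity` helpers are hypothetical names.

```python
from collections import Counter
import xmlrpc.client


def query_shape(body):
    """Return (method name, sorted keys of the first struct argument)."""
    params, method = xmlrpc.client.loads(body)
    spec = params[0] if params and isinstance(params[0], dict) else {}
    return method, tuple(sorted(spec))


def popularity(bodies):
    """Count captured request bodies by query shape; nothing is executed."""
    shapes = Counter()
    for body in bodies:
        try:
            shapes[query_shape(body)] += 1
        except Exception:
            # Anything the XML-RPC parser rejects is counted separately.
            shapes[("<malformed>", ())] += 1
    return shapes
```

Calling `popularity(bodies).most_common(10)` would list the ten most frequent shapes. Note that `xmlrpc.client` is not hardened against maliciously constructed XML, so untrusted captures would need a size limit or a hardened parser.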

di commented 3 years ago

How do you propose to "set selective filters to cut expensive requests" and how would that be less expensive than the current response?

abitrolly commented 3 years ago

Filters can be set at the load balancer, at the web server, in middleware, or at the Django level. It might be possible to set them at the SQL level if SQL can explain that the query is too expensive to run. Whichever method is chosen, it depends on metrics. The best way is to add OpenTracing, of course. Maybe the "abusive" requests are just malformed XML that makes the parser choke.
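As an illustration of filtering at the middleware layer, here is a sketch of a generic WSGI wrapper that inspects the XML-RPC method name and answers with a fault before the request reaches the application. It is not Warehouse's implementation; the class name, the blocked method set, the fault code, and the body size limit are all assumptions.

```python
import io
import xmlrpc.client


class XMLRPCMethodFilter:
    """Hypothetical WSGI middleware rejecting selected XML-RPC methods."""

    def __init__(self, app, blocked=("search",), max_body=64 * 1024):
        self.app = app
        self.blocked = set(blocked)
        self.max_body = max_body

    def __call__(self, environ, start_response):
        if environ.get("REQUEST_METHOD") != "POST":
            return self.app(environ, start_response)
        try:
            length = int(environ.get("CONTENT_LENGTH") or 0)
        except ValueError:
            length = 0
        if length > self.max_body:
            return self._fault(start_response, "request body too large")
        body = environ["wsgi.input"].read(length)
        try:
            _params, method = xmlrpc.client.loads(body)
        except Exception:
            return self._fault(start_response, "malformed XML-RPC request")
        if method in self.blocked:
            return self._fault(start_response, f"method {method!r} is disabled")
        # Hand the already-consumed body back to the wrapped application.
        environ["wsgi.input"] = io.BytesIO(body)
        return self.app(environ, start_response)

    @staticmethod
    def _fault(start_response, message):
        # Passing a Fault to dumps() produces a methodResponse with a <fault>.
        fault = xmlrpc.client.Fault(-32500, message)
        payload = xmlrpc.client.dumps(fault).encode("utf-8")
        headers = [("Content-Type", "text/xml"),
                   ("Content-Length", str(len(payload)))]
        start_response("200 OK", headers)
        return [payload]
```

A filter like this still has to accept the connection, read the body, and parse the XML for every request, so whether it would be any cheaper than an unconditional "disabled" response is not established by anything in this thread.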

ewdurbin commented 1 month ago

XMLRPC search has now been disabled for over three years and is not going to be reinstated. We have further disabled additional endpoints via the efforts of #16642. Given this, I am going to close this issue, as our path forward is less to determine/mitigate specific patterns and more to establish new endpoints that are more readily cacheable and to deprecate/disable the remaining endpoints.