python / pythondotorg

Source code for python.org
https://www.python.org
Apache License 2.0

Improve json api search. #2479

Closed pconesa closed 2 months ago

pconesa commented 3 months ago

Maybe this already exists and is just not easy to find.

We have a package/plugin-based application that uses the PyPI API to discover packages.

For this, we first have our own page that lists the allowed plugins and returns something like ["package1", "package2"].

Note that "package1" and "package2" are valid PyPI packages.

Now, given this, we would like to get the latest version of each package, but we do not know it, so we have to fetch the long and verbose https://pypi.org/pypi/package1/json and parse all of its content for each available plugin, several tens of them (about 50).

Performance is bad.

Is there a way to get just the latest metadata of a package without first loading https://pypi.org/pypi/package1/json?

I'm aware of https://pypi.org/simple/, but it is not JSON based and has no filtering option.

For us something like this would work:

https://pypi.org/pypi/package1/latest/json

Note that latest would be literal, meaning the latest available version.
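For illustration (an editorial sketch, not part of the original report; `latest_version`, `fetch_latest`, and the trimmed sample payload are hypothetical), the workaround described above amounts to downloading the full per-project JSON document just to read a single field:

```python
import json
import urllib.request


def latest_version(metadata: dict) -> str:
    # The /pypi/<name>/json payload nests the newest version under "info".
    return metadata["info"]["version"]


def fetch_latest(package_name: str) -> str:
    # Downloads the *entire* project document just to read one field --
    # this per-package overhead is what the issue is about.
    url = f"https://pypi.org/pypi/{package_name}/json"
    with urllib.request.urlopen(url) as resp:
        return latest_version(json.load(resp))


# Offline demonstration with a trimmed sample payload:
sample = {"info": {"version": "2.1.0"}, "releases": {"1.0": [], "2.1.0": []}}
print(latest_version(sample))  # 2.1.0
```

Repeating `fetch_latest` for ~50 plugins means ~50 full downloads, which is where the poor performance comes from.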

vijulondhe commented 2 months ago

Hi everyone,

I understand the challenge of fetching the latest version metadata for multiple packages from PyPI without hitting performance issues due to the extensive JSON data.

While PyPI currently does not provide an endpoint like https://pypi.org/pypi/package1/latest/json, you can improve performance by using parallel requests and caching. Here are some strategies and examples to help:

1. Parallel Requests

You can use the aiohttp library to make asynchronous HTTP requests, which allows you to fetch metadata for multiple packages in parallel. This reduces the overall wait time for responses.

```python
import asyncio

import aiohttp


async def fetch_package_data(session, package_name):
    url = f"https://pypi.org/pypi/{package_name}/json"
    async with session.get(url) as response:
        data = await response.json()
        latest_version = data["info"]["version"]
        return {package_name: latest_version}


async def fetch_all_packages(package_names):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_package_data(session, pkg) for pkg in package_names]
        results = await asyncio.gather(*tasks)
        return results


def get_latest_versions(package_names):
    return asyncio.run(fetch_all_packages(package_names))


# List of packages
packages = ["package1", "package2", "package3"]
latest_versions = get_latest_versions(packages)
print(latest_versions)
```

2. Caching

To avoid redundant API calls, you can implement a caching mechanism. Here's an example using the cachetools library:

```python
import requests
from cachetools import TTLCache, cached

# Create a cache with a TTL of 1 hour
cache = TTLCache(maxsize=100, ttl=3600)


@cached(cache)
def get_package_version(package_name):
    url = f"https://pypi.org/pypi/{package_name}/json"
    response = requests.get(url)
    data = response.json()
    return data["info"]["version"]


# List of packages
packages = ["package1", "package2", "package3"]
latest_versions = {pkg: get_package_version(pkg) for pkg in packages}
print(latest_versions)
```

These approaches should significantly improve the performance of fetching the latest version metadata for a large number of packages. I hope this helps! If you have any questions or need further assistance, feel free to ask.
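The two strategies above can also be combined (an editorial sketch, not part of the original comment; the `TTLCache` class, `latest_versions` coroutine, and stub fetcher are hypothetical): cache parsed versions in memory and only issue network requests for cache misses, so repeated refreshes stay cheap.

```python
import asyncio
import time


class TTLCache:
    """Minimal time-based cache (the cachetools library would also work)."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]
        return None

    def put(self, key, value):
        self._data[key] = (value, time.monotonic())


async def latest_versions(package_names, fetch, cache):
    # `fetch` is any coroutine mapping a package name to its latest version
    # string (e.g. an aiohttp-based fetcher like the one above).
    async def one(name):
        cached = cache.get(name)
        if cached is not None:
            return name, cached  # cache hit: no network round trip
        version = await fetch(name)
        cache.put(name, version)
        return name, version

    # gather() preserves input order, so the dict keys match package_names.
    return dict(await asyncio.gather(*(one(n) for n in package_names)))


# Demonstration with a stub fetcher instead of a real PyPI call:
async def stub_fetch(name):
    return {"package1": "1.0", "package2": "2.3"}[name]


cache = TTLCache(ttl=3600)
result = asyncio.run(latest_versions(["package1", "package2"], stub_fetch, cache))
print(result)  # {'package1': '1.0', 'package2': '2.3'}
```

A second call within the TTL window would be served entirely from the cache.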

hugovk commented 2 months ago

This issue tracker is for the python.org website.

The issue tracker for pypi.org is at https://github.com/pypi/warehouse/.

But this would be better asked in the Python Help category at https://discuss.python.org/c/users/7.

Please close this issue and ask there or at Stack Overflow.

Mariatta commented 2 months ago

Closing because it is not relevant for this repo.