pyodide / micropip

A lightweight Python package installer for Pyodide
https://micropip.pyodide.org
Mozilla Public License 2.0

Add a common class to store JSON / Simple API response data #71

Closed ryanking13 closed 11 months ago

ryanking13 commented 12 months ago

This adds an intermediate class, ProjectInfo (I can't think of a better name; open to suggestions), that parses both JSON and Simple API responses from PyPI. micropip now takes a ProjectInfo instead of a raw PyPI JSON response when searching for packages.

This prepares micropip to support both the JSON API and the Simple API.

Related: #62

ryanking13 commented 11 months ago

@rth Okay, I think this is ready to be reviewed.

I cached sys_tags and parse_wheel_filename since those functions were bottlenecks when I checked.
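As a rough illustration of the caching idea (not the code in this PR), here is a minimal sketch using functools.lru_cache. The function parse_wheel_name is a hypothetical, simplified stand-in for parse_wheel_filename so that the example runs with the standard library only; it handles just the common wheel filename layout.

```python
from functools import lru_cache


# Hypothetical stand-in for the real wheel filename parser. It only handles
# "name-version[-build]-python-abi-platform.whl" and returns plain strings.
@lru_cache(maxsize=None)
def parse_wheel_name(filename: str) -> tuple:
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename}")
    name, version = parts[0], parts[1]
    tag = "-".join(parts[-3:])  # python tag, abi tag, platform tag
    return name, version, tag


wheel = "numpy-1.25.0-cp311-cp311-emscripten_3_1_32_wasm32.whl"
parse_wheel_name(wheel)  # first call does the actual parsing
parse_wheel_name(wheel)  # second call is served from the cache
print(parse_wheel_name.cache_info())
```

Because the cache is keyed on the filename string, repeated lookups of the same file skip the parsing work entirely, at the cost of keeping every parsed result in memory.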

In terms of performance, from_simple_api is slower than from_json_api. This is because the Simple API response has no version key, so we need to parse the wheel filename of every file to recover its version. I added a few checks to speed this up, but it is still around 2 times slower.
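To make the difference concrete, here is a simplified sketch of the two code paths, not the actual ProjectInfo implementation. It assumes the JSON API response groups files under a "releases" mapping keyed by version, and that the Simple API response is a flat "files" list; the helper names and the sample data are made up for illustration.

```python
from collections import defaultdict
from typing import Optional


def version_from_filename(filename: str) -> Optional[str]:
    """Return the version field of a wheel filename, or None for non-wheels."""
    if not filename.endswith(".whl"):
        return None  # cheap check before doing any splitting
    parts = filename[: -len(".whl")].split("-")
    return parts[1] if len(parts) in (5, 6) else None


def group_json_api(response: dict) -> dict:
    # JSON API (assumed shape): files are already keyed by version.
    return {
        version: [f["filename"] for f in files]
        for version, files in response["releases"].items()
    }


def group_simple_api(response: dict) -> dict:
    # Simple API (assumed shape): one flat list, so every filename is parsed.
    grouped = defaultdict(list)
    for f in response["files"]:
        version = version_from_filename(f["filename"])
        if version is not None:
            grouped[version].append(f["filename"])
    return dict(grouped)


# Made-up sample data; the JSON sample contains wheels only.
json_response = {
    "releases": {
        "1.0.0": [{"filename": "demo-1.0.0-py3-none-any.whl"}],
        "1.1.0": [{"filename": "demo-1.1.0-py3-none-any.whl"}],
    }
}
simple_response = {
    "files": [
        {"filename": "demo-1.0.0-py3-none-any.whl"},
        {"filename": "demo-1.1.0-py3-none-any.whl"},
        {"filename": "demo-1.1.0.tar.gz"},  # skipped: not a wheel
    ]
}

print(group_json_api(json_response))
print(group_simple_api(simple_response))
```

The extra per-file parsing step in group_simple_api is the kind of work that makes the Simple API path slower.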

For example, this is the result when I tested with numpy, which has ~2000 files in the index:

from micropip.package_index import ProjectInfo
from pathlib import Path
from gzip import decompress
import json
import timeit

numpy_json = json.loads(decompress(Path("tests/test_data/pypi_response/numpy_json.json.gz").read_bytes()))
numpy_simple = json.loads(decompress(Path("tests/test_data/pypi_response/numpy_simple.json.gz").read_bytes()))
num_runs = 100

def run_json():
    ProjectInfo.from_json_api(numpy_json)

def run_simple():
    ProjectInfo.from_simple_api(numpy_simple)

print("Total number of files:", len(numpy_simple["files"]))
print(f"Parsing JSON API for {num_runs} times:", timeit.timeit(run_json, number=num_runs), "seconds")
print(f"Parsing Simple API for {num_runs} times:", timeit.timeit(run_simple, number=num_runs), "seconds")

Total number of files: 2185
Parsing JSON API for 100 times: 0.11372180000762455 seconds
Parsing Simple API for 100 times: 0.243828200007556 seconds

rth commented 11 months ago

I cached sys_tags and parse_wheel_filename since those functions were bottlenecks when I checked

Thanks, that's good to know! In what remains, it looks like the cost of Version.__str__ and Version.__hash__ (e.g. when using a Version as a dict key) is also non-negligible (profiled with pyinstrument).

[Screenshot: pyinstrument profiler output]

So if we want to go further one day, maybe it could be micro-optimized a bit more. Individually, though, each call is already quite fast:

In [1]: from packaging.version import Version

In [2]: v = Version("1.2.0")

In [3]: %timeit hash(v)
423 ns ± 2.47 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [4]: %timeit str(v)
650 ns ± 2.04 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [5]: %timeit str("1.2.0")
12.2 ns ± 0.0178 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)

ryanking13 commented 11 months ago

Thanks it's good to know! In what remains it looks like Version.__str__ and Version.__hash__ (e.g. when using as a dict key) is also non negligible (run with pyinstrument),

Right, converting between strings and Version objects takes quite some time. I'm not sure how many people will hit performance issues with this (especially given that network latency is a much larger part of the total), but it's definitely something we can optimize. By the way, pyinstrument is cool. Thanks for introducing an awesome tool :)
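One possible direction, sketched here only as an idea and not something this PR implements, is to memoize both the string-to-version conversion and the computed hash and string form. SimpleVersion below is a hypothetical stand-in that only understands dotted numeric versions; whether the same trick would pay off with the real Version class is untested.

```python
from functools import lru_cache


class SimpleVersion:
    """Hypothetical stand-in for a version class (numeric releases only)."""

    __slots__ = ("_release", "_hash", "_str")

    def __init__(self, text: str) -> None:
        self._release = tuple(int(part) for part in text.split("."))
        self._hash = None  # filled in lazily on first hash()
        self._str = None  # filled in lazily on first str()

    def __eq__(self, other):
        if not isinstance(other, SimpleVersion):
            return NotImplemented
        return self._release == other._release

    def __hash__(self) -> int:
        if self._hash is None:
            self._hash = hash(self._release)
        return self._hash

    def __str__(self) -> str:
        if self._str is None:
            self._str = ".".join(map(str, self._release))
        return self._str


@lru_cache(maxsize=None)
def parse_version(text: str) -> SimpleVersion:
    # The same version string always maps to the same object.
    return SimpleVersion(text)


v = parse_version("1.2.0")
releases = {v: ["demo-1.2.0-py3-none-any.whl"]}  # hash is computed once here
print(str(v), parse_version("1.2.0") is v)
```

The trade-off is a little extra memory per object and a cache that grows with the number of distinct version strings, which is likely small compared with the number of files in an index.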

ryanking13 commented 11 months ago

Thanks for the reviews!