Closed ryanking13 closed 11 months ago
@rth Okay, I think this is ready to be reviewed.
I cached sys_tags and parse_wheel_filename, since those functions were bottlenecks when I checked.
In terms of performance, from_simple_api is slower than from_json_api. This is because the Simple API response does not have a version key, so we need to parse the wheel filename for every file. I added a few checks to speed this up, but it is still around 2 times slower.
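To illustrate the caching idea, here is a minimal sketch. It is not the code in this PR: `wheel_version` is a hypothetical, simplified stand-in for `packaging.utils.parse_wheel_filename` that only splits the filename on `-`, and `functools.lru_cache` is one way (assumed here) to avoid re-parsing a filename that has already been seen.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def wheel_version(filename: str) -> str:
    """Return the version field of a wheel filename.

    Hypothetical, simplified stand-in for packaging.utils.parse_wheel_filename:
    it splits on "-" and does no normalization or validation of the fields.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    # Layout: name-version[-build]-python-abi-platform.whl
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename: {filename!r}")
    return parts[1]


filename = "numpy-1.26.0-cp311-cp311-win_amd64.whl"
wheel_version(filename)  # first call parses the filename
wheel_version(filename)  # second call is served from the cache
print(wheel_version.cache_info())
```

The cache only helps when the same filename is parsed more than once, for example when the same index response is processed repeatedly.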
For example, this is the result when I tested with numpy, which has ~2000 files in the index:
from micropip.package_index import ProjectInfo
from pathlib import Path
from gzip import decompress
import json
import timeit
numpy_json = json.loads(decompress(Path("tests/test_data/pypi_response/numpy_json.json.gz").read_bytes()))
numpy_simple = json.loads(decompress(Path("tests/test_data/pypi_response/numpy_simple.json.gz").read_bytes()))
num_runs = 100
def run_json():
    ProjectInfo.from_json_api(numpy_json)
def run_simple():
    ProjectInfo.from_simple_api(numpy_simple)
print("Total number of files:", len(numpy_simple["files"]))
print(f"Parsing JSON API for {num_runs} times:", timeit.timeit(run_json, number=num_runs), "seconds")
print(f"Parsing Simple API for {num_runs} times:", timeit.timeit(run_simple, number=num_runs), "seconds")
Total number of files: 2185
Parsing JSON API for 100 times: 0.11372180000762455 seconds
Parsing Simple API for 100 times: 0.243828200007556 seconds
I cached sys_tags and parse_wheel_filename since those functions were bottlenecks when I checked
Thanks, it's good to know! In what remains, it looks like Version.__str__ and Version.__hash__ (e.g. when a Version is used as a dict key) are also non-negligible (measured with pyinstrument).
So if we want to go further one day, this could be micro-optimized a bit more, though individually each call is already quite fast:
In [1]: from packaging.version import Version
In [2]: v = Version("1.2.0")
In [3]: %timeit hash(v)
423 ns ± 2.47 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [4]: %timeit str(v)
650 ns ± 2.04 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [5]: %timeit str("1.2.0")
12.2 ns ± 0.0178 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)
Thanks it's good to know! In what remains it looks like Version.__str__ and Version.__hash__ (e.g. when using as a dict key) is also non negligible (run with pyinstrument),
Right, converting string <--> Version takes quite some time... I'm not sure how many people will run into performance issues with this (especially given that network latency accounts for a much larger share of the total), but it's definitely something we can optimize. By the way, pyinstrument is cool. Thanks for introducing an awesome tool :)
Thanks for the reviews!
This adds an intermediate class, ProjectInfo (I can't think of a better name... open to other names), that parses both JSON and Simple API PyPI responses. micropip now takes this ProjectInfo class instead of a raw PyPI JSON response when searching for packages.
This is in preparation for making micropip support both the JSON and Simple APIs.
Related: #62
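As a rough sketch of the idea (not the actual class in this PR), the intermediate class could look something like the following. `ProjectInfoSketch` and its fields are hypothetical names. The payload shapes assumed are the PyPI JSON API (`info.name` plus a `releases` mapping of version to files) and the PEP 691 Simple API JSON (`name` plus a flat `files` list). Keeping only wheels and reading the version from the second `-`-separated field of the filename are simplifying assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ProjectInfoSketch:
    """Hypothetical stand-in for ProjectInfo; not micropip's implementation."""

    name: str
    # Maps a version string to the wheel file entries for that version.
    releases: dict[str, list[dict[str, Any]]] = field(default_factory=dict)

    @classmethod
    def from_json_api(cls, data: dict[str, Any]) -> ProjectInfoSketch:
        # JSON API: files are already grouped by version under "releases".
        releases: dict[str, list[dict[str, Any]]] = {}
        for version, files in data["releases"].items():
            wheels = [f for f in files if f["filename"].endswith(".whl")]
            if wheels:
                releases[version] = wheels
        return cls(data["info"]["name"], releases)

    @classmethod
    def from_simple_api(cls, data: dict[str, Any]) -> ProjectInfoSketch:
        # Simple API: a flat "files" list with no version key, so the
        # version has to be recovered from each wheel filename.
        releases: dict[str, list[dict[str, Any]]] = {}
        for file in data["files"]:
            filename = file["filename"]
            if not filename.endswith(".whl"):
                continue
            version = filename.split("-")[1]  # simplified parsing (assumption)
            releases.setdefault(version, []).append(file)
        return cls(data["name"], releases)
```

With both constructors producing the same shape, the code that searches for a compatible wheel would only need to know about one type, regardless of which index API returned the data.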