tinyrange / pkg2

Package Metadata Database
https://pkg2.thrush-map.ts.net/
Apache License 2.0

Improve package fetcher development iteration times #12

Closed Vbitz closed 2 weeks ago

Vbitz commented 3 weeks ago

Right now, developing a package fetcher for a large repo is an absolute pain. You have to iterate using -forceRefresh -test to make sure the fetchers are updated each time.

The ideal would be to fetch package metadata in two phases. The first phase just downloads the indexes and writes out a file with all the data parsed into JSON. The second phase (which can run in parallel) loads all the metadata into the database. The first phase is cached until the underlying file is invalidated, and the second phase runs each time the index is loaded (or maybe it should be cached as well... something to test).
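As a rough illustration of the two phases, here is a minimal Python sketch. It is not pkg2 code: the names fetch_records, load_records, and parse_demo_index are hypothetical, and keying the cache on a SHA-256 of the index contents is just one assumed way to detect that the underlying file changed.

```python
import hashlib
import json
import os
import tempfile


def fetch_records(index_bytes, parse, cache_dir):
    """Phase 1: parse the raw index once and cache the records as JSON."""
    # Assumption: the cache key is a hash of the index contents, so a
    # changed index lands in a different cache file and is re-parsed.
    key = hashlib.sha256(index_bytes).hexdigest()
    cache_path = os.path.join(cache_dir, key + ".json")
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    records = parse(index_bytes)
    with open(cache_path, "w") as f:
        json.dump(records, f)
    return records


def load_records(records, build_package):
    """Phase 2: turn every record into a package; reruns on each load."""
    return [build_package(rec) for rec in records]


def parse_demo_index(data):
    # Hypothetical toy format: one "name version" pair per line.
    return [
        dict(zip(("name", "version"), line.split()))
        for line in data.decode().splitlines()
        if line
    ]
```

With this split, changing only the package-building function reruns phase 2 against the cached JSON, with no need to download or re-parse the index.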

From a fetcher's perspective this means the code will be split into two functions. The first function just emits each record, and the second turns the emitted records into packages using the current API.

The abstraction also means I can grab package metadata from individually downloaded files in the future (important for user-provided RPMs and Debs).

A refactored version of the alpine fetcher looks like this:

def parse_alpine_package(ctx, url, repo, ent):
    pkg = ctx.add_package(ctx.name(
        name = ent["P"],
        version = parse_apk_version(ent["V"]),
        architecture = ent["A"],
    ))

    pkg.set_raw(json.encode(ent))

    pkg.set_description(ent["T"])
    if "L" in ent:
        pkg.set_license(ent["L"])
    pkg.set_size(int(ent["S"]))
    if "I" in ent:
        pkg.set_installed_size(int(ent["I"]))

    pkg.add_source(kind = "apk", url = "{}/{}-{}.apk".format(url, pkg.name, ent["V"]))
    if opt(ent, "c") != "":
        pkg.add_build_script("alpine", (ent["c"], "{}/{}/APKBUILD".format(repo, ent["o"])))

    pkg.add_metadata("url", opt(ent, "U"))
    pkg.add_metadata("origin", opt(ent, "o"))
    pkg.add_metadata("commit", opt(ent, "c"))
    pkg.add_metadata("maintainer", opt(ent, "m"))

    for depend in split_dict_maybe(ent, "D", " "):
        if depend.startswith("!"):
            pkg.add_conflict(parse_apk_name(ctx, depend.removeprefix("!")))
        else:
            pkg.add_dependency(parse_apk_name(ctx, depend))

    for alias in split_dict_maybe(ent, "p", " "):
        pkg.add_alias(parse_apk_name(ctx, alias))

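def fetch_alpine_repository(ctx, url, repo):
    ctx.pledge(semver = True)

    # Register the second-phase function. It is called once per emitted
    # record, with (url, repo) passed through as extra arguments.
    ctx.defer(parse_alpine_package, (url, repo))

    resp = fetch_http(url + "/APKINDEX.tar.gz")

    # Nothing to emit if the index could not be downloaded.
    if resp == None:
        return

    apk_index = resp.read_archive(".tar.gz")["APKINDEX"]

    contents = parse_apk_index(apk_index.read())

    # First phase: emit the raw records without building packages.
    for ent in contents:
        ctx.emit(ent)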
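To make the defer/emit split concrete, here is a small Python sketch of how a host could drive such a fetcher. Everything in it is assumed for illustration: FetchContext, run_deferred, parse_package, and the sample records are hypothetical stand-ins, not pkg2's actual implementation.

```python
class FetchContext:
    """Hypothetical host-side context; not pkg2's real implementation."""

    def __init__(self):
        self.deferred = None  # (callback, extra_args) registered by defer()
        self.records = []     # everything passed to emit()

    def defer(self, callback, args=()):
        # Remember which function should process emitted records later.
        self.deferred = (callback, tuple(args))

    def emit(self, record):
        # Phase 1 only collects records; no packages are built yet.
        self.records.append(record)

    def run_deferred(self):
        # Phase 2: call callback(ctx, *args, record) for each emitted record.
        if self.deferred is None:
            return []
        callback, args = self.deferred
        return [callback(self, *args, rec) for rec in self.records]


def parse_package(ctx, url, repo, ent):
    # Stand-in for parse_alpine_package: returns a plain dict, not a package.
    source = "{}/{}-{}.apk".format(url, ent["P"], ent["V"])
    return {"name": ent["P"], "source": source}


def fetch_repository(ctx, url, repo):
    ctx.defer(parse_package, (url, repo))
    # Made-up sample records in place of a parsed APKINDEX.
    for ent in [{"P": "musl", "V": "1.2.4-r2"}, {"P": "zlib", "V": "1.3-r0"}]:
        ctx.emit(ent)
```

Because the emitted records are plain data, the host is free to serialize them between the two phases and to rerun only run_deferred when the parsing function changes.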

At the start support for this will be opt-in.

I'm still not sure about this model so I'll give it a little more thought before I implement it.

Vbitz commented 2 weeks ago

This is now implemented with v2.