Better instructions / support for offline building

nh2 commented 1 year ago

Description

I'm trying to package pypdfium2 for NixOS / nixpkgs (the Linux distribution with currently by far the most packages in).

Like many other distros, the build system has the requirement that builds are hermetic, without network access.

For the beginning I'm trying to use the pre-built pdfium binaries provided.

But the build system (paths down from setup.py) still tries to connect to the network in various places:

downloading pdfium binaries
fetching tags with git
urlretrieve() in build_pdfium_bindings(); it has a headers_dir=None default argument, but no path in the code allows to set it to something not-None

I noticed the prepared! concept from the README, but not sure that's what's needed, as that seems to skip ctypesgen.

Would it be possible to provide setup.py with pre-downloaded files of whatever is needed to build a single pypdfium2 version?

For example, from what I've read so far, for version 6097 it seems it would be sufficient to provide:

If one could just provide those to setup.py with flags or environment variables, it would solve the hermetic distro packaging issue.

nh2 commented 1 year ago

Another thing that's not clear:

https://github.com/pypdfium2-team/pypdfium2/tree/a397ff7efa71c2fbb2200548bd9be7f30de84d7c#incompatibility-with-cpython-376-and-381

pypdfium2 built with mainstream ctypesgen cannot be used with releases 3.7.6 and 3.8.1 of the CPython interpreter due to a https://github.com/python/cpython/pull/16799#issuecomment-612353119 that https://github.com/ctypesgen/ctypesgen/issues/77 ctypesgen-created string handling code.

Since version 4, pypdfium2 is built with a patched fork of ctypesgen

The linked issues are fixed, and we're way past CPython 3.8.1, yet there's still

    assert getattr(ctypesgen, "PYPDFIUM2_SPECIFIC", False)

in the code -- is this intended, or should building with normal ctypesgen be supported again?

If no, the fork (https://github.com/pypdfium2-team/ctypesgen) should be linked in the README.

nh2 commented 1 year ago

Another point of confusion:

The README says

update_pdfium.py downloads binaries and generates the bindings.

and its parse_args() says

    parser = argparse.ArgumentParser(
        description = "Download pre-built PDFium packages and generate bindings.",
    )

But its main() seems to only download() and extract(), not generate any bindings.

mara004 commented 1 year ago

Thanks for your inquiry. Since a lot of questions are raised in this thread, it may take some time to answer it all.

First, note that there were some heavy changes to the setup infrastructure recently with the addition of conda packaging (PR #268), and the changes yet need some consolidation. Particularly, I still intend to add means of plugging in custom pdfium headers (and setting custom runtime libdirs) with the system target. Also, what you mention in https://github.com/pypdfium2-team/pypdfium2/issues/272#issuecomment-1793714510 (the incorrect desc) is an oversight from the change; bindings generation was externalized. I'll fix that soon.

I understand users may want an offline installation, and we have the wheel packages for that, but I'm afraid offline setup.py wasn't really an intended use case here. However, I believe it should already be doable with the prepared! prefix indeed (see new comment below). (We can also consider adding a new setup target to take a caller-provided binary+headers from a custom location if that helps.)

Concerning ctypesgen, the upstream issue is closed because CPython reverted the change, but ctypesgen's problematic bloated string code is still there. Upstream development is pretty stuck, and there are a lot more improvements in the fork, like the preventing of invalid struct fields (this can be very confusing, see https://github.com/pypdfium2-team/pypdfium2/discussions/221#discussioncomment-6710004) and a leaner and more controllable library loader. (See e.g. the commit message of https://github.com/pypdfium2-team/ctypesgen/commit/569dc4b9f697cb47e8dd76a6db078d13a9fcc701 for some of the technical debt in the old library loader.) We don't currently have a mind of adding back a compatibility layer for upstream ctypesgen. Note that the fork is in the Readme already - the link points to the fork, and the dependency section mentions ctypesgen (pypdfium2-team fork).

mara004 commented 1 year ago

What you could do now is to just copy in the offline-provided bindings and binary into src/pypdfium2_raw/. For convenience, you can use the reference bindings at autorelease/bindings.py to begin with, so you don't have to worry about ctypesgen and headers yet. The version file will have to be written manually (take a look at a generated example for the format). Then prepared!{platform_name}, e.g. prepared!linux_x64 should hopefully work.

(Update: No need to do this manually, I figured you can just use the emplace tooling, as long as we separate data files generation and the actual install.)

mara004 commented 1 year ago

I also took a quick look at your patch, but this is not an acceptable solution. If changes, then we should rather look at the higher-level setup.py/emplace.py/packaging_base.py instead of messing up update_pdfium.py.

mara004 commented 1 year ago

Actually, I believe you could just call ./run emplace {platform} in the "online allowed" stage of the package building to create the files, and then use prepared!{platform} in the offline-only stage.

You have to get the data files at some point, and instead of doing it manually you can just use the emplace tooling for that, while prepared! allows you to be hermetic on the actual install time.

nh2 commented 1 year ago

@mara004 You looked a bit too early, while I was still iterating on it. Please have a look now: PR #273

nh2 commented 1 year ago

I believe you could just call ./run emplace {platform} in the "online allowed" stage of the package building

@mara004 There is no "online allowed" stage in the normal packaging of hermetic/reproducible builds such as NixOS:

All inputs have to be downloaded before the packaged software's build system runs, inclusive of the hashes of the files to download, to guarantee reproducibility and content-addressed storage.

nh2 commented 1 year ago

we should rather look at the higher-level setup.py/emplace.py/packaging_base.py instead of messing up update_pdfium.py.

I agree; I figured this out as well while working on it. The PR from

was still iterating on it. Please have a look now: PR #273

implements this alternative approach.

mara004 commented 1 year ago

@mara004 There is no "online allowed" stage in the normal packaging of hermetic/reproducible builds such as NixOS:

All inputs have to be downloaded before the packaged software's build system runs, inclusive of the hashes of the files to download, to guarantee reproducibility and content-addressed storage.

Exactly, the inputs are downloaded. This is the "online allowed" prepare stage. Similar to how you are downloading the pdfium-binaries archive, you could just download the externally generated data files.

I believe you could create a wrapper repo where you force add the data files to a branch, push, and use that branch as source. The wrapper repo would have to be updated before you build a new version. This way, you could build pypdfium2 using prepared! without need for patches that are clearly not acceptable here.

nh2 commented 1 year ago

@mara004

patches that are clearly not acceptable here

Could you clarify what about them is not acceptable?

I implemented your suggestions from:

If changes, then we should rather look at the higher-level setup.py/emplace.py/packaging_base.py

I also believe that the changes in there carry some generally useful improvements, independent of hermetic builds. Please take a detailed look.

I believe you could create a wrapper repo where you force add the data files to a branch, push, and use that branch as source. The wrapper repo would have to be updated before you build a new version.

This will not be suitable to most Linux distributions; they prefer to rely on upstream repos of the original authors only, as you provide them. It will also provide a coordination bottleneck between distros.

mara004 commented 1 year ago

Hmm, I see the new patch's general idea is indeed better than before. I'll take some time to think over this again.

However, the patch is still changing the setup infrastructure in a way I don't appreciate. Offline setup just from a pdfium-binaries tarball is outside our intended use case, and not really something we would be willing to maintain or test in our codebase. I don't think it's a good idea to add secondary code passages just for awkward third-party uses without upstream benefit. Handling down the headers dir and binary tarballs, and processing them specially isn't nice. That said, I think the env vars are a bit misintegrated; they should be confined to a custom target to retain a pure auto-generating code path.

This will not be suitable to most Linux distributions; they prefer to rely on upstream repos of the original authors only, as you provide them. It will also provide a coordination bottleneck between distros.

You'd only use the repo to transfer the data files, without changes to the codebase itself. It doesn't have to be a repo, any externally prepared archive with the final data files as produced by our existing setup code would do.

FWIW the cleanest solution would be to just to allow the remote connection. This isolation requirement doesn't make any sense to me. (Which other distros imposing this as well?) If you can trust the upstream repo you can also trust the files it downloads. And it would still be possible to review the downloaded/generated files before publishing anything.

mara004 commented 1 year ago

Pragmatically reiterating on the idea from https://github.com/pypdfium2-team/pypdfium2/issues/272#issuecomment-1793748434, another alternative would be to write a downstream script to extract binary, version and headers solely from a pdfium-binaries tarball and call ctypesgen directly (or even just use the reference bindings we provide here), instead of messing with upstream code.

nh2 commented 1 year ago

Which other distros imposing this as well?

Most of them:

Debian

You should never assume that Internet access is available during building. Debian guarantees that every binary package can be built from the available source packages for licensing and security reasons.
Thus Ubuntu
Fedora

In no cases should source code be downloaded from any external sources
Thus RHEL
Gentoo

Internet access within the build procedure is forbidden

I believe it is like this already for decades; it's a fundamental of distro packaging.

Offline setup just from a pdfium-binaries tarball is outside our intended use case, and not really something we would be willing to maintain or test in our codebase.

I'm not sure I fully understand. The current code downloads a pdfium-binaries tarball, as a convenience for the user. Why is it wrong to make this convenience optional?

Handling down the headers dir and binary tarballs, and processing them specially isn't nice.

How about another alternative:

The build system could expect that the pdfium binary/headers need to be in a certain subdirectory.
The existing download convenience could then download it to that directory.
Alternatively, the Linux distro build system could pre-populate that directory.

In that case, no env vars would be needed, and the README would just have to tell the packager what that directory is.

Let me know!

nh2 commented 1 year ago

another alternative would be to write a downstream script to extract binary, version and headers solely from a pdfium-binaries tarball and call ctypesgen directly (or even just use the reference bindings we provide here),

This looks possible, too. The ctypesgen invocation does not look very large.

I haven't considered it so far, with the idea that the pypdfium2 build system is the "expert location" on how to invoke ctypesgen correctly (especially given that it is currently forked).

It didn't feel right that every Linux distro would have their own copy-pasted-around ctypesgen invocation instead a single source of truth.

But if upstream were to recommend such an invocation in the README as the official packaging approach, it would likely work for most distros as well.

mara004 commented 1 year ago

I don't think your directory alternative would work nicely, as we are already dividing directories by platform.

Please give me some time to consider the case. If changing pypdfium2 code, I'd prefer to write the implementation personally instead of taking a PR, to ensure a nice integration I'm willing to maintain.

I think a cleaner solution might be to add a new setup target PDFIUM_PLATFORM=caller, taking headers dir, path to the binary (not archive), and full version. This would allow us to confine the use case to a single, high-level handler. Taking a caller-provided pdfium-binaries archive and handling it down into update_pdfium causes pollution; it would be cleaner and more universal to let the caller extract the archive and point to the paths.

nh2 commented 1 year ago

path to the binary (not archive) ...

Taking a caller-provided pdfium-binaries archive and handling it down into update_pdfium causes pollution; it would be cleaner and more universal to let the caller extract the archive and point to the paths.

Note, this is exactly what my PR already does. It takes the path to the caller-extracted binary directory.

As mentioned in https://github.com/pypdfium2-team/pypdfium2/issues/272#issuecomment-1793809681, you were too fast for me :) You looked at my patch too early. The env var that took an archive is long gone, it was only present while I was iterating on the patches, before I created the PR.

Please do check the the current content of the PR, otherwise we may not be talking about the same code.

and full version

Also, Pdfium binary contain a VERSION file, with the full version content that you're currently fetching via git:

MAJOR=120
MINOR=0
BUILD=6097
PATCH=0

The PR adds functionality to handle those, so that git might be unnecessary in some (many?) cases, which might be cleaner.

I think a cleaner solution might be to add a new setup target PDFIUM_PLATFORM=caller

This sounds fine to me.

I used the existing PDFIUM_PLATFORM=linux_x64:1234 approach because I didn't want to make assumptions on whether or not the setup.py code may want to add logic that is conditional on the version number (1234 in my example), so I didn't want to hide this information from it.

But afterwards I discovered that VERSION file mentioned above, so theoretically PDFIUM_PLATFORM=caller would work just as well, because one could require that the caller ensure that this VERSION file exists, so that setup.py can still discover the full version even if it's not given in the PDFIUM_PLATFORM env var.

Please give me some time to consider the case. If changing pypdfium2 code, I'd prefer to write the implementation personally instead of taking a PR, to ensure a nice integration I'm willing to maintain.

Sounds great!

Feel free to pick anything from the PR you find useful. There are multiple commits in it; the earlier ones have some minor cleanup contributions.

mara004 commented 1 year ago

Sorry, I did view the PR, but was a bit too tired and mixed up the different states. You're right it's not an archive, but still PDFIUM_BINARIES_DIR, which is not identical to a direct binary path, either.

I'm afraid I might not be able to continue on this in the near term, but I'll resume it when I have more time and am less stressed.

mara004 commented 1 year ago

Also, Pdfium binary contain a VERSION file, with the full version content that you're currently fetching via git: The PR adds functionality to handle those, so that git might be unnecessary in some (many?) cases, which might be cleaner.

I'm not sure if it really is cleaner. We already need the git ls-remote for sourcebuild (build_pdfium.py), so handling the version file of pdfium-binaries would just incur another code passage to maintain with no obvious benefit for us.

Also I'm reluctant to rely on the extra files added by pdfium-binaries. I've been moving away from them in the past, as retrieving the info directly is more universal. Until recently we used the headers from pdfium-binaries and built bindings for each platform separately, but when adding the system target and reference bindings, we had to source the headers from pdfium directly, so I figured it would be best to use this strategy for all platforms, and cache bindings independently with a single source.

nh2 commented 1 year ago

I'm afraid I might not be able to continue on this in the near term, but I'll resume it when I have more time and am less stressed.

Sure -- no stress!

I'll share my packaging WIP meanwhile, in case it is interesting for you or other distro packagers:

mara004 commented 1 year ago

FYI, I have reached an early decision after all: https://github.com/pypdfium2-team/pypdfium2/discussions/274

pypdfium2-team / pypdfium2

Better instructions / support for offline building #272

Description