jqnatividad / qsv

Blazing-fast Data-Wrangling toolkit
https://qsv.dathere.com
The Unlicense

Consider releasing linux assets with `.tar.gz` instead of `.zip` + misc improvements #2145

Open polarathene opened 2 months ago

polarathene commented 2 months ago

Is your feature request related to a problem?

On linux the .zip archive format is not as desirable for publishing releases (GH release assets).

.zip:

# Extracted as a variable to focus on the rest of the command:
QSV_RELEASE_URL=https://github.com/jqnatividad/qsv/releases/download/0.134.0/qsv-0.134.0-x86_64-unknown-linux-musl.zip

curl -o qsv.zip -fsSL "${QSV_RELEASE_URL}" && unzip -d /usr/local/bin qsv.zip qsv && rm qsv.zip

vs .tar.gz:

# NOTE: URL adjusted for complementary version drop from asset name, enabling deterministic `latest` download URL.
# NOTE: `--no-same-owner` used as a GH release will typically have `1001:127` UID + GID
# (for non-root users this option is the default)
QSV_RELEASE_URL=https://github.com/jqnatividad/qsv/releases/latest/download/qsv-x86_64-unknown-linux-musl.tar.gz

curl -fsSL "${QSV_RELEASE_URL}" | tar -xz --no-same-owner -C /usr/local/bin qsv

Additional justification:

Describe the solution you'd like

Publishing assets for linux with .tar.gz as most projects do on GH releases for this platform.

Also consider:

Describe alternatives you've considered

No major issues. For manual downloads I can grab the latest URL from the release process with a few more clicks, then add a package in environments that need it. It would be nicer to not need to think about it though :)

I had also tried to leverage Cargo Binstall but got an error due to the .zip format not being as well supported.

At one point I did try building from source (this was problematic due to build requirements being ridiculous to try a CLI program out, over 10GB of memory was used).

Additional context

At this point changing the asset name (either by extension or the complementary version omission) is technically a "breaking" change that'd affect any automated processes (once they version bump at least). Technically since QSV is still on 0.x.y, whenever x increases breaking changes are permitted, so that is your call 👍

GoReleaser could help with the archive by platform difference if you were to pursue this change.


UPDATE: I see that your CI is reliant upon the release tag in the asset name, so that may be a blocker for dropping the version name:

https://github.com/jqnatividad/qsv/blob/2652d76504a70a9856e7010bd482ce73cbac9dba/.github/workflows/publish-linux-qsvpy-glibc-231-musl-123.yml#L166-L172

BTW, you could probably simplify the build process a bit (especially for the lower glibc requirement and musl build support) by using Zig (see cargo-zigbuild).

jqnatividad commented 2 months ago

@polarathene, thanks for the detailed, considered request.

I'll definitely look into your suggestions to improve the publishing/distribution of qsv. I agree that tar.gz is a more common format in *nix environments, however, there are several extenuating circumstances that need to be considered:

Just the same, as self_update and zipsign do support the tar.gz format, I'll investigate your request in more detail and look into your other recommendations as well.

FYI, we primarily target the latest Ubuntu LTS on the x86_64 platform, which is glibc-based, as that is our standard deployment platform for our CKAN PaaS service. I don't get to exercise the musl build as much, so we depend on community feedback to improve it.

polarathene commented 2 months ago

Thanks for sharing those insights, very informative 😁


  • qsv uses the self_update crate for its self-update feature. The naming convention of the zip archives was primarily dictated by self_update's naming requirements, so this is indeed a blocker per your update.

You could probably work around that with an alternative host for releases? I know some other projects like Caddy have open-source plans with Cloudsmith (linked to related GitHub Action) for publishing releases either as packages or raw files/archives.

EDIT: Ah, seems like self_update has a limited set of "backends", so they'd need to add Cloudsmith I guess if you were to consider that.

The intent was that Cloudsmith would be the place your versioned archives are stored for self-update functionality, and GH releases would be more akin to other projects' GH releases 😅


  • With tar.gz, wouldn't it necessarily need to untar and decompress ALL the variants, before getting the desired variant?

I've not inspected to check if it's decompressing the entire archive, but I really don't see that being a pragmatic concern for most?

With .tar.gz I can still extract just the file of interest, as the tar example above shows. The benefit is that I don't need to write the archive to disk and remove it after. I can just pipe it, as the memory required is so minimal that writing a temporary copy to disk seems redundant.
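To illustrate the point about streaming, here is a minimal local sketch that needs no network access. It builds a small multi-file `.tar.gz`, then pipes it into `tar` and extracts only one member. The file names and contents are stand-ins chosen for illustration, and `cat` plays the role that `curl -fsSL "${QSV_RELEASE_URL}"` plays in the real command. `-f -` is spelled out so the snippet does not depend on tar's default archive source.

```shell
#!/bin/sh
set -eu

workdir="$(mktemp -d)"
mkdir -p "$workdir/src" "$workdir/dest"

# Stand-in files for a multi-variant release archive (contents are placeholders).
printf 'main binary\n'   > "$workdir/src/qsv"
printf 'lite variant\n'  > "$workdir/src/qsvlite"
printf 'other variant\n' > "$workdir/src/qsvdp"
tar -czf "$workdir/release.tar.gz" -C "$workdir/src" qsv qsvlite qsvdp

# `cat` stands in for the curl download: the archive arrives on stdin,
# and tar writes only the requested member (qsv) into the destination.
cat "$workdir/release.tar.gz" | tar -xz -f - -C "$workdir/dest" qsv

# Record what landed in the destination before cleaning up.
extracted="$(ls "$workdir/dest")"
echo "extracted: $extracted"
rm -rf "$workdir"
```

The whole stream is still read and decompressed, but only the named member is written out, and no temporary archive needs to be created or removed by the user.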

  • this "selective decompression" may very well have been an "incidental" feature of the self_update crate for zip, as it decompresses the binary by name - allowing me to pack all the variants in one big archive per platform.

I think that's more convenient on your end than the users though? Most only need a single variant, but instead need to download the whole archive redundantly to get that.

No bandwidth fees with GitHub on your end, so I can understand why that's fine. For users it also minimizes searching through the list of links to find one they're interested in, although with automation or self-update that's often a one-time benefit.

  • self-update automatically removes the zip archive after it runs.

I assume only when it's pulling an update? Not the original archive? I didn't check that, but since I renamed the archive to make it simpler to extract qsv via CLI it wouldn't know what to remove anyway 🤷‍♂️


  • qsv's release tempo is quite high. That's why self-update is essential, as it really simplifies the process of getting the latest version once you go through the initial installation of the prebuilt binaries.

I understand the value of self-updating for some users, but it's not something I want myself, and it rarely is on Linux (we have package managers for this purpose). The last thing I would want is to run some project (say one that is 6-12 months old) that calls qsv commands, only for the binary to have self-updated with a breaking change I wasn't aware of. Functionality that should have worked is then broken, and time has to be diverted to fix it (like a forced Windows update).

I am aware of some distros having packages (often community-maintained) that pull a pre-built release from official channels rather than doing a local build on the client, as opposed to the more official per-distro package repos where packages are built from source. Likewise in my case with Docker, adding qsv into a project's image from a GH release is nicer than the 10GB memory + time to build qsv in CI. These are scenarios where self-update functionality is not expected.


  • we zipsign the archives for authenticity, and self_update supports zipsigned archives.

You can accomplish the same with other formats, but I understand the choice with zip and self-updating features. My intention isn't to burden you with further maintenance, just to raise awareness of where some minor friction is for a different type of user.

EDIT: FWIW self_update has zipsign support (and archive support) for tar.gz.

cargo binstall does have its own support for verifying signed assets. That said, it should work well once a dependency resolves the zip archive compatibility issue (caused by multiple files in the archive I think?), which AFAIK has been done but not released for over 6 months. This would have been a non-issue with tar.gz.


FYI, we primarily target the latest Ubuntu LTS on the x86_64 platform, which is glibc-based, as that is our standard deployment platform for our CKAN PaaS service. I don't get to exercise the musl build as much, so we depend on community feedback to improve it.

Static musl builds work on glibc systems (you usually can't have proper static glibc). For qsv though there may be a difference in performance, it'd need to be benched.

My comment that you responded to though was about leveraging Zig to compile both glibc and musl builds from the same build host with the added benefit of not requiring any musl related deps:

https://github.com/jqnatividad/qsv/blob/2652d76504a70a9856e7010bd482ce73cbac9dba/.github/workflows/rust-musl.yml#L34-L37

Your glibc min version will be from the build host:

https://github.com/jqnatividad/qsv/blob/2652d76504a70a9856e7010bd482ce73cbac9dba/.github/workflows/rust.yml#L20

Presently ubuntu-latest resolves to Ubuntu 22.04, once that switches over to 24.04, the glibc min version support will be implicitly raised and anyone on older glibc will likely open issues about errors running qsv.

With Zig (via cargo zigbuild) you can instead specify the minimum glibc version you want to support, so you would not be affected by such a change. What most projects typically did before Zig was build on older distro releases, but that sometimes comes with drawbacks from the older software (some projects would use distro releases that are 5 or more years old to get broader compatibility with glibc).

So there's potential with zig to simplify your workflows / CI.
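As a rough sketch of what that could look like, the script below assembles the two `cargo zigbuild` invocations, one glibc target with a version suffix and one musl target. The glibc floor of `2.31` and the `RUN_BUILD` switch are assumptions made for illustration, not values taken from qsv's workflows. By default the script only prints the commands, so it is safe to run on a machine without cargo, cargo-zigbuild, or Zig installed.

```shell
#!/bin/sh
# Hypothetical values: choose the glibc floor you actually want to support.
GLIBC_MIN="2.31"
ARCH="x86_64"

# cargo-zigbuild accepts a glibc version suffix on gnu targets (triple.X.Y).
GNU_TARGET="${ARCH}-unknown-linux-gnu.${GLIBC_MIN}"
MUSL_TARGET="${ARCH}-unknown-linux-musl"

# Prerequisite when building for real (triples without the version suffix):
#   rustup target add x86_64-unknown-linux-gnu x86_64-unknown-linux-musl

for target in "$GNU_TARGET" "$MUSL_TARGET"; do
  cmd="cargo zigbuild --release --target ${target}"
  if [ "${RUN_BUILD:-0}" = "1" ]; then
    # Real build: requires cargo, cargo-zigbuild and zig on PATH.
    $cmd
  else
    # Dry run (default): just show what would be executed.
    echo "would run: ${cmd}"
  fi
done
```

Setting `RUN_BUILD=1` runs the builds for real. Because the glibc floor is part of the target string, it no longer depends on which Ubuntu release the CI runner happens to use.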

jqnatividad commented 2 months ago

Thanks @polarathene for your detailed feedback.

I'll take your recommendations under further consideration as we refine the publishing workflow.

And thanks for pointing out that I should explicitly use ubuntu-22.04 instead of ubuntu-latest.

As for qsv, my number one goal for the project is to be the fastest CSV data-wrangling kit, hence the aggressive MSRV policy, taking advantage of the latest language features, the latest dependencies, etc.

One big ticket item that I haven't taken on with a big performance payoff is profile guided optimization (https://github.com/jqnatividad/qsv/issues/1448). Given the number of binaries/platforms we support, I can only do that for select platforms (starting with qsv pro).

I'll leave it to package maintainers should they choose to distribute qsv to fine-tune it to their requirements.

As for self-update, it's gated behind the self-update feature, so you can easily build qsv without it in your Docker images. As a further safety precaution, the actual self update only works with the prebuilt binaries. If you compile from source even with the self-update feature enabled, it will only alert you to new releases and will not actually apply self-updates.

And even if you choose to use the pre-builts in your Docker image, you can set the QSV_NO_UPDATE environment variable so it won't even check GH for new releases.
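For example, a minimal sketch of setting the variable in a shell session or entrypoint script. The value `1` is an assumption made for illustration, since the comment above only says the variable needs to be set. In a Dockerfile the equivalent would presumably be an `ENV QSV_NO_UPDATE=1` line.

```shell
#!/bin/sh
# Assumption: the value `1` is illustrative; the thread only says the
# variable must be set, not which value it expects.
export QSV_NO_UPDATE=1

# Only invoke qsv if it is on PATH, so the snippet is safe to run anywhere.
if command -v qsv >/dev/null 2>&1; then
  qsv --version
else
  echo "qsv not found on PATH; QSV_NO_UPDATE=${QSV_NO_UPDATE} is exported for later invocations"
fi
```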

Finally, self-update is not automatic. You have to explicitly opt-in to update.