shadowspawn / nvh

Easily install Node.js versions
MIT License

XZ compression support #8

Closed DeeDeeG closed 4 years ago

DeeDeeG commented 5 years ago

Hi again.

I recently made the pull request to add an xz compression option to n. I noticed NVH doesn't have this option at the moment.

I wondered about the "proper" way to do it, since the way it's implemented in n right now is a bit obscure. As such, I decided to open an issue rather than jump straight to a pull request this time.

The feature could be added similarly to how it was done over at tj/n:

But I also thought about the following:

Overall I figured this would be a low-pressure way to work out the "right way" to do xz compression in this repo. As a fork, there is a bit more freedom. And I suppose whatever solution is decided on here can probably be ported back to tj/n relatively easily if desired.

If you have thoughts on the preferred way to do this, I would be interested in working on and/or collaborating on an implementation.

Best regards.

DeeDeeG commented 5 years ago

I made some example implementations:

Side note: I found working with the command-line arguments section of the script pretty easy, so I'm thankful it is set up so neatly to begin with!

shadowspawn commented 5 years ago

Hi @DeeDeeG

Whether to make xz compression the default (keeping gz as a fallback)

Making xz the default is my preferred method, if it is robust enough and not too complicated!

Looking at other managers:

I do not want all the complicated version testing that nvm does; I'm thinking just the >4 test. Older downloads can always use gzip; more recent versions can use xz if available.
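As a sketch, that minimal gate could look something like this (the helper name is illustrative, not nvh's actual code, and the exact cutoff is debated later in the thread):

```bash
# Illustrative sketch only: gate xz tarballs on the node major version.
version_supports_xz() {
  local major="${1#v}"    # accept "v8.9.0" or "8.9.0"
  major="${major%%.*}"    # keep only the major version number
  [[ "${major}" -ge 4 ]]
}
```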

DeeDeeG commented 5 years ago

Thanks for the reply.

Here is an implementation with those checks:


If both of these pass, xz use is allowed. I also added NVH_USE_XZ="true" at the top, in the setup phase, which seemed like the best way to tell NVH to use xz by default.
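A minimal sketch of that shape, with hypothetical helper and variable names standing in for "those checks" (the actual branch differs):

```bash
# Sketch only: default on in the setup phase...
NVH_USE_XZ="true"

# ...then later, when choosing which tarball to fetch (names are illustrative):
if [[ -n "${NVH_USE_XZ}" ]] && can_use_xz && version_supports_xz "${version}"; then
  extension="tar.xz"
else
  extension="tar.gz"
fi
```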


My thoughts regarding this implementation:

DeeDeeG commented 5 years ago

Update about macOS:

On recent macOS, the xz program (as such) is not present on a fresh install of the system. But oddly, their gunzip utility can extract xz archives (see this StackExchange discussion), even though gunzip on other platforms cannot. (Refer to man gunzip on a Linux machine and a macOS machine to see the differences; macOS has several other non-gzip algorithms rolled in, whereas Linux does not.)

In any case...

curl [URL] | tar -Jx will work, seemingly irrespective of whether the -J flag is used.

(Oddly, and mildly off-topic/a bit of a sidebar: downloading the xz archive and running tar -Jx [path/to/archive.xz] doesn't work. But a generic tar -xf [path/to/archive.xz] does work.)

I don't think macOS tar properly knows what to make of the -J flag, or how to recognize what an xz archive is locally on disk... but it can extract one if told "this is the archive I want to extract" with the -f flag, or if piped in via stdin.

I don't 100% understand why it all behaves that way, but the behavior has been mapped out. This requires a bit of extra handling, so I will think about it and look for a good solution for the "supports xz" test. I am thinking of blanket whitelisting macOS as supporting xz, but this is probably not totally accurate? Not sure how far back macOS tar can decompress xz.
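To summarize the observed macOS behaviors as concrete commands (version and URL are placeholders):

```bash
# Works on macOS: piping into tar, with or without -J.
curl -sSL "https://nodejs.org/dist/vX.Y.Z/node-vX.Y.Z-darwin-x64.tar.xz" | tar -Jx

# Works on macOS: -f triggers format auto-detection on the local file.
tar -xf node-vX.Y.Z-darwin-x64.tar.xz

# Reported NOT to work on macOS: -J against a local path, without -f.
tar -Jx node-vX.Y.Z-darwin-x64.tar.xz
```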

Edit: Came back to add a link/source

shadowspawn commented 5 years ago

I had noticed the Mac setup when we were looking at adding the environment variable to n (https://github.com/tj/n/pull/571#issuecomment-505679227) and didn't work out the details. I have xz installed via homebrew as a dependency of something else.

You can see why I was happier initially to add xz as an env var than on by default! 🙂

I did a bit of research and found that tar mentions libarchive, and that libarchive lists support for xz from 2009. I have not tied that back to the tar shipped on Mac and Linux et al, so I can't draw any conclusions yet.

https://github.com/libarchive/libarchive/wiki/ReleaseNotes#libarchive-270

DeeDeeG commented 5 years ago

Short summary of this comment: It looks like we can enable xz on macOS 10.7 and above. The macOS version can be checked on the command line with a utility: sw_vers -productVersion


Found some info about OS X and libarchive versions. According to this: https://opensource.apple.com/

OS X 10.6.x shipped with libarchive 2.6.2. And OS X 10.7.x shipped with libarchive 2.8.3.

(2.8.3 is still used in Mojave, albeit with various minor-looking, Darwin-specific patches accumulated over the years, if I understand correctly. For example, see libarchive.plist from macOS Mojave 10.14.5 for details.)

So without being able to test a bunch of old mac computers, I'd speculate the correct cutoff is OSX 10.7 and above can use xz, 10.6 and below can't.


Implementation note: It looks like the user's macOS/OSX version can be read on the command-line in a script-friendly way: https://www.cyberciti.biz/faq/mac-osx-find-tell-operating-system-version-from-bash-prompt/

Like so: sw_vers -productVersion (sample output: 10.11.1)


I also poked around in the libarchive source history, and found references to unxz as an xz decompression provider in some cases, and to the bsdtar command (separate from GNU tar). I also found that the -J flag is documented via tar --help, despite not being in the man page (man tar).

On a mac running El Capitan, I found the following:

shadowspawn commented 5 years ago

Good digging, thanks.

I did some digging tonight and came across a few interesting links. I have not formed conclusions yet, but I am starting to wonder whether the full "properly" is too hard! Interesting that the Node project is talking about reducing the number of options.

DeeDeeG commented 5 years ago

Mini update to the update about macOS:

As this StackOverflow answer rightly points out, the xz/lzma feature of bsdtar is configurable at compile time. The feature is present back to macOS 10.7, but it is configured off until macOS 10.9.

So I'd say move the cutoff version to macOS 10.9 and above. (Wish I had an array of old macs to test this on!)
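A sketch of that macOS gate, assuming a hypothetical helper name and the 10.x version scheme current at the time:

```bash
# Hypothetical helper (not nvh's actual code): allow xz on macOS 10.9+.
can_use_xz_darwin() {
  local product_version minor
  product_version="$(sw_vers -productVersion)"   # e.g. "10.11.1"
  [[ "${product_version}" = 10.* ]] || return 1  # assumes the 10.x scheme
  minor="${product_version#10.}"
  minor="${minor%%.*}"
  [[ "${minor}" -ge 9 ]]
}
```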


I can test a bunch of old Linux releases, so once the more theoretical/research parts are done, I plan to do that (test a bunch of Linux distros against the draft PR) and confirm this works.

I don't feel that working this out for SunOS or AIX is worth it, or feasible, since as far as I can see, I can't test either. Although... I just did a quick search, and you can actually download SmartOS, which is apparently what Node really means by SunOS. So I might give that a shot.

Regarding AIX: There are no xz tarballs for AIX (e.g. this is the only AIX tarball for node v12.9.1: node-v12.9.1-aix-ppc64.tar.gz), so xz is off the table for AIX. But we can tell that the arch for AIX should always be "ppc64". That could be addressed in a separate PR.


the Node project is talking about reducing the number of options.

We can always do is_ok "${xz-url}" to make sure it exists. I actually have a WIP implementation of this, but it's rather messy. Waiting on more research before tidying it up. (Will post it as a draft if requested, though. I don't mind posting early/often for review.)

I have not formed conclusions yet, but starting to wonder whether full "properly" is too hard!

That's what I presumed when I began, but I just figured I'd give it a good try and see how far I get. I'm happy with progress so far.

DeeDeeG commented 5 years ago

Here are the work-in-progress implementations:

DeeDeeG commented 5 years ago

Short summary: Linux looks good with these changes.

(I know that's a broad claim, but I did test quite a few distros. All Linux distros that I tested use GNU tar. GNU tar uses xz from the PATH by default (unless configured to do otherwise; no distro appears to have done so). The test added to nvh for this issue, which similarly looks for xz on the PATH, had a 1:1 correlation with the given Linux distro's [GNU] tar successfully using xz to extract tarballs.)

(SmartOS is a different story, since it's not Linux, and has its own (unique?) tar.)

(As mentioned in previous comments, macOS uses bsdtar and acts a bit different compared to Linux, so I would like to test on old macOS.)


I have been able to test the script as updated for this issue. I tested it on quite a variety of Linux distros, including some older ones to check how far back xz support goes.

Here are my results:

| Operating System | nvh status | nvh status http | tar version on iso | latest tar in repos | xz installed on iso | xz available in repos | curl installed on iso | wget installed on iso |
|---|---|---|---|---|---|---|---|---|
| Debian 5.0.10 Lenny | https error ** | no rsync | GNU tar 1.20 | GNU tar 1.20 | no | no | no | yes |
| Debian 6.0.10 Squeeze | https error ** | good (gzip/xz) | GNU tar 1.23 | GNU tar 1.23 | yes | yes ("xz-utils") | no | yes |
| Ubuntu 9.10 Karmic Koala | https error ** | good (gzip) | GNU tar 1.22 | GNU tar 1.22 | no | yes ("xz-utils") * | no | yes |
| Ubuntu 10.04 Lucid Lynx | https error ** | good (gzip/xz) | GNU tar 1.22 | GNU tar 1.22 | no | yes ("xz-utils") | no | yes |
| Ubuntu 12.04.5 Precise Pangolin | good (gzip/xz) | http not needed | GNU tar 1.26 | GNU tar 1.26 | yes | yes ("xz-utils") | no | yes |
| Ubuntu 14.04.6 Trusty Tahr | good (gzip/xz) | http not needed | GNU tar 1.27.1 | GNU tar 1.27.1 | yes | yes ("xz-utils") | yes | yes |
| Fedora 11 | good (gzip) | http not needed | GNU tar 1.22 | GNU tar 1.22 | no | no | yes | no |
| Fedora 12 | good (gzip/xz) | http not needed | GNU tar 1.22 | GNU tar 1.22 | yes | yes ("xz") | yes | no |
| Red Hat Enterprise Linux 7.7 | no rsync | http not needed | GNU tar 1.26 | no main repo | yes | no main repos | yes | no |
| OpenSUSE 11.3 LiveCD | good (gzip/xz) | http not needed | GNU tar 1.23 | GNU tar 1.23 | yes | yes ("xz") | yes | yes |
| OpenSUSE Leap 15.1 | good (gzip/xz) | http not needed | GNU tar 1.30 | GNU tar 1.30 | yes | yes ("xz") | yes | yes |
| Arch Linux 2019 09 01 iso | good (gzip/xz) | http not needed | GNU tar 1.32 | GNU tar 1.32 | yes | yes ("xz") | yes | yes |
| Manjaro Architect 18.04 Stable | good (gzip/xz) | http not needed | GNU tar 1.32 | GNU tar 1.32 | yes | yes ("xz") | yes | yes |
| Gentoo 2019 09 01 iso | good (gzip/xz) | http not needed | GNU tar 1.32 | GNU tar 1.32 | yes | yes ("xz-utils") | yes | yes |
| Triton SmartOS | https/tar fail | good/workaround | ?????? | ????????????? | yes | presumably | yes | yes |

Legend:

- "nvh status" means how well the script worked on the live image, or if there was no live image, immediately after installing.
- "nvh status http" means how well nvh worked after applying workarounds for broken https. (Workarounds: use `nvh` with `--insecure`; OR edit `nvh` to have only http URLs; OR use `curl` with `--insecure` or `wget` with `--no-check-certificate`.)

Statuses:

- "https error" means tools like curl and wget fail to fetch resources from nodejs.org, due to being unable to validate the website's ssl certificates. This is not nvh's fault, and implies a system that would be unable to properly use curl/wget for https transfers. As a workaround, to download nvh from raw.githubusercontent.com (for testing), curl can be used with `--insecure`, wget can be used with `--no-check-certificate`; `nvh` itself can be run with `--insecure`.
- "http not needed" means that the script worked without using `--insecure`, and without editing the script to use "http" in URLs.
- "good (gzip)" means that the script successfully uses/falls back to gzip.
- "good (xz)" means that the script successfully uses xz if xz support is installed.
- "good (gzip/xz)" means that the script successfully uses gzip and xz, depending on which compression schemes are installed and whether an xz tarball is available for the requested node version and the user's platform. (This means the script works fully as intended.)

Notes on OSes:

- \* In Ubuntu Karmic Koala, the "xz-utils" package conflicts with the already-installed "liblzma" package, apparently depended on by "dpkg". Unlikely to be installed for fear of breaking dpkg? But xz-utils is available in the repos, and can work, if one is brave enough to uninstall conflicting packages (and break their system??).
- \** In Ubuntu Karmic Koala & Lucid Lynx, wget fails to download files over https. In Karmic, curl also fails to download files over https. In Lucid, curl, which must be manually installed from the repositories (old-releases.ubuntu.com), succeeds at downloading files over https.
- \*** In Ubuntu Precise Pangolin, wget fails to authenticate e.g. GitHub https URLs, but succeeds at authenticating nodejs.org https URLs. For the purposes of the script, this is not a problem (we only need to access nodejs.org anyway).
- RHEL Server 7.7's "minimal" installation does not include rsync, so installing node with nvh fails at the last step (the last step is to rsync from the cache folder to the "installed" folders). RHEL 7.7 works with gzip or xz up to that last (rsync) step. There are no package repositories enabled by default, and no obvious public package repos anywhere (that I could find) for RHEL, so I gave up trying to install rsync. The tj/n "develop" branch works fine with xz or gzip. Had to enable networking in RHEL manually after install, otherwise the system was unable to reach the outside internet. (RHEL Server 7.7 does not have a command-line interface accessible on the install iso; it has no "LiveCD" environment.)
- Debian Lenny works fine with gzip, but only after manually installing rsync. xz support is not available on Lenny.
- SmartOS has its own tar. It either doesn't extract files that are piped in, or I can't figure out the syntax to do so. When nvh is allowed to download the tarball, and then have tar extract it as a local file, it works. See the [man page for tar](https://smartos.org/man/1/tar) at smartos.org. Also, SmartOS has problems with https connections via wget/curl. Can use http as a workaround.


None of the operating systems/distros tested were adversely affected by the changes for this issue. (i.e. no OS had less ability to successfully download node with nvh after the changes for this issue, compared to the script as it is on the develop branch.)

Some distros failed to download node with nvh, but this was due to various things unrelated to the changes proposed for this issue. (These issues could be worked around with varying degrees of difficulty.)

For example: RHEL 7 was a bit wonky because it had no rsync installed (unrelated to xz/this issue).

SmartOS (what Node means when they host tarballs for "sunos") was another case entirely. (Technically not Linux, though.) It rolled its own unique tar that I haven't seen anywhere else, which doesn't really work with piped-in tarballs (or else I couldn't figure out the correct syntax). But SmartOS does work with a minimal patch: https://github.com/DeeDeeG/nvh/compare/develop...DeeDeeG:smartos

(Apparently SmartOS isn't intended to be used directly as a general-purpose OS; rather, you are supposed to install VMs and containers on it. But the tarballs for SmartOS are up at nodejs.org/dist, so I thought I'd test anyway.)

Would still want to test this on older macOS, in an ideal world. But I am not exactly sure if I will get the opportunity. I know you can make bootable USB installers for old macOS, so maybe I will try it in Terminal on the bootable installer. (Don't want to actually re-install macOS multiple times.)

Misc note: xz support was added to GNU tar in version 1.22, back in March 2009. Here's the changelog entry.


[Edited 8 September to correct/clarify results and explanatory notes for Karmic Koala.]

DeeDeeG commented 5 years ago

I was able to test on OS X. Here are my results:

| Operating System | nvh status | nvh status http | tar version in fresh install | latest tar in repos | liblzma.dylib present on fresh install | `xz` present in fresh install | xz available in repos | curl present on fresh install | wget present on fresh install |
|---|---|---|---|---|---|---|---|---|---|
| OS X Snow Leopard (10.6) | good (gzip) | http not needed | bsdtar 2.6.2 | no apple repos | no | no | N/A (but yes w/ brew) | yes | no |
| OS X Snow Leopard (10.6.8) | good (gzip) | http not needed | bsdtar 2.6.2 | no apple repos | no | no | N/A (but yes w/ brew) | yes | no |
| OS X Mountain Lion (10.8.5) | good (gzip) | http not needed | bsdtar 2.8.3 | no apple repos | no | no | N/A (but yes w/ brew) | yes | no |
| OS X Mavericks (10.9.5) | good (gzip/xz) | http not needed | bsdtar 2.8.3 | no apple repos | yes | no | N/A (but yes w/ brew) | yes | no |
| OS X El Capitan (10.11.6) | good (gzip/xz) | http not needed | bsdtar 2.8.3 | no apple repos | yes | no | N/A (but yes w/ brew) | yes | no |
| macOS Mojave (10.14.5) | good (gzip/xz) | http not needed | bsdtar 2.8.3 | no apple repos | yes | no | N/A (but yes w/ brew) | yes | no |

Legend:

- "liblzma.dylib" refers to the file "/usr/lib/liblzma.dylib" (a symlink to "/usr/lib/liblzma.5.dylib") being present on macOS. This presumably derives from the same "[XZ Utils](https://tukaani.org/xz/)" project as the "xz" executable does on other platforms.

Notes:

- All testing (for versions less than OS X 10.11 "El Capitan") was done in VirtualBox, on a Mac host running OS X El Capitan.

OS Notes:

- Mac OS X Snow Leopard was installed from an old installer CD on VirtualBox. Had to modify the script to download only x64 images, since my install of OS X ended up being 32-bit. After that, nvh fundamentally worked fine. (Irrespective of 32/64-bit, there was no xz support in this OS's `tar`, so only gzip worked.)
- Technically, Snow Leopard 10.6.8 was an upgrade over a fresh install of 10.6, via the "Software Update" utility.


I found that Mavericks (OS X 10.9.x) can extract xz tarballs piped into tar.

So I believe the platform checks in place for macOS are correct.


[Edited 8 September to clarify that testing was done in VirtualBox, and that nvh was tested after installing to the hard disk, rather than testing "on the iso", or in the Terminal available on the installer image.] [Updated again to add whether /usr/lib/liblzma.dylib was present on the system.] [Edited 12 September to add OS X Mountain Lion]

DeeDeeG commented 5 years ago

I have confidence in these platform checks, but I admit there is increased complexity in trying to check the platform for xz support. It is a low-impact problem if it's wrong and off-by-default, but it becomes a big deal if it's wrong and on-by-default. Other version managers seem to have decided it was worth it.

I do think it works pretty well, but would like this to have a way to turn it off: (a) just so people have control, and (b) for the (seemingly unlikely) scenario where the platform checks are wrong. I feel that the environment variable, a command-line switch, or both would be adequate end-user control, and I have already made both work in drafts posted to this issue.

DeeDeeG commented 5 years ago

This comment is mostly about whether to do checks that help us when the user has installed a non-platform-default tar... which it would be pretty easy to decide NOT to do. But if we did want to do that, here's the relevant info. (This also discusses the limits of my macOS platform check.)

For what it's worth: Right now I only anticipate the platform checks being wrong if the user has installed a non-platform-default tar that has no xz support.* I don't know why someone would do that, but... you never know? The other scenario is if Apple breaks my macOS version check, e.g. if they increment the major version ("10") of macOS, or release/switch to a new platform other than macOS, or stop shipping the `sw_vers` utility...

\* (We could check for this by making sure Linux users' `tar` is GNU tar 1.22+, and that macOS users' `tar` is bsdtar 2.8.3 or above from libarchive... Unsure if we can check whether macOS users have bsdtar linked against liblzma, but we could check if liblzma is present at `/usr/lib/liblzma.dylib`...) Again, I am really unclear why someone would go through the trouble of installing a non-platform-standard `tar`, and it feels like a lot of extra gymnastics to go out of our way to check for that. But it would be doable. I can picture what the implementation looks like and could draft it up without much difficulty.

Review of prior art... Do they check which tar is being used?

- NVS decided to check if the local `tar` supports extracting xz: [here](https://github.com/jasongin/nvs/blob/5e8a8ccc8b7b4c8b4c3e1afad7b710afc8755068/nvs.sh#L52-L59) and [here](https://github.com/jasongin/nvs/blob/5e8a8ccc8b7b4c8b4c3e1afad7b710afc8755068/nvs.sh#L165-L189) in nvs.sh. (But their check is probably wrong (false positive) for macOS 10.7-10.8; I would need an install disc for OS X Lion and Mountain Lion to confirm.)
- NVM mostly doesn't: see [here](https://github.com/nvm-sh/nvm/blob/0b5bb5ccd875d82e470568b6465d546346e37778/nvm.sh#L3572-L3604). (I note that they only platform-check by whether `xz` is installed, which isn't used by tar on macOS, and is unrelated to successfully extracting xz with macOS's default `tar`, which is bsdtar.) They do check if the platform is AIX, and if so they use `gtar` instead of `tar`: [here](https://github.com/nvm-sh/nvm/blob/0b5bb5ccd875d82e470568b6465d546346e37778/nvm.sh#L1743-L1745). Without being able to test on AIX, I can't dispute nor confirm whether this is correct. I do not attempt to address AIX in my implementations, since I can't test any of it.

shadowspawn commented 5 years ago

Some great research, and I love the Linux and macOS tables of tests.

Good work tracking down the release note for GNU tar adding xz support. Likewise, digging into the nvs and nvm checks.

shadowspawn commented 5 years ago

Not sure if you discovered it, but there is support for trying things across multiple docker containers (currently a hard-coded list in the script).

```
$ pwd
/Users/john/Documents/Sandpits/nvh/test
$ bin/for-each-container nvh lsr lts
archlinux
v10.16.3

centos
v10.16.3

centos-old
v10.16.3

ubuntu-curl
v10.16.3

ubuntu-old-curl
v10.16.3

ubuntu-wget
v10.16.3

ubuntu-old-wget
v10.16.3

Darwin
v10.16.3
```

shadowspawn commented 5 years ago

The https errors are presumably due to old certificates in the docker containers. The best fix is to update the certificates, like here:

or as you mentioned, configure curl or wget or nvh to be insecure:

nvh --insecure install lts

or change the protocol to http when defining NVH_NODE_MIRROR (I came across this approach when researching proxy support):

NVH_NODE_MIRROR=http://nodejs.org/dist nvh doctor
shadowspawn commented 5 years ago

Is xz-wip-2 still the best branch to comment on implementation?

DeeDeeG commented 5 years ago

Indeed, xz-wip-2 is the best.

I would also point out that running the xz-wip-debug-messy branch will have it narrate what it "thinks" it is doing, which I have found useful, so I have actually been testing with that; end results should be identical between the two branches. [Edit to add: I did test the xz-wip-2 branch on some, but not all, distros/OSes, just to be sure. It worked the same as debug-messy in those cases.] I made xz-wip-2 by "compiling" down the checks in debug-messy into non-redundant checks and discarding the echoes. Again, they should pass/reject a system for xz with identical results (given identical inputs).

Last note: I did not have xz on by default while testing, just to make sure I was awake and testing both with and without xz on all platforms.

DeeDeeG commented 5 years ago

On second thought: If you have interest in a command-line option for xz, such as nvh --use-xz, those are available here, and are just about trivial to implement: https://github.com/DeeDeeG/nvh/compare/xz-environment-variable...DeeDeeG:xz-commandline-argument

They are not in xz-wip-2.
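For illustration, the flag handling could sit in the argument-parsing loop roughly like this (a sketch only; `--no-use-xz` is a hypothetical spelling for the off switch, not necessarily the branch's):

```bash
# Sketch: handle the xz flags while looping over the arguments.
while [[ $# -ne 0 ]]; do
  case "$1" in
    --use-xz) NVH_USE_XZ="true" ;;
    --no-use-xz) NVH_USE_XZ="" ;;
    # ... other flags handled here ...
  esac
  shift
done
```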


Decided questions:

Other Questions:

Research question:

DeeDeeG commented 5 years ago

Not sure if you discovered it, but there is support for trying things across multiple docker containers

I noticed this after doing much of the testing with LiveCDs in VirtualBox. It looks really convenient. (Probably a good candidate for CI such as Travis CI, if you wanted to run it on all Pull Requests, but I would equally understand leaving it to be run manually. Personal preferences and all.)

I had tried doing run-all-tests, but it didn't work for me at first with docker, so I carried on testing on Linux LiveCDs in VirtualBox.

(I realize now my docker setup requires running with sudo, e.g. sudo docker-compose, so that's why it wasn't working. I can edit the test runner scripts to use sudo docker-compose on my machine, and then it works.)

For the tests themselves, I am happy to report they all passed on xz-wip-2 equivalently to master branch.


Some tests didn't work for me outside of docker, so I wrote up the details of that below (in an expandable section of this comment).

1) The script sets `NVH_PREFIX` as follows:

   `export NVH_PREFIX="${TMPDIR}/nvh/test/run-which"`

   Whereas `$TMPDIR` is turning out to be undefined or empty. Outside of docker, I get "permission denied" errors for making a `/nvh` directory (sudo required). Understandable, since it (inadvertently) tries to make a directory directly under root (`/`). (All docker commands run as root, so this is fine under docker. It just probably installs things to `/nvh`.)

   ```
   ✗ setupAll for run/which/exec # (2 installs)
     (from function `setup' in test file tests/run-which.bats, line 13)
     `nvh install --preserve 4.9.1' failed
   installing : node/v4.9.1
   mkdir : /nvh/test/run-which/nvh/versions/node/v4.9.1
   mkdir: cannot create directory ‘/nvh’: Permission denied
   Error: sudo required (or change ownership, or define NVH_PREFIX)
   ```

2) `TMPDIR` is undefined or empty. I don't think `TMPDIR` is defined anywhere. When I do `git grep -i TMPDIR` in the root dir of this repo, I just get the one line from `test/tests/run-which.bats`. Perhaps it should be `TMP_PREFIX_DIR`, which is used in some of the other test files?

3) Need to run `setup_tmp_prefix` so `TMP_PREFIX_DIR` is defined (and a corresponding temporary directory is created). When I ran `git grep TMP`, it appeared the variable used in other tests is `TMP_PREFIX_DIR`, rather than `TMPDIR`. This appears to be set in those tests by calling `setup_tmp_prefix`, which seems to be from `test/tests/shared-functions.bash`. So I added that to the `run-which.bats` test.

4) Install steps were being (conditionally) skipped:

   ```
   if [[ "${BATS_TEST_NUMBER}" -eq 1 ]] ; then
     # Using --preserve to speed install, as only care about the cached versions.
     nvh install --preserve 4.9.1
     nvh install --preserve lts
   fi
   ```

   Apparently, and perhaps because this is not the first test file being run, these installs are being conditionally skipped. I commented out the `if` and `fi` to make these installs unconditional.


I came up with this patch to make things work outside of docker (Ubuntu Linux in my case): https://github.com/DeeDeeG/nvh/compare/develop...DeeDeeG:test-patch-for-linux-outside-of-docker Unclear if this defeats some optimization for use in docker, but it doesn't break the tests in docker, at least.

DeeDeeG commented 5 years ago

Extra nerdy research for Linux that I refrained from posting thus far (in favor of real-world tests, but this is how I knew which OS versions to test to get interesting results):

DeeDeeG commented 5 years ago

For ease of review, I prepared some updated implementations. They differ in how many ways the end-user can control which compression algorithm is used.

(Edit to add: Commits are separated neatly, so any of these features can easily be included in a final PR. Neither of these two extremes necessarily needs to be the final form of this feature.)

DeeDeeG commented 5 years ago

I was able to test on OS X Mountain Lion (10.8.5). I can confirm it does not support extracting xz tarballs. (This has been added to the results chart above.) That supports a macOS cutoff of 10.9 or above, as is present in my existing implementations for this issue.


For thoroughness' sake: Mountain Lion does not support extracting xz-compressed tarballs locally, via tar xf [tarball].tar.xz, nor with gunzip, in contrast with later releases that generally do support xz decompression in these ways. There is no sneaky or hidden xz/lzma support as far as I could tell.

Further details: OS X Mountain Lion does not have /usr/lib/liblzma.dylib, nor /usr/lib/liblzma.5.dylib. This means that the presence of liblzma.dylib has a 1:1 correlation with successfully extracting xz tarballs with system-native tar on macOS. That could be a more low-level, down-to-the-features test than testing against the macOS version. But it gates the same versions of macOS (assuming no-one uses a non-platform-native tar?), so both approaches are basically equivalent. (Furthermore, xz support is a compile-time feature of bsdtar, so if users were to download liblzma.dylib and plunk it in /usr/lib, system-native tar on those versions of macOS still would not be able to extract xz tarballs.) The only way to extract xz with tar on old macOS appears to be to install a newer tar, plus either liblzma or xz as appropriate.


I suppose we could test for this:

([tar is GNU tar 1.22+] && [system has `xz` on the path]) || ([tar is bsdtar 2.7+] && [the file /usr/lib/liblzma.dylib exists])
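A sketch of that compound test in bash (the helper name is illustrative; the numeric version comparisons from the pseudocode are omitted to keep the sketch short):

```bash
# Hypothetical sketch: approve xz when tar looks like GNU tar with xz on
# the PATH, or like bsdtar with liblzma present. The version-number checks
# (GNU tar 1.22+, bsdtar 2.7+) are left out for brevity.
tar_supports_xz() {
  local tar_banner
  tar_banner="$(tar --version 2>/dev/null | head -n 1)"
  case "${tar_banner}" in
    *"GNU tar"*) command -v xz >/dev/null 2>&1 ;;
    *bsdtar*)    [[ -e /usr/lib/liblzma.dylib ]] ;;
    *)           return 1 ;;
  esac
}
```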

But again, I suspect this will gate the same systems, and it should be roughly equivalent. This might make it easier to support platforms other than Linux/macOS/SmartOS/AIX, but the tarballs are only prebuilt for those platforms anyway. So IMO being more obvious that we are targeting certain distros/releases of Linux and macOS makes the code more readable and perhaps more to-the-point. I like a principle of code-what-you-mean, so that the code is more readable, and to discourage expansive scope creep of the given project. Just my personal thoughts on this.

And FWIW, neatly version-testing tar would take several more lines in can_use_xz(), as far as I can picture it.


Edit: here's an implementation with tar version checks instead of OS checks: https://github.com/DeeDeeG/nvh/compare/develop...DeeDeeG:xz-with-tar-version-check

Not really OS-neutral for macOS, since I don't believe other platforms ship libraries as .dylibs. The GNU tar check is, in theory, OS-neutral and generally platform-neutral. But again, GNU tar tends to mean Linux, and Linux tends to mean GNU tar, so...

I do note that the homebrew repos offer both GNU tar and xz, so someone could go out of their way to get GNU tar and xz, and make that the default system tar, but that is about the only scenario where this would help.

shadowspawn commented 5 years ago

(I have been busy on another project. I'll try and take a look at this soon. Thanks for all the investigations!)

shadowspawn commented 5 years ago

TMPDIR. Oops, that is a bug!

shadowspawn commented 5 years ago

Implementation feedback

Yes, also leaning towards auto-detection for xz rather than just manual. Yes, also prefer OS check over tar check.

I think an env override is worthwhile. I don't feel a command line switch is high value, but the implementation is tidy, and more findable when people hit problems, so I'm ok with keeping that too.

I do not want a short flag of -xz. Short flags are technically a single character, and in standard command line processing -xz expands to -x -z, like when you do ls -al.

NVH_USE_XZ is currently true/blank/undefined. Rather than blank I would prefer "0" or "false" (pick one, not suggesting we support both), and then forcing on would be defined-and-non-zero or defined-and-non-false respectively. I do not have a feel for whether 0 or false is more common. I have a slight preference for false because it is more readable, but NO_COLOR and CLICOLOR both use zero, and false is actually a string and not a boolean as such. Perhaps 0 is better? Edit: leaning towards "0" for forcing off rather than "false". (Partly because I was originally confused when working with the n code, as I assumed false was a command or a built-in rather than just a string!)

Rather than testing for the xz url using is_ok, I would prefer to test the target node version to avoid the network call. I feel a bit guilty for suggesting performance over robustness, but want to minimise the network calls where reasonable.

Ask questions if any of these seem dubious or unclear; I might have misunderstood. :-)

DeeDeeG commented 5 years ago

Responding to feedback, then will work on an implementation:

Yes, also leaning towards auto-detection for xz rather than just manual. Yes, also prefer OS check over tar check.

I think an env override is worthwhile. I don't feel a command line switch is high value but the implementation is tidy, and more findable when hit problems, so ok with keeping that too.

I do not want short flag of -xz. [ . . . ]

:heavy_check_mark: Agreed on these points. Thank you.

NVH_USE_XZ is currently true/blank/undefined. Rather than blank I would prefer "0" or "false" [ . . . ]

Okay. This is certainly doable, if a little less compact/neat in the setup phase. Will try a proof-of-concept for this in my next iteration. I wrote up a bunch below on why the alternatives are more complicated than parameter expansion, but this is all theory, and it's probably not a huge deal in practice. (I think we need to use if AND test, to check truthiness and non-null, respectively.)

Lots of thoughts about truthiness/non-null in bash, and what to do about it in nvh:

Aside (some background info) about `test` and `[`: `test`, aka `[`, is a little weird. (It is an actual command, not a bash built-in. See `man [` and `man test`, which are aliases for each other and share a man entry.) Perhaps un-intuitively, the command `[ -z "string" ]` (or `test -z "string"`) only tests for zero-*length* of a string. It does not actually test for logical zero, nor logical truthiness/falsiness in general. Why? Because `test` never checks truthiness in the expected way; it treats all input as text strings. The following are all non-null (non-empty strings), so `test` evaluates them to true:

- `[ true ]`
- `[ false ]`
- `[ 0 ]`
- `[ null ]`

The only thing `test` evaluates to "false" is the empty argument: `[ ]`

Other than parameter expansion without the colon, e.g. `"${var-true}"`, I don't know of a way to check (with a single command/builtin) for three states of a single variable (unset, set-null, set-nonempty) in bash. We can chain `if` and `test`, though. We can do `if "${var}"` to evaluate truthiness/falsiness. So we could do `if ! "${var}" && [ -n "${var}" ]` to check logical falsiness (evaluates `0` as false) and non-zero length (distinguishes between null and the string "0").

Aside: why recognizing three states is important. For this variable, we do need to preserve the three states:

- "set and something falsy (currently null)" is requesting xz off.
- "set and true-ish" (currently just any non-blank string) is requesting it on (pending platform checks).
- "unset" means we register no preference from the user.

IMO having three states is important, in case someone modifies the script to have xz off by default. The environment variable and/or command-line switch should still both do what they say on the tin. (If this is not important, we can simplify, by setting xz on by default, and no-op for requests to "turn on xz".)

I find the parameter expansion that handles "set-null/set-true/unset" is the most compact way to present these three options to the later parts of the script. This is also backward-compatible with the existing implementation in `n`. The `n` implementation is currently like this:

- empty or unset is requesting xz off. (However, xz is off by default, so the brief "set xz" code paths basically no-op in this case, leaving the default gzip implementation to do its thing. We can back-port "0 means force off" to n if you like.)
- "set and true-ish" (just checks that `N_USE_XZ` is non-blank) is requesting/forcing it on.

So yeah, IMO as long as n and nvh eventually match, I am on board. I will see how tidy I can make an "unset/0/true" branching logic, rather than "unset/set-null/set-true".


Rather than testing for xz url using is_ok, I would prefer to test target node version to avoid the network call. I feel a bit guilty for suggesting performance over robustness, but want to minimize the network calls where reasonable.

This is a subjective one and I leave it as your call in the end. Cost-benefitting this out, from my perspective:

I would propose this as a potential solution:

This proposal doesn't sacrifice any robustness, and moves the dual network request onto a rarer code path. (Still easy to test, e.g. by attempting to download an older version of node, such as 0.8.)

Related, but a mild tangent: If version checking rather than using is_ok for the xz tarball URL, I would like to version check for version "3" and up, since io.js had xz on both Linux and macOS by that version. (io.js forked node.js and immediately incremented the version to 1.0, then node.js took back over and immediately incremented the version to 4.0, so the version scheme and development of features are relatively continuous and unidirectional, despite the forking.)

DeeDeeG commented 5 years ago

Short summary: For determining boolean truth from a variable's value, bash can only distinguish between "unset, set-but-null, and set-and-not-null". Beyond that, we have the exit status of commands, or string comparison. That's about it. Handling "0" or "false" as environment variable values would depend on string comparison.


I am shocked to find there is no obvious way of evaluating boolean "truth" in a traditional programming-language-analogous way in bash. Only whether a variable is unset, set but empty/null (zero-length), or set and non-empty (non-zero-length). And the only way to test for an unset variable involves parameter expansion of the form: ${}.
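For instance, only the `+` and `:+` expansions distinguish the three states (plain bash, nothing nvh-specific):

```bash
unset NVH_USE_XZ
echo "[${NVH_USE_XZ+set}] [${NVH_USE_XZ:+nonnull}]"   # prints: [] []
NVH_USE_XZ=""
echo "[${NVH_USE_XZ+set}] [${NVH_USE_XZ:+nonnull}]"   # prints: [set] []
NVH_USE_XZ="true"
echo "[${NVH_USE_XZ+set}] [${NVH_USE_XZ:+nonnull}]"   # prints: [set] [nonnull]
```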

I was specifically mistaken about using if to evaluate truth values; I had been setting NVH_USE_XZ to "true" or "false" or "0", leading to this: if "${NVH_USE_XZ}" (which expands NVH_USE_XZ, runs its value as a command (!!!), and checks the exit status).

This is clearly bad, since we will run any command contained in the value of NVH_USE_XZ from the environment. Basically, arbitrary code execution.

See this for the full extent of the shenanigans that is "true" and "false" in bash: https://superuser.com/questions/1400335/how-does-bash-test-false
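A contrived illustration of the hazard (plain bash; whatever command name is in the variable gets looked up and run):

```bash
NVH_USE_XZ="true"
if "${NVH_USE_XZ}"; then echo "ran the 'true' builtin"; fi

NVH_USE_XZ="uptime"   # any command name from the environment would run here
if "${NVH_USE_XZ}"; then echo "...and 'uptime' just executed"; fi
```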


I have opted instead to do a case construction in my first-round-of-feedback, proof-of-concept implementation.

```bash
# Use xz compression by default for tarball downloads. (If NVH_USE_XZ is unset, set it to "true".)
# Based on this: https://stackoverflow.com/a/13864829
[ -z "${NVH_USE_XZ+x}" ] && NVH_USE_XZ="true"

# Sanitize "false", "0" and "" to the empty string. Sanitize other values to "true".
case "${NVH_USE_XZ}" in
  "0" | "false" | "")
    NVH_USE_XZ=""
    ;;
  *)
    NVH_USE_XZ="true"
    ;;
esac
```

This is the only reasonable way to handle "false" and "0" in my opinion, since otherwise we have to do string comparisons for those values throughout the script. It makes sense to normalize to something, and normalizing all "false" values to the empty string means we can do [ -n "${NVH_USE_XZ}" ], rather than full-fat string comparisons.

(The other option I can think of (see initial implementation) is to "not specially handle" false and 0, and treat them just as non-empty strings, i.e. normalize them to true. Then we would have to advise users that setting NVH_USE_XZ to be non-empty means "use xz", setting it empty means "don't use xz." and if they don't set NVH_USE_XZ, we just do the default.)

DeeDeeG commented 5 years ago

I made some updated implementations after considering the first round of feedback. They differ on how to update the "does an xz tarball exist?" check. (Various additional decisions were made to respond to feedback, as laid out in the two preceding comments.)

Open to more feedback, especially based on these or the previous pr-candidate[. . .] branches.

shadowspawn commented 5 years ago

https://github.com/shadowspawn/nvh/issues/8#issuecomment-532351953

I went through similar boolean learning (and surprise), and decided to normalise to "true"/"false" for easy reading. Full-fat string comparisons. 😄

Have a look at USE_COLOR and install_writeable and GET_OPTIONS_SHOW_PROGRESS

I'll have a look at the updated implementations.

DeeDeeG commented 5 years ago

Example implementation with true/false for NVH_USE_XZ: https://github.com/DeeDeeG/nvh/compare/develop...DeeDeeG:xz-pr-candidate-with-feedback-true-false


I'm inclined to go with "true/false" for this feature at the moment, if only to match the rest of the script's internals, but I suppose "1/0" would also be okay. (The internals are a code style/developer consideration, but what we expect users to set their environment variables to is also a UI/UX consideration.)

Might not need the case for normalizing input now, since we can tell users to set the variable to "true/false" (or "1/0"), and other values would simply not be supported. (This would be different from the way it is in tj/n now, though, where "null" is implied to be meaningful. Maybe the existing implementation there should be deprecated/replaced with one that matches what is decided on here? I doubt if anyone has started using it there yet?)


Aside about performance of "full-fat string comparisons":

As for "full-fat string comparisons"... For some reason it's been drilled into my head that "string comparisons are bad" because they're supposedly "very slow." I tried changing `NVH_USE_XZ` to be "true/false" or "1/0" and it didn't seem slower to me than "empty/not-empty", really. Maybe it's slower if you do it thousands of times in a loop, but this isn't thousands of times in a loop. ~~And I haven't even tested that.~~ Just tested it: 10,000 full string comparisons (`[ "${var}" = "true" ]`) and 10,000 string-non-null checks (`[ -n "${var}" ]`) both take about 200 milliseconds, give or take 20 milliseconds. Integer comparison, checking the variable against "0", seemed consistently a bit slower??? But even so, the performance impact is quite negligible.

Conclusion: It seems like the people who said string comparisons are slow were wrong, at least for bash. And I suspect the advice is outdated, as CPUs have gone from MHz to GHz... a few more instructions doesn't mean a lot. (And it's not like we're compiling down to assembly/machine code, where these optimizations are more likely to manifest, I guess? Not a big compiled-language person at the moment.)

shadowspawn commented 5 years ago

I had something like this in mind. Note: this code fragment does not pass the target version to can_use_xz, as I wrote it while looking at one of your branches without a version check!

```bash
# User may set NVH_USE_XZ to 0 to disable, or set to anything else to enable, or leave undefined for automatic.
# Normalise to true/false.
if [[ "${NVH_USE_XZ}" = "0" ]]; then
  NVH_USE_XZ="false"
elif [[ -n "${NVH_USE_XZ+defined}" ]]; then
  NVH_USE_XZ="true"
elif can_use_xz; then
  NVH_USE_XZ="true"
else
  NVH_USE_XZ="false"
fi

# May be overridden by command line flags.
```
shadowspawn commented 5 years ago

Style: I am using [[ x ]] rather than [ x ] or test x, per:

https://google.github.io/styleguide/shell.xml?showone=Test,_%5B_and_%5B%5B#Test,_%5B_and_%5B%5B

shadowspawn commented 5 years ago

I think you have all the pieces I would want, scattered among the branches! Should I try and sort out the pieces so you can put together a PR?

DeeDeeG commented 5 years ago

Sure, that sounds good. It's getting into the smaller details now where I think a PR makes sense.

DeeDeeG commented 5 years ago

Re: this comment (which suggests using a single if-else[...] statement to handle NVH_USE_XZ, with "xz off, xz on, and auto xz" as the three code paths)... Looks pretty reasonable.


Unclear if the "xz on" code path should "force" xz on, i.e. skip the minimum node version check (or xz URL is_ok check). (As opposed to the "Auto" code path, which could check more rigorously that xz works before settling on using xz.)

Pros:

Con:

Related: If we don't distinguish between "forced on" and "auto on" later in the script, the if-else statement could potentially be simplified a bit. (But I suppose it is a little less fork-friendly to remove the distinction, since that would mean it's a tiny bit more work to turn xz off by default in a fork. (Is fork-friendliness a priority? Do people fork nvh/n often?))

DeeDeeG commented 5 years ago

Re: https://github.com/shadowspawn/nvh/issues/8#issuecomment-533034644 again: I just noticed we can't use the function can_use_xz() in the setup phase, at least without refactoring, because the setup phase precedes all the function definitions.
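The ordering problem in miniature (plain bash; bash resolves function names at call time, so a setup-phase call precedes the definition):

```bash
#!/usr/bin/env bash
# Fails: the setup phase runs before the function exists.
if can_use_xz; then NVH_USE_XZ="true"; fi   # error: can_use_xz: command not found

# ...much later, where the script defines its functions
# (the body here is an illustrative stand-in, not nvh's actual check)...
can_use_xz() { command -v xz >/dev/null 2>&1; }
```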

Working on putting together a PR.

shadowspawn commented 5 years ago

Unclear if the "xz on" code path should "force" xz on, i.e. skip the minimum node version check

Interesting question, and I have been thinking about that. I started out thinking that command line should override all checks, but have changed my mind.

I think it is reasonable that the version check is independent of the auto check. The auto check is looking at the system support for xz, and can be overridden by the user due to preference, or to incorrect or incomplete detection. The version check is a limitation in the availability of that archive format. (While in theory there could be mirrors with different combinations of archive formats, that is only worth revisiting if an actual use case turns up.)

shadowspawn commented 5 years ago

Makes nvh easier to use with forks/mirrors of node with custom version schemes?

Not a consideration at this time.

On a related note, there was some support in n for using it with other "products" by defining environment variables. I never found an actual use of this, and removed the support in nvh early on; it is now gone from n too.

shadowspawn commented 5 years ago

(Is fork-friendliness a priority? Do people fork nvh/n often?)

Simple answer is no. I don't see reports from people having trouble integrating with forks, including after the recent major changes.

Focus on making the code simple and robust for our use case. 😄

DeeDeeG commented 5 years ago

On a related note, there was some support in n for using it with other "products" by defining environment variables. I never found an actual use of this, and removed the support in nvh early on; it is now gone from n too.

In other words, trying to get io.js doesn't really work anymore? (because the "product name" would be "iojs" rather than "node" or something like that?)

I am trying setting iojs.org as the custom mirror at the moment, but it fails with this: Error: download preflight failed for v2.5.0 (https://iojs.org/dist/v2.5.0/node-v2.5.0-linux-x64.tar.gz)

whereas the proper url would have v[num]/iojs-v[num]-[os]-[arch].[ext]
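Side by side, the URL nvh constructs versus the pattern iojs.org actually serves (per the error message and pattern above):

```bash
# constructed by nvh:  https://iojs.org/dist/v2.5.0/node-v2.5.0-linux-x64.tar.gz
# actually served:     https://iojs.org/dist/v2.5.0/iojs-v2.5.0-linux-x64.tar.gz
```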

shadowspawn commented 5 years ago

A musing, rather than a suggestion: if the internal and external formats for a variable are different, then it may be appropriate to use a different name. For example, if externally NVH_USE_XZ is undefined/0/defined and internally we want a variable which is true/false, then it might be cleaner to use a different variable name.

It is convenient to reuse when supplying a default and doing minor normalising, like with NVH_NODE_MIRROR.

shadowspawn commented 5 years ago

In other words, trying to get io.js doesn't really work anymore? (because the "product name" would be "iojs" rather than "node" or something like that?)

It was more a general comment than about io in particular. There used to be explicit support for io, with variations on a number of the routines, separate from the generic "product" support. The io support got removed too, as an extra complication that had become obsolete.

shadowspawn commented 4 years ago

Preference for xz downloads, if detected, released in nvh 9.0.0

Thank you for your contributions.