Luke Gorrie's blog

NixOS troubleshooting with git bisect #17

Open lukego opened 7 years ago

lukego commented 7 years ago

NixOS is an amazing Linux distribution. The InfoQ article and thesis are well worth your time to read. Meanwhile, here is a new trick I discovered for debugging Linux distribution upgrades using git bisect.

I upgraded from NixOS 15.09 to 17.03 and found that the Pharo Virtual Machine had broken. Starting the VM would cause a segmentation fault within around one second. There was no obvious cause in the Pharo VM code itself: it seemed to be indirectly caused by a change in some dependency. There had been around 35,000 package updates to NixOS between those two releases, so how do you know which one is the problem?

It turns out that you can use git bisect to answer that question automatically. This is because the whole NixOS distribution is defined in a Git repository (nixpkgs), so the history of every update to every package is tracked. All I needed to do was write a script that starts the Pharo VM and checks whether it prints Segmentation fault within the first few seconds of execution. Easy, here it is:

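#!/usr/bin/env bash
# Build and install pharo-launcher from the currently checked-out nixpkgs revision.
# Exit code 125 tells git bisect to skip a revision that cannot be tested.
nix-env -j 10 -f . -iA pkgs.pharo-launcher || exit 125
# Run the VM for at most 20 seconds and look for the crash message in its output.
timeout --preserve-status 20 pharo-launcher | grep '(Segmentation fault)'
status=$?
if [ "$status" -eq 0 ]; then
    # grep found the message: report this revision as bad.
    echo "SEGFAULT"
    exit 1
else
    # No crash message seen: report this revision as good.
    echo "OK"
    exit 0
fi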
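
The check in the middle of that script can also be written as a small reusable function. This is only a sketch under a few assumptions of mine, not the script I actually ran: the name check_for_segfault is made up, the pattern is the plain text Segmentation fault, and stderr is merged into stdout in case the message is printed there.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original script): run a command for
# at most 20 seconds and report whether its output mentions a segfault.
# Returns 1 ("bad" for git bisect) if the message appears, 0 ("good") otherwise.
check_for_segfault() {
    # 2>&1 merges stderr into the pipe; grep -q only sets the exit status.
    if timeout 20 "$@" 2>&1 | grep -q 'Segmentation fault'; then
        return 1
    fi
    return 0
}

# Harmless demonstration with a command that fakes the crash message on stderr.
check_for_segfault sh -c 'echo "Segmentation fault" >&2' || echo "would mark this commit bad"
```

In a bisect script this would be called as check_for_segfault pharo-launcher after the nix-env step. Note that a command that cannot be started at all also counts as "good" here, so the install step still has to guard against that with exit 125.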

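Then once I have this script I can ask git bisect to please find the commit that introduces the segmentation fault, considering all updates to all packages in the whole NixOS universe. The first argument (master) is the known-bad revision and the second (15.09) is the known-good one: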

git bisect start master 15.09
git bisect run ./pharo-nix-bisect.sh

Finding the bad commit among 35,000 only requires around 15 tests, because git bisect performs a binary search and log2(35,000) is about 15.
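
That estimate is easy to check by counting halvings. The sketch below assumes a simple linear history in which every test cuts the remaining candidates in half and the larger half is always the one left over. Real nixpkgs history has merges, so the true number of steps can differ a little. The function name bisect_steps is my own.

```shell
#!/usr/bin/env bash
# Count how many halvings it takes to narrow n candidate commits down to one,
# assuming a linear history and that the larger half remains after each test.
bisect_steps() {
    local n=$1 steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))     # round up: worst case keeps the larger half
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

bisect_steps 35000   # prints 16
```

Sixteen is the worst case under those assumptions, since 2^15 = 32,768 is just short of 35,000.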

Result

This test ran for a few hours, testing many different versions of the whole OS including compiler toolchains, etc., and then finally pointed me in the right direction. It turns out that the problem was introduced by adding "hardening" to the default CFLAGS on NixOS, and in particular by building Pharo with -fPIC, which is not compatible with the VM. So I disabled -fPIC for the Pharo package on my nixpkgs branch, sent a pull request upstream, and went on with my day.

Truly, this feels like a small step towards "dependency heaven." Thanks, Nix!

lheckemann commented 7 years ago

I used the same technique to trace this issue down to a guile upgrade and rapidly fix it. Nix and git are both wonderful tools, and in combination they're amazing. On top of that, Nix didn't even have to rebuild weechat every time, because not all the commits affected weechat's dependencies, and it took just about 10 minutes to find the problem.