owid / etl

A compute graph for loading and transforming OWID's data
https://docs.owid.io/projects/etl
MIT License

Can't delete some containers -- dataset is busy #2071

Closed larsyencken closed 10 months ago

larsyencken commented 10 months ago

When copying and moving containers, it seems that containers on foundation-1 cannot be deleted.

It appears to be a snapd/lxd combination issue. We might be able to delete them if we disable lxd temporarily.

See upstream issue: https://github.com/canonical/lxd/issues/11168

Marigold commented 10 months ago

We've encountered this before. I didn't find a proper fix, but I got it working through this workaround.

larsyencken commented 10 months ago

Oh, cool! In this case, it was fixed with a reboot, but ideally you wouldn't have to.

ExpatUK commented 8 months ago

> We've encountered this before. I didn't find a proper fix, but I got it working through this workaround.

What was the workaround? The link is now dead.

Marigold commented 8 months ago

@ExpatUK the workaround is below. We collect the PIDs of all "lxc monitor" processes and run the sudo nsenter -t ... umount command against each one (as per the discussion linked in the code comment) until the container can be destroyed. It's super hacky, but it works. The solution from the discussion alone might be enough for you.

import random


def _destroy_container(host: Host, name: str):
    print(f"--- Destroying container {name} and its data")
    try:
        host.run(f"sudo lxc delete --force {name}")
    except SystemExit:
        # This is an LXC bug: https://discuss.linuxcontainers.org/t/lxc-delete-result-in-failed-to-destroy-zfs-filesystem-dataset-is-busy/5728
        # We get the following error when trying to delete a container:
        # Error: Error deleting storage volume: Failed to run: zfs destroy -r lxd/containers/staging-site-scripts-relative-url:
        # exit status 1 (cannot destroy 'lxd/containers/staging-site-scripts-relative-url': dataset is busy)
        # The workaround is to unmount the filesystem and try again.
        print("!!! Container could not be destroyed, trying workaround")
        output = host.run('pgrep -fl "lxc monitor"', capture_output=True)
        # each output line starts with the pid; keep only that and skip blank lines
        pids = [line.split()[0] for line in output.splitlines() if line.strip()]
        # try all pids in random order (perhaps there's a better way?)
        random.shuffle(pids)

        for pid in pids:
            try:
                host.run(
                    f"sudo nsenter -t {pid} -m -- umount /var/snap/lxd/common/lxd/storage-pools/lxd_zfs/containers/{name}"
                )
                # retry the delete inside the try so a failure moves on to the next pid
                host.run(f"sudo lxc delete --force {name}")
            except SystemExit:
                print(f"... Unmount or delete failed with pid {pid}, trying next pid")
                continue

            print("!!! Container destroyed")
            break
        else:
            print("!!! Container could not be destroyed")
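
On the "perhaps there's a better way?" comment in the snippet: one possible alternative to trying PIDs in random order is to look for processes whose mount namespace still contains the container's mount point, by scanning /proc/<pid>/mountinfo. The sketch below is not part of the workaround posted here. The function name pids_holding_mount and the proc_root parameter are made up for illustration, and the containers/<name> suffix is assumed from the umount path used above. It would most likely need to run as root on the host to read other processes' mountinfo.

```python
from pathlib import Path


def pids_holding_mount(name: str, proc_root: str = "/proc") -> list[int]:
    """Return PIDs whose mountinfo lists a mount point ending in containers/<name>.

    Hypothetical helper: `proc_root` exists only so the function can be
    pointed at a fake directory tree when testing.
    """
    suffix = f"/containers/{name}"
    pids = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        try:
            text = (entry / "mountinfo").read_text()
        except OSError:
            continue  # process exited, or we lack permission to read it
        for line in text.splitlines():
            fields = line.split()
            # the fifth field (index 4) of a mountinfo line is the mount point
            if len(fields) > 4 and fields[4].endswith(suffix):
                pids.append(int(entry.name))
                break
    return sorted(pids)
```

The returned PIDs could then be fed to the nsenter loop in place of the shuffled list. Many processes usually share one mount namespace, so the list may contain several PIDs that are equivalent for this purpose.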