pkgcore / pkgcheck

pkgcore-based QA utility for ebuild repos
https://pkgcore.github.io/pkgcheck
BSD 3-Clause "New" or "Revised" License

Possible to run `parallel "pkgcheck scan -k RedundantVersion ..."`? #655

Closed by mattst88 5 months ago

mattst88 commented 5 months ago

I have a script, clean-redundant-versions.sh, that uses pkgcheck to find and remove redundant versions of packages:

#!/bin/bash

pkgcheck scan -k RedundantVersion -R FormatReporter --format "{package}-{version}.ebuild" | xargs -I {} bash -c 'test -d {} || git rm {}'
pkgdev manifest
git diff-index --quiet HEAD -- || pkgcommit -s -m "Drop old versions" .

It works, but it's not fast.

I tried speeding it up by using GNU parallel to find redundant versions in parallel:

#!/bin/bash

set -e

maint=${1}
shift

cd "$(git rev-parse --show-toplevel)"
mapfile -t pkgs < <(git grep -l "${maint}" '*/*/metadata.xml' | cut -d/ -f1-2)

parallel "pkgcheck scan -k RedundantVersion -R FormatReporter --format '{category}/{package}/{package}-{version}.ebuild' {}" ::: "${pkgs[@]}" | xargs git rm

mapfile -t cleaned_pkgs < <(git diff-index --name-only HEAD | cut -d/ -f1-2 | sort -u)
parallel "pkgdev manifest {}" ::: "${cleaned_pkgs[@]}"
for pkg in "${cleaned_pkgs[@]}"; do
        pushd "${pkg}"
        pkgcommit -s -m "Drop old versions" .
        popd
done

When I run this, it generates tracebacks such as:

FileNotFoundError: [Errno 2] No such file or directory: '/home/mattst88/.cache/pkgcheck/repos/gentoo-ba548c5768/.update.profiles.pickle'
pkgcheck scan: error: failed dumping profiles cache: '/home/mattst88/.cache/pkgcheck/repos/gentoo-ba548c5768/profiles.pickle': No such file or directory
Exception ignored in: <function AtomicWriteFile_mixin.__del__ at 0x7f17de08f7e0>
Traceback (most recent call last):
  File "/usr/lib/python3.11/site-packages/snakeoil/fileutils.py", line 134, in __del__
    self.discard()
  File "/usr/lib/python3.11/site-packages/snakeoil/fileutils.py", line 111, in discard
    os.unlink(self._temp_fp)

The tracebacks are generated from the parallel "pkgcheck scan -k RedundantVersion ..." command.

Is it safe to run pkgcheck multiple times on the same repository at the same time? From the traceback, it appears to be generating a cache. Is it possible to generate this cache once up front, before the parallel invocation, to avoid this problem?

Other suggestions? Thanks!

arthurzam commented 5 months ago

I know of at least two places that aren't safe for parallel runs:

  1. cache regen
  2. --commits git checks (these weren't even safe to run in parallel within a single pkgcheck invocation, so I had to make those checks run sequentially)

For point 2, I'd prefer not to do anything; I don't think it's critical right now. For point 1, maybe I'll try some kind of file lock in the cache dir, similar to a mutex.
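
The file-lock idea for point 1 can be sketched from outside pkgcheck with flock(1) from util-linux. This only illustrates the mutex-like behaviour and is not how pkgcheck implements (or would implement) its locking: the `with_lock` helper and the `LOCK_FILE` path are hypothetical names introduced here. Holding the lock around a whole scan would serialize the scans, so it only makes sense around a one-off step such as warming the cache. The `pkgcheck cache --update` call in the commented usage is an assumption about how to warm the cache up front; nothing in this thread confirms that it avoids the race.

```bash
#!/bin/bash

# Hypothetical lock file; pkgcheck itself does not use this path.
LOCK_FILE="${LOCK_FILE:-${TMPDIR:-/tmp}/pkgcheck-cache.lock}"

# Run the given command while holding an exclusive lock on LOCK_FILE.
# flock(1) creates the file if needed, blocks until the lock is free,
# and exits with the status of the command it ran.
with_lock() {
    flock "${LOCK_FILE}" "$@"
}

# Usage sketch (commented out; assumes `pkgcheck cache --update`
# regenerates the caches that the parallel scans were racing on):
#   with_lock pkgcheck cache --update
#   parallel "pkgcheck scan -k RedundantVersion {}" ::: "${pkgs[@]}"
```

Because the lock is advisory, it only protects against other processes that go through the same helper.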


On another note, I'm not sure how running multiple pkgcheck instances would make it faster. pkgcheck is parallel by default and should try to use all cores. Maybe you have a config or default set somewhere for --jobs? Could you try passing --jobs ${N} and see what happens? The check itself looks simple, so the CPU-intensive part is mainly parsing and loading the package ebuilds. I think pkgcheck's current scheduler should handle that well.
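
One way to try the --jobs suggestion is to run the same scan at a few job counts and compare timings. A rough sketch, where `scan_with_jobs` and the `PKGCHECK` override variable are hypothetical names introduced here and `dev-libs/foo` is a placeholder target:

```bash
#!/bin/bash

# Command to run; overridable so the helper can be dry-run (e.g. PKGCHECK=echo).
PKGCHECK="${PKGCHECK:-pkgcheck}"

# Run a RedundantVersion scan with an explicit job count.
# $1 is the value passed to --jobs; remaining arguments are scan targets.
scan_with_jobs() {
    local jobs=$1
    shift
    "${PKGCHECK}" scan --jobs "${jobs}" -k RedundantVersion "$@"
}

# Usage sketch (commented out):
#   for n in 1 4 "$(nproc)"; do
#       time scan_with_jobs "${n}" dev-libs/foo
#   done
```

The first run may also pay for cache generation, so it is probably worth discarding that measurement before comparing the rest.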

mattst88 commented 5 months ago

In the process of trying to use GNU parallel I apparently lost my original script, so I don't know why it was so slow for me...

In any case, I've rewritten it:

#!/bin/bash

set -e

maint=${1}
shift

cd "$(git rev-parse --show-toplevel)"
mapfile -t pkgs < <(git grep -l "${maint}" '*/*/metadata.xml' | cut -d/ -f1-2)
mapfile -t redundant_ebuilds < <(pkgcheck scan -k RedundantVersion -R FormatReporter --format "{category}/{package}/{package}-{version}.ebuild" "${pkgs[@]}")
git rm "${redundant_ebuilds[@]}"
mapfile -t cleaned_pkgs < <(git diff-index --name-only HEAD | cut -d/ -f1-2 | sort -u)
pkgdev manifest "${cleaned_pkgs[@]}"

for pkg in "${cleaned_pkgs[@]}"; do
    pushd "${pkg}" &> /dev/null
    pkgcommit -s -m "Drop old versions" .
    popd &> /dev/null
done

for pkg in "${cleaned_pkgs[@]}"; do
    pushd "${pkg}" &> /dev/null
    if [[ -d files ]]; then
        printf "\nPlease check whether anything in files/ should be removed:\n\n"
        ls -1 files
        printf "\n'git rm' any unused files and run 'git commit -a --fixup \$(git last-commit-to .)'. Press CTRL+D when finished.\n"
        $SHELL
    fi
    popd &> /dev/null
done

and it's plenty fast, so I'm going to close this. Thanks, and sorry for the noise.