check missing source files for documents in document check and fix path ordering in history based reindex

vladak commented 5 months ago

As noted in https://github.com/oracle/opengrok/issues/4317#issuecomment-1904361808, the index can be broken in another way: live (i.e. not deleted) documents whose source file is missing. This change augments the document check to report them.
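
To illustrate the idea outside of OpenGrok, here is a minimal sketch of such a check. It is not the code from this change: the function name `report_missing_sources` and the input format (one document path per line, relative to the source root) are assumptions made for the example.

```shell
#!/bin/bash

# Hypothetical helper, not part of OpenGrok: read document paths from stdin
# (one per line, in the form "/project/dir/file") and print those that have
# no matching regular file under the given source root.
report_missing_sources()
{
    local src_root="$1"
    local path
    while IFS= read -r path; do
        # Skip empty lines in the input.
        [[ -z $path ]] && continue
        if [[ ! -f "$src_root$path" ]]; then
            echo "$path"
        fi
    done
}
```

A call such as `report_missing_sources /var/tmp/src < paths.txt` would then list the paths that are still present in the (hypothetical) listing but no longer exist on disk.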

vladak commented 5 months ago

The adjusted document check is needed to test the fix for #4317, so I added the changes here as well.

vladak commented 4 months ago

I let the following script run overnight on my laptop (Intel Core i5, 8 threads, SSD). It ran 36 times, with around 100 iterations per run. This gives me a high level of confidence that the duplicates are gone for good.

#!/bin/bash

#
# attempt to reproduce https://github.com/oracle/opengrok/issues/4317
#
# based on git-commit-hopping.sh but with added randomness

# set -x
set -e

repo_url="https://github.com/oracle/solaris-userland/"
initial_rev=32c0d9faed7b049872ca9bd78f9bf3e901cff482    # from 2022

src_root="/var/tmp/src.opengrok-issue-4317"
data_root="/var/tmp/data.opengrok-issue-4317"

# Assumes OpenGrok has already been built.
function run_indexer()
{
    echo "Indexing $1"

    # Index the source root and write the configuration to the data root.
    java -cp '/home/vkotal/opengrok-vladak-scratch/distribution/target/dist/*' \
        -Djava.util.logging.config.file=/var/tmp/logging.properties-opengrok-FINEST_Console \
        org.opengrok.indexer.index.Indexer \
        -c /usr/local/bin/ctags \
        --economical \
        -H -S -P -s "$src_root" -d "$data_root" \
        -W "$data_root/config.xml" \
        >"/var/tmp/opengrok-issue-4317-$1.log" 2>&1

    # Run the document check on the resulting index.
    java -cp '/home/vkotal/opengrok-vladak-scratch/distribution/target/dist/*' \
        org.opengrok.indexer.index.Indexer \
        -H \
        -R "$data_root/config.xml" \
        --checkIndex documents \
        >"/var/tmp/opengrok-issue-4317-$1.check.log" 2>&1
}

function get_next()
{
    ids=( $(git log --oneline --reverse ..origin/master | awk '{ print $1 }') )
    size="${#ids[@]}"
    modulo=16
    if (( size == 0 )); then
        echo ""
        return
    fi
    if (( modulo > size )); then
        modulo=$size
    fi
    n=$(( RANDOM % modulo ))
    if (( n == 0 )); then
        n=1
    fi
    echo "${ids[$n]}"
}

project_root="$src_root/solaris-userland"
if [[ ! -d $project_root ]]; then
    echo "Cloning $repo_url to source root"
    git clone "$repo_url" "$project_root"
fi

echo "Removing data root"
rm -rf "$data_root"

echo "Removing logs"
rm -f /var/tmp/opengrok-issue*.log

cd "$project_root"
echo "Checking out base rev $initial_rev"
git checkout -q "$initial_rev"
# Establish a common time baseline.
find "$src_root/" -type f -exec touch {} \;

run_indexer $initial_rev

while true; do
    rev=$(get_next)
    if [[ -z $rev ]]; then
        break
    fi
    echo "Checking out $rev"
    git checkout -q "$rev"
    # Git does not preserve/restore file time stamps so simulate a git pull.
    # Ideally this should be done only for the "incoming" files, however it suffices for this use case.
    find "$src_root/" -type f -exec touch {} \;

    run_indexer $rev
done
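
The repeated overnight runs can be driven by a small wrapper. The following is only a sketch: the function name `repeat_until_failure` is made up for the example, and it assumes the command being repeated exits with a non-zero status when something goes wrong (for the script above that depends on `set -e` and on the exit status of the indexer, which is not verified here).

```shell
#!/bin/bash

# Hypothetical wrapper: run the given command up to max_runs times and stop
# at the first run that exits with a non-zero status.
repeat_until_failure()
{
    local max_runs="$1"
    shift
    local run
    for (( run = 1; run <= max_runs; run++ )); do
        echo "Run $run of $max_runs"
        if ! "$@"; then
            echo "Run $run failed" >&2
            return 1
        fi
    done
    echo "All $max_runs runs passed"
}
```

For example, `repeat_until_failure 36 ./reproduce.sh`, where `reproduce.sh` is a placeholder name for the script above saved to a file.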

oracle / opengrok

check missing source files for documents in document check and fix path ordering in history based reindex #4535