Closed by vladak 4 months ago
The adjusted document check is needed to test the fix for #4317, so I added the changes here as well.
I let the following script run overnight on my laptop (Intel Core i5, 8 threads, SSD). It ran 36 times, with around 100 iterations per run. This gives me a high level of confidence that the duplicates are gone for good.
#!/bin/bash
#
# attempt to reproduce https://github.com/oracle/opengrok/issues/4317
#
# based on git-commit-hopping.sh but with added randomness
# set -x
set -e
repo_url="https://github.com/oracle/solaris-userland/"
initial_rev=32c0d9faed7b049872ca9bd78f9bf3e901cff482 # from 2022
src_root="/var/tmp/src.opengrok-issue-4317"
data_root="/var/tmp/data.opengrok-issue-4317"
# Assumes built OpenGrok.
function run_indexer()
{
echo "Indexing $1"
java -cp '/home/vkotal/opengrok-vladak-scratch/distribution/target/dist/*' \
-Djava.util.logging.config.file=/var/tmp/logging.properties-opengrok-FINEST_Console \
org.opengrok.indexer.index.Indexer \
-c /usr/local/bin/ctags \
--economical \
-H -S -P -s "$src_root" -d "$data_root" \
-W "$data_root/config.xml" \
>/var/tmp/opengrok-issue-4317-$1.log 2>&1
java -cp '/home/vkotal/opengrok-vladak-scratch/distribution/target/dist/*' \
org.opengrok.indexer.index.Indexer \
-H \
-R "$data_root/config.xml" \
--checkIndex documents \
>/var/tmp/opengrok-issue-4317-$1.check.log 2>&1
}
function get_next()
{
ids=( $(git log --oneline --reverse ..origin/master | awk '{ print $1 }') )
size="${#ids[@]}"
modulo=16
if (( size == 0 )); then
echo ""
return
fi
if (( modulo > size )); then
modulo=$size
fi
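# Pick a random index; shell arithmetic avoids forking expr.
n=$(( RANDOM % modulo ))
# Never return the first pending commit. When only one commit is left,
# index 1 is past the end, so the output is empty and the main loop ends.
if (( n == 0 )); then
n=1
fi
echo "${ids[$n]}"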
}
project_root="$src_root/solaris-userland"
if [[ ! -d $project_root ]]; then
echo "Cloning $repo_url to source root"
git clone "$repo_url" "$project_root"
fi
echo "Removing data root"
rm -rf "$data_root"
echo "Removing logs"
rm -f /var/tmp/opengrok-issue*.log
cd "$project_root"
echo "Checking out base rev $initial_rev"
git checkout -q "$initial_rev"
# Establish a common time base line.
find "$src_root/" -type f -exec touch {} \;
run_indexer $initial_rev
while true; do
rev=$(get_next)
if [[ -z $rev ]]; then
break
fi
echo "Checking out $rev"
git checkout -q $rev
# Git does not preserve/restore file time stamps so simulate a git pull.
# Ideally this should be done only for the "incoming" files, however it suffices for this use case.
find "$src_root/" -type f -exec touch {} \;
run_indexer $rev
done
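The repeated overnight runs could be driven by a small wrapper along these lines. This is only a sketch: the function name `repeat_until_failure` and the script name in the usage comment are hypothetical, and it assumes the reproduction script exits with a non-zero status when indexing or the index check fails (which `set -e` provides as long as the Indexer itself returns non-zero on a failed check).

```bash
#!/bin/bash
# Hypothetical wrapper: run a command repeatedly, stop at the first failure.

# repeat_until_failure MAX_RUNS COMMAND [ARGS...]
# Prints the number of successful runs on stdout.
# Returns non-zero as soon as one run fails.
repeat_until_failure()
{
    local max_runs=$1
    shift
    local run
    for (( run = 1; run <= max_runs; run++ )); do
        if ! "$@"; then
            echo "run $run failed" >&2
            echo $(( run - 1 ))
            return 1
        fi
    done
    echo "$max_runs"
}

# Example usage (the script name is an assumption):
# repeat_until_failure 36 ./reproduce-issue-4317.sh
```

Stopping at the first failure matters here because the reproduction script deletes the old logs at the start of every run, so the logs of a failing run would otherwise be overwritten by the next one.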
As noted in https://github.com/oracle/opengrok/issues/4317#issuecomment-1904361808, there is another way the index can be broken: live (i.e. not deleted) documents whose source file is missing. This change augments the document check to report these.
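The idea behind the additional check can be illustrated outside of the indexer with a few lines of shell. This is not how the check is implemented in OpenGrok; it is a conceptual sketch that assumes the paths of the live documents are already available one per line and are relative to the source root (e.g. `/solaris-userland/Makefile`). The function name `report_missing` is hypothetical.

```bash
#!/bin/bash
# Conceptual sketch only: report document paths that have no source file.

# report_missing SRC_ROOT
# Reads paths relative to SRC_ROOT from stdin, one per line.
# Prints each path whose file does not exist under SRC_ROOT.
# Returns 1 if at least one path was reported, 0 otherwise.
report_missing()
{
    local src_root=$1
    local path missing=0
    while IFS= read -r path; do
        if [[ ! -f "$src_root$path" ]]; then
            echo "$path"
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (the input file name is an assumption):
# report_missing /var/tmp/src.opengrok-issue-4317 < live-document-paths.txt
```

A non-zero return value makes the function usable in a script running under `set -e`, in the same way the document check stops the reproduction script.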