projectatomic / docker

Docker - the open-source application container engine
http://www.docker.com
Apache License 2.0
81 stars 58 forks source link

Fix Docker search race/panic - BZ1732626 #362

Closed TomSweeneyRedHat closed 5 years ago

TomSweeneyRedHat commented 5 years ago

Signed-off-by: TomSweeneyRedHat tsweeney@redhat.com

- What I did

Fixes the Bugzilla: Bug 1732626 - Docker panics when performing docker search due to potential Search bug when using multiple registries (https://bugzilla.redhat.com/show_bug.cgi?id=1732626) for the docker-1.13.1-rhel branch. The results variable passed to the searchTerm function was being passed by pointer. However as the function was within a threaded environment, it was becoming corrupted randomly under load. I've changed the function to pass the variable by reference, keeping the data private per thread.

Many thanks to Uli Obergfell for the excellent analysis and easy reproducer in the BZ.

- How I did it VI and a lot of sweat blood and tears.

- How to verify it With the fix in place, run the following script from Uli. It should just return periods. During a failure case, it will list a number of core files. Refer to the BZ for a failure example.

# cat ~/repro.sh
#!/bin/bash

date

kill_dd() {
    echo
    pkill -s 0 dd
    exit
}

CORESDIR=/tmp/docker-current-cores
mkdir $CORESDIR > /dev/null 2>&1

rm -f /tmp/docker-current-cores/*

trap kill_dd 2

for i in `seq 0 7`
do
    dd if=/dev/zero of=/dev/null bs=1024k &
done

while :
do
    echo -n "."
    docker search rhel7 > /dev/null 2>&1
    for f in /core.*
    do
        file $f | grep "/usr/bin/dockerd-current" > /dev/null 2>&1
        if [ $? -eq 0 ]
        then
            echo
            date
            mv $f $CORESDIR
            ls -l $CORESDIR
        fi
    done
done

A test result showing ~30 minutes of successful runtime:

# ~/repro.sh
Thu Oct  3 17:54:08 EDT 2019
.............................................................................................................................................................................................................................................................................................................................................................................................................................................

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

rhatdan commented 5 years ago

LGTM

TomSweeneyRedHat commented 5 years ago

@edsantiago reports that his test suite passes with a patch from this. He was able to duplicate the error pre-patch in about 5 minutes runtime. After patch no error after more than an hour runtime. I left my test on accidentally overnight, no cores after 10+ hours.

TomSweeneyRedHat commented 5 years ago

@vrothberg @giuseppe PTAL