richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Panic error when running siegfried #172

Closed: hmiguim closed this issue 2 years ago

hmiguim commented 2 years ago

Siegfried version: 1.9.1 in server mode

siegfried-stderr.log:

2021/12/03 22:12:13 Starting server at localhost:5138. Use CTRL-C to quit.
panic: sync: negative WaitGroup counter

goroutine 401476 [running]:
sync.(*WaitGroup).Add(0xc0069c8690, 0xffffffffffffffff)
        /home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
        /home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:99
main.identifyFile.func1(0xc005ec6500, 0xc00033f4a0, 0xc005d42740)
        /home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:205 +0x75
created by main.identifyFile
        /home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:202 +0xe5
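
For context, "sync: negative WaitGroup counter" is the generic panic Go's sync.WaitGroup raises whenever Done is called more times than Add, pushing the internal counter below zero. A minimal, standalone sketch (illustrative only, not siegfried's code) that reproduces the same message:

package main

import "sync"

func main() {
    var wg sync.WaitGroup
    wg.Add(1)
    wg.Done() // counter returns to zero
    wg.Done() // one Done too many: panic: sync: negative WaitGroup counter
}
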
richardlehane commented 2 years ago

Thanks for reporting this, Miguel. I'll investigate.

luis100 commented 2 years ago

@richardlehane any updates on this? Currently, to work around this issue, we've downgraded Siegfried to 1.7.13, where the issue doesn't seem to affect us.

ross-spencer commented 2 years ago

Hi @luis100 @hmiguim, I am wondering if you have any more information about what's triggering the error? I have reinstalled 1.9.1 on Ubuntu 21.04 with Go 1.16.4, and I am able to return results through GET and POST requests on the command line and through the web UI.

Before running the server, I also ran sf -update to make sure the signature file was up to date.

Example GET:
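
For illustration only (not necessarily the exact request used here), a GET of this shape can be made in Go, assuming a server started with sf -serve localhost:5138 and the /identify/<escaped path>?base64=false&format=json form that appears later in this thread; the file path is a placeholder:

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
)

func main() {
    // Placeholder file path; it is percent-escaped into a single URL
    // segment, matching the base64=false form of the identify endpoint.
    target := "http://localhost:5138/identify/" +
        url.PathEscape("/tmp/example.pdf") + "?base64=false&format=json"

    resp, err := http.Get(target)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(resp.Status)  // expect "200 OK" on success
    fmt.Println(string(body)) // JSON identification result
}
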

Example POST:

Perhaps your command, operating environment, or build is having an impact here?

NB. It looks like the error begins here; once we have more info, we can step through it and see what happens.

NB++. I have a Windows installation available running 1.9.1 from the releases. I can also run the following in the browser (I don't have access to cURL on this machine to try that):

luis100 commented 2 years ago

We are using a Docker image based on Debian, where we install siegfried and run it in server mode.

docker run -it --rm keeps/roda:v4.1.1 bash
# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 11 (bullseye)
Release:    11
Codename:   bullseye

# sf -version
siegfried 1.9.1
/usr/share/siegfried/default.sig (2020-10-06T19:13:40+02:00)
identifiers: 
  - pronom: DROID_SignatureFile_V97.xml; container-signature-20201001.xml
config: 
  - multi: 32

# supervisorctl status siegfried
siegfried                        RUNNING   pid 696055, uptime 0:41:37

# ps aux | grep sf
root      696055  8.7  1.1 1886728 187856 ?      Sl   14:03   3:39 sf -serve localhost:5138

The issue only occurs when we run many threads against the siegfried server using the RODA web application. I tried to reproduce the issue using command-line tools, but could not make it fail in the same way. We use base64 encoding of the path, but I tried both encodings just in case.

# apt-get update; apt-get install -y parallel libany-uri-escape-perl
# find / -type f -not -path '/proc/*' -not -path '/sys/*' -not -empty | tee /tmp/filelist | wc -l
find: ‘/proc/10/map_files’: Permission denied
19195

# parallel --progress --linebuffer 'wget -q  --server-response -O /dev/null "http://localhost:5138/identify/{= use URI::Escape; $_=uri_escape($_) =}?base64=false&format=json" 2>&1 | awk "/^  HTTP/{print $2}" ' :::: /tmp/filelist | sort | uniq -c

Computers / CPU cores / Max jobs to run
1:local / 8 / 8

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/19195/100%/0.0s 
  19195   HTTP/1.1 200 OK

# cat /tmp/filelist | parallel --progress --linebuffer 'echo "http://localhost:5138/identify/{= use MIME::Base64; $_=encode_base64($_, ""); =}?base64=true&format=json" | sed s/\\\\//g | xargs -I% wget -q  --server-response -O /dev/null "%" 2>&1 | awk "/^  HTTP/{print $2}"' | sort | uniq -c

Computers / CPU cores / Max jobs to run
1:local / 8 / 8

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/19195/100%/0.0s 
  19195   HTTP/1.1 200 OK

# cat /var/log/supervisor/siegfried-stderr-*.log
No relevant errors

But the error occurs consistently on production servers when using siegfried 1.9.1. There is also some correlation between the panics and a higher number of scans of empty files, but I am not sure whether that is relevant.

# grep panic -A 10  /var/log/supervisor/siegfried-stderr---supervisor-70990jtu.log 
panic: sync: negative WaitGroup counter

goroutine 20 [running]:
sync.(*WaitGroup).Add(0xc005f55ac0, 0xffffffffffffffff)
    /home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
    /home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:99
main.printer(0xc00088d440, 0xc00010f7c0)
    /home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:168 +0x145
created by main.main
    /home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:423 +0xceb
--
panic: sync: negative WaitGroup counter

goroutine 5 [running]:
sync.(*WaitGroup).Add(0xc0058859b0, 0xffffffffffffffff)
    /home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
    /home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:99
main.printer(0xc00080d860, 0xc0001017c0)
    /home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:168 +0x145
created by main.main
    /home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:423 +0xceb
--
panic: sync: negative WaitGroup counter

goroutine 5 [running]:
sync.(*WaitGroup).Add(0xc005f7f970, 0xffffffffffffffff)
    /home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
    /home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:99
main.printer(0xc00018b4a0, 0xc0001e0f00)
    /home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:168 +0x145
created by main.main
    /home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:423 +0xceb
--
panic: sync: negative WaitGroup counter

goroutine 8 [running]:
sync.(*WaitGroup).Add(0xc0063ff600, 0xffffffffffffffff)
    /home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
    /home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:99
main.printer(0xc000830c00, 0xc000101770)
    /home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:168 +0x145
created by main.main
    /home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:423 +0xceb

# cat /var/log/supervisor/siegfried-stderr-*.log | grep "empty source" | wc -l
1065

We will continue trying to reproduce the issue in a way that we can provide to you. Thank you.

luis100 commented 2 years ago

Expanding testing to a production server with a much larger amount of files, errors start to creep in. There are some other failures in the log, all complaining about empty files, and there are some 404 errors that come from files with a '~' character in their names (not sure why), but many failures are due to panic errors. On a panic error, the service shuts down, but supervisord restarts it.

The panic errors come in pairs, or tuples, and this seems to be an improbable race condition.

# find /roda/data/storage/aip/ -type f | parallel --progress --linebuffer 'wget -q  --server-response -O /dev/null "http://localhost:5138/identify/{= use URI::Escape; $_=uri_escape($_) =}?base64=false&format=json" 2>&1 | awk "/^  HTTP/{print $2}"' | sort | uniq -c

Computers / CPU cores / Max jobs to run
1:local / 4 / 4

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/704393/100%/0.0s 
 702492   HTTP/1.1 200 OK
   1901   HTTP/1.1 404 Not Found

# grep panic   /var/log/supervisor/siegfried-stderr---supervisor-70990jtu.log | wc -l
35

# grep panic -A 10  /var/log/supervisor/siegfried-stderr---supervisor-70990jtu.log
panic: sync: negative WaitGroup counter

goroutine 20 [running]:
sync.(*WaitGroup).Add(0xc005f55ac0, 0xffffffffffffffff)
    /home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
    /home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:99
main.printer(0xc00088d440, 0xc00010f7c0)
    /home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:168 +0x145
created by main.main
    /home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:423 +0xceb
--
panic: sync: negative WaitGroup counter

goroutine 5 [running]:
sync.(*WaitGroup).Add(0xc0058859b0, 0xffffffffffffffff)
    /home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
    /home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:99
main.printer(0xc00080d860, 0xc0001017c0)
    /home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:168 +0x145
created by main.main
    /home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:423 +0xceb
--
richardlehane commented 2 years ago

Sorry for my delay in resolving this; it is quite a tricky one!

I have made some progress:

Hopefully good news soon :)

richardlehane commented 2 years ago

I've found a quick fix for you: rather than running with -multi 32, run your server with -multi 1, i.e. sf -multi 1 -serve localhost:8080. You can make this the default with sf -multi 1 -setconf. I found that in this mode I wasn't able to crash the server with my script, even with a large corpus.

Changing this setting should only impact performance if you are giving siegfried directories to scan. If each of your HTTP requests sends a single filename, then it shouldn't make any difference to the speed, and you get concurrency anyway because you can send multiple requests at a time.

luis100 commented 2 years ago

Thank you for your update. We will try the quick fix for now; let us know when there's a final solution to the issue.

richardlehane commented 2 years ago

Over the weekend I found and fixed a race condition in the code that, in my testing, has resolved this bug. Please re-open if you see it again.

You can access the fix in the latest release (v1.9.2). I'll be updating the debian package repository this evening. You can download the new release here: https://github.com/richardlehane/siegfried/releases/tag/v1.9.2

Thanks again for taking the time to report this.