Closed: hmiguim closed this issue 2 years ago.
thanks for reporting this Miguel, I'll investigate
@richardlehane any updates on this? As a workaround, we've downgraded siegfried to 1.7.13, where the issue doesn't seem to affect us.
hi @luis100 @hmiguim I am wondering if you have any more information about what's triggering the error? I have reinstalled 1.9.1 on Ubuntu 21.04 with Go 1.16.4, and I am able to get results from both GET and POST requests on the command line and through the web UI.
Before running, I ran sf -update to make sure the signature file was up to date as well.
Example GET:
http://localhost:5138/identify/%2Ftmp%2Ftest?format=json
Example POST:
curl "http://localhost:5138/identify?format=json&hash=crc" -F file=@/tmp/test
Perhaps your command, operating environment, or build are having an impact here?
NB. It looks like the error begins here. We can see what happens when we step through it with more info.
NB++. I have a Windows installation available running 1.9.1 from the releases. I can also run the following in the browser (I don't have access to cURL on this machine to try that):
C:\Users\RS>sf -serve localhost:5381
http://127.0.0.1:5381/identify/C%3A%5Ctemp%5Cgifs%5C1880412.png?format=json
We are using a docker image based on Debian where we install siegfried and use it in server mode.
docker run -it --rm keeps/roda:v4.1.1 bash
# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 11 (bullseye)
Release: 11
Codename: bullseye
# sf -version
siegfried 1.9.1
/usr/share/siegfried/default.sig (2020-10-06T19:13:40+02:00)
identifiers:
- pronom: DROID_SignatureFile_V97.xml; container-signature-20201001.xml
config:
- multi: 32
# supervisorctl status siegfried
siegfried RUNNING pid 696055, uptime 0:41:37
# ps aux | grep sf
root 696055 8.7 1.1 1886728 187856 ? Sl 14:03 3:39 sf -serve localhost:5138
The issue only occurs when we run many threads against the siegfried server from the RODA web application. I tried to reproduce the issue using command-line tools, but could not make it fail in the same way. We use base64 encoding of the path, but I tried both encodings just in case.
# apt-get update; apt-get install -y parallel libany-uri-escape-perl
# find / -type f -not -path '/proc/*' -not -path '/sys/*' -not -empty | tee /tmp/filelist | wc -l
find: ‘/proc/10/map_files’: Permission denied
19195
# parallel --progress --linebuffer 'wget -q --server-response -O /dev/null "http://localhost:5138/identify/{= use URI::Escape; $_=uri_escape($_) =}?base64=false&format=json" 2>&1 | awk "/^ HTTP/{print $2}" ' :::: /tmp/filelist | sort | uniq -c
Computers / CPU cores / Max jobs to run
1:local / 8 / 8
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/19195/100%/0.0s
19195 HTTP/1.1 200 OK
# cat /tmp/filelist | parallel --progress --linebuffer 'echo "http://localhost:5138/identify/{= use MIME::Base64; $_=encode_base64($_, ""); =}?base64=true&format=json" | sed s/\\\\//g | xargs -I% wget -q --server-response -O /dev/null "%" 2>&1 | awk "/^ HTTP/{print $2}"' | sort | uniq -c
Computers / CPU cores / Max jobs to run
1:local / 8 / 8
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/19195/100%/0.0s
19195 HTTP/1.1 200 OK
# cat /var/log/supervisor/siegfried-stderr-*.log
No relevant errors
But the error occurs consistently in production servers when using siegfried 1.9.1. There is also some correlation between panics and a higher amount of scans in empty files, but I am not sure that is relevant.
# grep panic -A 10 /var/log/supervisor/siegfried-stderr---supervisor-70990jtu.log
panic: sync: negative WaitGroup counter
goroutine 20 [running]:
sync.(*WaitGroup).Add(0xc005f55ac0, 0xffffffffffffffff)
/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:99
main.printer(0xc00088d440, 0xc00010f7c0)
/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:168 +0x145
created by main.main
/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:423 +0xceb
--
panic: sync: negative WaitGroup counter
goroutine 5 [running]:
sync.(*WaitGroup).Add(0xc0058859b0, 0xffffffffffffffff)
/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:99
main.printer(0xc00080d860, 0xc0001017c0)
/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:168 +0x145
created by main.main
/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:423 +0xceb
--
panic: sync: negative WaitGroup counter
goroutine 5 [running]:
sync.(*WaitGroup).Add(0xc005f7f970, 0xffffffffffffffff)
/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:99
main.printer(0xc00018b4a0, 0xc0001e0f00)
/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:168 +0x145
created by main.main
/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:423 +0xceb
--
panic: sync: negative WaitGroup counter
goroutine 8 [running]:
sync.(*WaitGroup).Add(0xc0063ff600, 0xffffffffffffffff)
/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:99
main.printer(0xc000830c00, 0xc000101770)
/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:168 +0x145
created by main.main
/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:423 +0xceb
# cat /var/log/supervisor/siegfried-stderr-*.log | grep "empty source" | wc -l
1065
We will continue to try to reproduce the issue in a way we can provide it to you. Thank you
Expanding testing onto a production server with a much larger amount of files, errors start to creep in. There are some other failures in the log, all complaining about empty files, and there are some 404 errors that come from files with a '~' character in their names (not sure why), but many failures are due to panics. On a panic, the service shuts down, but supervisord restarts it.
The panic errors come in pairs or tuples, and this seems to be an improbable race condition.
# find /roda/data/storage/aip/ -type f | parallel --progress --linebuffer 'wget -q --server-response -O /dev/null "http://localhost:5138/identify/{= use URI::Escape; $_=uri_escape($_) =}?base64=false&format=json" 2>&1 | awk "/^ HTTP/{print $2}"' | sort | uniq -c
Computers / CPU cores / Max jobs to run
1:local / 4 / 4
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/704393/100%/0.0s
702492 HTTP/1.1 200 OK
1901 HTTP/1.1 404 Not Found
# grep panic /var/log/supervisor/siegfried-stderr---supervisor-70990jtu.log | wc -l
35
# grep panic -A 10 /var/log/supervisor/siegfried-stderr---supervisor-70990jtu.log
panic: sync: negative WaitGroup counter
goroutine 20 [running]:
sync.(*WaitGroup).Add(0xc005f55ac0, 0xffffffffffffffff)
/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:99
main.printer(0xc00088d440, 0xc00010f7c0)
/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:168 +0x145
created by main.main
/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:423 +0xceb
--
panic: sync: negative WaitGroup counter
goroutine 5 [running]:
sync.(*WaitGroup).Add(0xc0058859b0, 0xffffffffffffffff)
/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:74 +0x147
sync.(*WaitGroup).Done(...)
/home/travis/.gimme/versions/go1.15.2.linux.amd64/src/sync/waitgroup.go:99
main.printer(0xc00080d860, 0xc0001017c0)
/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:168 +0x145
created by main.main
/home/travis/gopath/src/github.com/richardlehane/siegfried/cmd/sf/sf.go:423 +0xceb
--
Sorry for my delay in resolving this, it is quite a tricky one!
I have made some progress.
Hopefully good news soon :)
I've found a quick fix for you: rather than running with -multi 32, run your server with -multi 1, i.e. sf -multi 1 -serve localhost:8080.
You can make this the default with sf -multi 1 -setconf.
I found in this mode I wasn't able to crash the server with my script, even with a large corpus size.
Changing this setting should only impact performance if you are giving siegfried directories to scan. If each of your HTTP requests sends a single filename, then it shouldn't make any difference to the speed, and you get concurrency anyway because you can send multiple requests at a time.
Thank you for your update. We will try the quick fix for now; let us know when there's a final solution to the issue.
I found and fixed a race condition in the code over the weekend that in my testing has resolved this bug. Please re-open if you see it again.
You can access the fix in the latest release (v1.9.2). I'll be updating the Debian package repository this evening. You can download the new release here: https://github.com/richardlehane/siegfried/releases/tag/v1.9.2
Thanks again for taking the time to report this.
Siegfried version: 1.9.1 in server mode
siegfried-stderr.log: