Open esabol opened 1 month ago
Hmmm... After making the above change, I've eliminated the crashes, but I think $tasks->wait
hangs indefinitely now when this error is encountered. 😞
Hi Ed, it's a while I've read Perl code. I suppose the problem occurs if Gearman::Util::_read_sock ends with return (0, "eof") unless $rv;
because sysread returns a valid length of 0
and subsequently Gearman::Util::read_res_packet ends in
elsif ($ok == 0) {
return $err->($err_code);
}
Do I see that correctly that the condition of elsif
should be enhanced $err_code ne "eof" by?
elsif ($ok == 0 && $err_code ne "eof") {
return $err->($err_code);
}
Thanks for your reply, Alexei. I realize you've moved on from Perl programming, so I appreciate you taking the time.
Let's say we change
elsif ($ok == 0) {
return $err->($err_code);
}
to
elsif ($ok == 0 && $err_code ne "eof") {
return $err->($err_code);
}
as you suggested. What happens in the case of $ok == 0 && $err_code eq "eof"
? It just keeps going?
Without testing, my thinking is we need to get out of the loop? Maybe with a "next" or "last"?
elsif ($ok == 0) {
next if $err_code eq 'eof'; # or last?
return $err->($err_code);
}
What do you think?
Getting back to Gearman/ResponseParser.pm
line 175, I've done various tests, and I'm convinced it should just be:
$self->on_error("read_error: $err");
$err
will never be a reference to anything there.
Whether or not 'eof' should be considered an error condition and how 'eof' should be handled are separate issues.
I've tested all of the following prospective changes in Gearman/Util.pm
:
elsif ($ok == 0 && $err_code ne "eof") {
return $err->($err_code);
}
This definitely doesn't work. Every task hangs.
Next, I tried this:
elsif ($ok == 0) {
next if $err_code eq "eof";
return $err->($err_code);
}
This resulted in an infinite loop when "eof" is encountered.
This seems to work well, but I'm not sure it's actually an improvement:
elsif ($ok == 0) {
last if $err_code eq "eof";
return $err->($err_code);
}
I still get the occasional hang (about one in 500 tasks). I turned on DEBUG and added a bunch of additional DEBUG statements to Gearman::Util::read_res_packet()
and _read_sock()
. I'm not seeing any "eof" (or other) error codes when these hangs occur, so I'm not sure what's going on. I'm not sure where it's hanging? It doesn't seem to be in Gearman::Util::read_res_packet()
. I think it's elsewhere. I'll keep digging. It's probably due to a gearmand bug, I'm guessing.
Hi Ed,
What happens in the case of $ok == 0 && $err_code eq "eof"? It just keeps going? Without testing, my thinking is we need to get out of the loop? Maybe with a "next" or "last"? I'm not quite sure about loop interruption. I guess in "eof" case
Gearman::Util::read_res_packet
should proceed til Gearman/Util.pm#L196-L208.I'm not sure where it's hanging? Something changed indeed. I've seen two different error patterns:
- for Perl < 5.38
unexpected packet type: error [JOB_NOT_FOUND -- Job given in work result not found] unexpected packet type: error [JOB_NOT_FOUND -- Job given in work result not found]
- for Perl 5.38
t/15-priority.t ................. ok ERROR 2024-07-16 11:58:51.000000 [ 3 ] realloc(char, count: 1844674407370[9](https://github.com/p-alik/perl-Gearman/actions/runs/9956350145/job/27506063338#step:11:10)551615 size: 1) -> libgearman-server/io.cc:835
But independently from the Perl version the tests are hanging with gearman-job-server amd64 1.1.19.1
I've seen two different error patterns:
Do both of these happen all the time? Or is it just a rare random failure?
I'm just seeing the rare random hang. Since yesterday, my testing would seem to indicate that this is due to a Perl worker getting pegged near 100% CPU. So it's not the client where the hang is happening, apparently. The client is just waiting for the job result as it should.
But independently from the Perl version the tests are hanging with gearman-job-server amd64 1.1.19.1
Well, 1.1.21 is the latest version of gearmand. There have been a lot of commits since 1.1.19.1, but I'm not sure what might be the cause there.
Are you using localhost
in your GitHub Actions CI? Some of those errors I'm seeing in the logs of your GitHub Actions CI might be the fault of the GitHub Actions runner, actually. This might explain why you're seeing new failures. GitHub Actions has been a thorn in my side ever since we had to stop using Travis CI. Things fail in GitHub Actions that don't fail on any real system, especially networking things. For example, the GitHub Actions runners don't support IPv6 when running inside a Docker container, but the /etc/hosts
on most Linux distros assign ::1
the name localhost
. This causes problems for any networking code that uses localhost
. For gearmand, I have to hack /etc/hosts
to remove localhost
from the definition of::1
in order to get the gearmand CI to pass the tests. It was so annoying to figure that out! You might need to do something similar if you're running gearmand inside your GitHub Actions CI. Here's what I used:
case "${GITHUB_ACTIONS_CONTAINER}" in
ubuntu*)
apt-get -o update && DEBIAN_FRONTEND=noninteractive apt-get -o Acquire::Retries=3 -y install iproute2
echo "==="
echo "Before: /etc/hosts"
cat /etc/hosts
echo "==="
echo "Removing localhost name from ::1 entry in /etc/hosts..."
sed 's/^::1\s\s*localhost\s\(.*\)/::1 \1/' /etc/hosts > /tmp/hosts.temp
cp /tmp/hosts.temp /etc/hosts
rm /tmp/hosts.temp
echo "After: /etc/hosts"
cat /etc/hosts
echo "==="
ip addr
echo "==="
echo "'hostname -i' shows '$(hostname -i)'"
echo "'hostname -I' shows '$(hostname -I)'"
echo "'hostname -s' shows '$(hostname -s)'"
echo "'hostname -f' shows '$(hostname -f)'"
;;
*)
;;
esac
Hi, Alexei!
Job server failure: Can't use string ("eof") as a SCALAR ref while "strict refs" in use at .../Gearman/ResponseParser.pm line 175.
I saw this error twice recently when I was testing adding a large number of tasks to a task set, each with "on_complete" callbacks. The jobs when processed by workers also "set_status" to update a completion percentage. It happens very rarely, and I think I've only ever seen this error with jobs that "set_status".
Gearman::Util::_read_sock
can return the string "eof", of course, but I don't see how$err
inGearman::ResponseParser::parse_sock()
can be anything but a reference to a SCALAR? I'm not seeing the bug! Do you?But just to prevent crashes, I've changed
Gearman/ResponseParser.pm
line 175 to this: