p-alik / perl-Gearman

9 stars 10 forks source link

Can't use string ("eof") as a SCALAR ref while "strict refs" in use at .../Gearman/ResponseParser.pm line 175. #50

Open esabol opened 1 month ago

esabol commented 1 month ago

Hi, Alexei!

Job server failure: Can't use string ("eof") as a SCALAR ref while "strict refs" in use at .../Gearman/ResponseParser.pm line 175.

I saw this error twice recently when I was testing adding a large number of tasks to a task set, each with "on_complete" callbacks. The jobs when processed by workers also "set_status" to update a completion percentage. It happens very rarely, and I think I've only ever seen this error with jobs that "set_status".

Gearman::Util::_read_sock can return the string "eof", of course, but I don't see how $err in Gearman::ResponseParser::parse_sock() can be anything but a reference to a SCALAR? I'm not seeing the bug! Do you?

But just to prevent crashes, I've changed Gearman/ResponseParser.pm line 175 to this:

        $self->on_error("read_error: " . (ref($err) eq 'SCALAR' ? ${$err} : $err));
esabol commented 1 month ago

Hmmm... After making the above change, I've eliminated the crashes, but I think $tasks->wait hangs indefinitely now when this error is encountered. 😞

p-alik commented 1 month ago

Hi Ed, it's a while I've read Perl code. I suppose the problem occurs if Gearman::Util::_read_sock ends with return (0, "eof") unless $rv; because sysread returns a valid length of 0 and subsequently Gearman::Util::read_res_packet ends in

elsif ($ok == 0) {
            return $err->($err_code);
}

Do I see that correctly that the condition of elsif should be enhanced $err_code ne "eof" by?

elsif ($ok == 0 && $err_code ne "eof") {
            return $err->($err_code);
}
esabol commented 1 month ago

Thanks for your reply, Alexei. I realize you've moved on from Perl programming, so I appreciate you taking the time.

Let's say we change

elsif ($ok == 0) {
            return $err->($err_code);
}

to

elsif ($ok == 0 && $err_code ne "eof") {
            return $err->($err_code);
}

as you suggested. What happens in the case of $ok == 0 && $err_code eq "eof"? It just keeps going?

Without testing, my thinking is we need to get out of the loop? Maybe with a "next" or "last"?

elsif ($ok == 0) {
            next if $err_code eq 'eof'; # or last?
            return $err->($err_code);
}

What do you think?

esabol commented 1 month ago

Getting back to Gearman/ResponseParser.pm line 175, I've done various tests, and I'm convinced it should just be:

        $self->on_error("read_error: $err");

$err will never be a reference to anything there.

Whether or not 'eof' should be considered an error condition and how 'eof' should be handled are separate issues.

I've tested all of the following prospective changes in Gearman/Util.pm:

elsif ($ok == 0 && $err_code ne "eof") {
            return $err->($err_code);
}

This definitely doesn't work. Every task hangs.

Next, I tried this:

elsif ($ok == 0) {
            next if $err_code eq "eof";
            return $err->($err_code);
}

This resulted in an infinite loop when "eof" is encountered.

This seems to work well, but I'm not sure it's actually an improvement:

elsif ($ok == 0) {
            last if $err_code eq "eof";
            return $err->($err_code);
}

I still get the occasional hang (about one in 500 tasks). I turned on DEBUG and added a bunch of additional DEBUG statements to Gearman::Util::read_res_packet() and _read_sock(). I'm not seeing any "eof" (or other) error codes when these hangs occur, so I'm not sure what's going on. I'm not sure where it's hanging? It doesn't seem to be in Gearman::Util::read_res_packet(). I think it's elsewhere. I'll keep digging. It's probably due to a gearmand bug, I'm guessing.

p-alik commented 1 month ago

Hi Ed,

What happens in the case of $ok == 0 && $err_code eq "eof"? It just keeps going? Without testing, my thinking is we need to get out of the loop? Maybe with a "next" or "last"? I'm not quite sure about loop interruption. I guess in "eof" case Gearman::Util::read_res_packet should proceed til Gearman/Util.pm#L196-L208.

I'm not sure where it's hanging? Something changed indeed. I've seen two different error patterns:

  • for Perl < 5.38
    unexpected packet type: error [JOB_NOT_FOUND -- Job given in work result not found]
    unexpected packet type: error [JOB_NOT_FOUND -- Job given in work result not found]
  • for Perl 5.38
    t/15-priority.t ................. ok
    ERROR 2024-07-16 11:58:51.000000 [     3 ] realloc(char, count: 1844674407370[9](https://github.com/p-alik/perl-Gearman/actions/runs/9956350145/job/27506063338#step:11:10)551615 size: 1) -> libgearman-server/io.cc:835

    But independently from the Perl version the tests are hanging with gearman-job-server amd64 1.1.19.1

esabol commented 1 month ago

I've seen two different error patterns:

Do both of these happen all the time? Or is it just a rare random failure?

I'm just seeing the rare random hang. Since yesterday, my testing would seem to indicate that this is due to a Perl worker getting pegged near 100% CPU. So it's not the client where the hang is happening, apparently. The client is just waiting for the job result as it should.

But independently from the Perl version the tests are hanging with gearman-job-server amd64 1.1.19.1

Well, 1.1.21 is the latest version of gearmand. There have been a lot of commits since 1.1.19.1, but I'm not sure what might be the cause there.

esabol commented 1 month ago

Are you using localhost in your GitHub Actions CI? Some of those errors I'm seeing in the logs of your GitHub Actions CI might be the fault of the GitHub Actions runner, actually. This might explain why you're seeing new failures. GitHub Actions has been a thorn in my side ever since we had to stop using Travis CI. Things fail in GitHub Actions that don't fail on any real system, especially networking things. For example, the GitHub Actions runners don't support IPv6 when running inside a Docker container, but the /etc/hosts on most Linux distros assign ::1 the name localhost. This causes problems for any networking code that uses localhost. For gearmand, I have to hack /etc/hosts to remove localhost from the definition of::1 in order to get the gearmand CI to pass the tests. It was so annoying to figure that out! You might need to do something similar if you're running gearmand inside your GitHub Actions CI. Here's what I used:

           case "${GITHUB_ACTIONS_CONTAINER}" in
              ubuntu*)
                 apt-get -o update && DEBIAN_FRONTEND=noninteractive apt-get -o Acquire::Retries=3 -y install iproute2
                 echo "==="
                 echo "Before: /etc/hosts"
                 cat /etc/hosts
                 echo "==="
                 echo "Removing localhost name from ::1 entry in /etc/hosts..."
                 sed 's/^::1\s\s*localhost\s\(.*\)/::1 \1/' /etc/hosts > /tmp/hosts.temp
                 cp /tmp/hosts.temp /etc/hosts
                 rm /tmp/hosts.temp
                 echo "After: /etc/hosts"
                 cat /etc/hosts
                 echo "==="
                 ip addr
                 echo "==="
                 echo "'hostname -i' shows '$(hostname -i)'"
                 echo "'hostname -I' shows '$(hostname -I)'"
                 echo "'hostname -s' shows '$(hostname -s)'"
                 echo "'hostname -f' shows '$(hostname -f)'"
                 ;;
              *)
                 ;;
           esac