mnunberg / perl-Couchbase-Client

Perl Client for Couchbase
http://www.couchbase.com
16 stars 11 forks source link

Setting local timeout causes death #24

Closed ivulfson closed 10 years ago

ivulfson commented 10 years ago

Test script, with relevant line numbers:

    25      use Carp;
    26      $SIG{__DIE__} = sub { Carp::cluck @_; };
    27  
    28      my $doc = Couchbase::Document->new('test');
    29      {
    30          local $CB->settings->{operation_timeout} = 1;
    31          $CB->get($doc);
    32      }
    33      print "test doc version: " . $doc->value->{version} . "\n";

Output with local:

creating test doc
Passed empty value at /usr/local/lib64/perl5/Couchbase/Settings.pm line 38.
 at ./t line 26
    main::__ANON__('Passed empty value at /usr/local/lib64/perl5/Couchbase/Settin...') called at /usr/local/lib64/perl5/Couchbase/Settings.pm line 38
    Couchbase::Settings::__ANON__('Couchbase::Bucket=SCALAR(0xf11f08)', undef) called at /usr/local/lib64/perl5/Couchbase/Settings.pm line 76
    Couchbase::Settings::STORE('Couchbase::Settings=HASH(0xf11f20)', 'operation_timeout', undef) called at ./t line 30
    main::main() called at ./t line 15
Passed empty value at /usr/local/lib64/perl5/Couchbase/Settings.pm line 38.

Output without local:

creating test doc
test doc version: 1

One can work around this by saving current value of operation_timeout, changing it, and restoring it to the previous value after get(), but... :)

ivulfson commented 10 years ago

Hm. Also looks like changing the operation_timeout setting doesn't always work. I have this code when connecting to the bucket:

    my $CB = Couchbase::Bucket->new("couchbase://$cfg->{cb_server}/$cfg->{cb_bucket}", { password => $cfg->{cb_password} });

    $CB->settings->{http_timeout}      = $cfg->{cb_http_timeout} if $cfg->{cb_http_timeout};
    $CB->settings->{kv_timeout}        = $cfg->{cb_kv_timeout}   if $cfg->{cb_kv_timeout};
    $CB->settings->{operation_timeout} = $cfg->{cb_op_timeout}   if $cfg->{cb_op_timeout};
    $CB->settings->{view_timeout}      = $cfg->{cb_view_timeout} if $cfg->{cb_view_timeout};

Here, cb_op_timeout is 30 seconds. Later on, I do a batched insert() on 100k documents, setting operation_timeout (using the work-around above) to 300 seconds. However, I'm getting Client-Side timeout (errnum = 23) after 30 seconds anyway on a part of the documents. But when I sleep and retry the insert() on the failed documents, I get The key already exists in the server (errnum = 12).

mnunberg commented 10 years ago

The rest is a bug. I can answer for errnum=12, though.

The timeout is a client side timeout in which it failed to receive a reply from the server in due time. Now of course, this does not mean the item was not set, but that the local network was not quick enough to receive an acknowledgement from the server.

I'll continue investigating the rest. I could not repro on my current env (OS X), so I will check in EL6 and see what gives

mnunberg commented 10 years ago

Hrm, I could still not replicate this, even on CentOS with perl v5.10.1. I would imagine there may be a bug, but this bug would reside elsewhere in the code.

Can you provide a self contained example to replicate this?

According to the XS code:

    if (!SvOK(value)) {
        die("Passed empty value");
    }
mnunberg commented 10 years ago

And I got somehting to replicate it, so nvm :)

mnunberg commented 10 years ago

Now I feel really stupid:

From http://search.cpan.org/~rjbs/perl-5.18.4/pod/perltie.pod:

BUGS ^

The bucket usage information provided by scalar(%hash) is not available. What this means is that using %tied_hash in boolean context doesn't work right (currently this always tests false, regardless of whether the hash is empty or hash elements).

Localizing tied arrays or hashes does not work. After exiting the scope the arrays or the hashes are not restored.
ivulfson commented 10 years ago

Well, that's fairly obscure... News to me as well. A built-in library that doesn't work right. Heh.

mnunberg commented 10 years ago

I think I found the issue.. It's a bit obscure and is specific to older Perls. It has to do with localizing values. Apparently older versions do something like:

  1. Get the old value
  2. Set the value to undef
  3. Finally set the new value

While newer versions seem to skip step 2.

ivulfson commented 10 years ago

That doesn't explain why I'm getting client-side timeouts after 25 seconds (that's what it's been during the last few tests I ran) when I'm setting operation_timeout to 300 seconds.

        my $op_timeout = $CB->settings->{operation_timeout};
        $CB->settings->{operation_timeout} = 300;

        my $batch = $CB->batch;
        if ($mode == MODE_REPLACE) { $batch->replace($_, $args) for @docs; }
        elsif ($mode == MODE_INSERT) { $batch->insert($_, $args) for @docs; }
        elsif ($mode == MODE_UPSERT) { $batch->upsert($_, $args) for @docs; }
        elsif ($mode == MODE_REMOVE) { $batch->remove($_, $args) for @docs; }

        $batch->wait_all;

        $CB->settings->{operation_timeout} = $op_timeout;

        my %errors  = (); 
        my @tmpfail = ();
        foreach my $doc (@docs) {
            if ($doc->is_ok) {
                # NOOP
            } elsif ($doc->is_not_found || $doc->is_cas_mismatch || $doc->is_already_exists) {
                # collision, need to redo
                warn "Collision $doing " . $doc->id . ", need to redo: " . $doc->errstr . " (" . $doc->errnum . ")\n";
                push @redo_ids, $doc->id;
            } elsif ($doc->errnum == COUCHBASE_ETMPFAIL
                || $doc->errnum == COUCHBASE_CLIENT_ETMPFAIL
                || $doc->errnum == COUCHBASE_NETWORK_ERROR
                || $doc->errnum == COUCHBASE_ETIMEDOUT)
            {
                # tmpfail, will redo
                my $error = $doc->errstr . " (" . $doc->errnum . ")";
                warn "Temporary failure $doing " . $doc->id . ", will redo (other similar errors are silenced): $error\n" if !$errors{$error}++;
                push @tmpfail, $doc;
            } else {
                die "Error $doing " . $doc->id . ": " . $doc->errstr . " (" . $doc->errnum . ")\n";
            }
        }

What I get is, about 30 second delay during wait_all, and then about 90% of the documents get client-side failure. I then sleep 5 and run through the @tmpfail docs again, only to get Collision warnings for all the documents. So, somehow, Couchbase wrote all 100k documents (that's the batch size).

mnunberg commented 10 years ago

One thing that might address your issue. Keep in mind that timeouts are set from the moment the command is created, e.g. $cb->upsert(), and not from the time $batch->wait_all is called. If you are indeed loading 100k documents, mind the actual wall time it's taking for your script to load this all into memory.

You'd probably do better partitioning your batches (maybe via splice) so that it amounts to 1k per batch. It's better on the network that way anyway.

mnunberg commented 10 years ago

See if the above patch fixes your issue wiht localized hashes.

Apparently, it's only localized hashes themselves that cannot be tied. Their elements are OK.

ivulfson commented 10 years ago

Yes, local now works.

This is still weird:

Thu Oct 16 21:41:10 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
Thu Oct 16 21:41:11 2014 DEBUG: Mine::Couchbase::cb_write: Waiting for 100000 docs
Thu Oct 16 21:41:35 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
Thu Oct 16 21:41:35 2014 WARNING: Temporary failure inserting /KEY/, will redo (other similar errors are silenced): Client-Side timeout exceeded for operation. Inspect 
Thu Oct 16 21:41:36 2014 WARNING: Found 77866 errors: Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout (23)
Thu Oct 16 21:41:36 2014 WARNING: Temporary failure inserting 77866 docs, will redo
Thu Oct 16 21:41:36 2014 DEBUG: Mine::Couchbase::cb_write: Sleeping 5 seconds before redo
Thu Oct 16 21:41:41 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 77866 docs
Thu Oct 16 21:41:42 2014 DEBUG: Mine::Couchbase::cb_write: Waiting for 77866 docs
Thu Oct 16 21:41:42 2014 DEBUG: Mine::Couchbase::cb_write: Checking 77866 docs
Thu Oct 16 21:41:42 2014 WARNING: Collision inserting /KEY/, need to redo (other similar errors are silenced): The key already exists in the server. If you have supplie
Thu Oct 16 21:41:42 2014 WARNING: Found 77866 errors: The key already exists in the server. If you have supplied a CAS then the key exists with a CAS value different th
Thu Oct 16 21:41:42 2014 DEBUG: Mine::Couchbase::cb_write: Made changes to 22134 docs, need to redo 77866 docs

Watch the time differences. The $batch->insert() happen after "Inserting" log message. So, 25 seconds between "Inserting" and "Checking".

log("inserting...");
$batch->insert($_) for @docs;
log("waiting...");
$batch->wait_all;
log("checking..."): # this is 25 seconds after insert() is done
for() loop finds 77k client-side errors out of 100k docs
sleep 5
log("inserting...");
$batch->insert($_) for @docs;
log("waiting...");
$batch->wait_all;
log("checking..."): # this is 25 seconds after insert() is done
for() loop finds that all 77k docs are already in the bucket

I'm specifically setting operation_timeout = 300 before $batch->insert() is called.

ivulfson commented 10 years ago

I'm not sure if it's a related issue, but I can't affect the code behavior by setting different timeout values.

mnunberg commented 10 years ago

You aren't using any of the persist_to or replicate_to here, are you? Unfortunately those cover a different timeout setting

mnunberg commented 10 years ago

I'd also be curious to see what happens if you set LCB_LOGLEVEL=5 in your environment to get the client logs..

ivulfson commented 10 years ago

Yes, I am indeed using persist_to=1 and replicate_to=0. Doh! OK, set durability_timeout to 300 as well now. That took care of the client-side timeout issue! - now 100k insert()s finish in around 150-170 seconds.

For some reason I started seeing timeouts after only a few seconds - the first few times I tried updated code. But this went away after I flushed the bucket a few times and retried. Can't replicate it now, so assuming working until proven otherwise.

LCB_LOGLEVEL=5 is a neat trick.

{   
    $log->debug("Setting operation timeout to $cfg->{cb_batch_timeout}") if $log && $cfg && $cfg->{cb_batch_timeout};
    local $CB->settings->{operation_timeout} = $cfg->{cb_batch_timeout} if $cfg && $cfg->{cb_batch_timeout};
    $log->debug("Setting durability timeout to $cfg->{cb_batch_timeout}")
      if $log && $cfg && $cfg->{cb_batch_timeout} && ($args->{persist_to} || $args->{replicate_to});
    local $CB->settings->{durability_timeout} = $cfg->{cb_batch_timeout}
      if $cfg && $cfg->{cb_batch_timeout} && ($args->{persist_to} || $args->{replicate_to});

    my $batch = $CB->batch;
    if ($mode == MODE_REPLACE) { $batch->replace($_, $args) for @docs; }
    elsif ($mode == MODE_INSERT) { $batch->insert($_, $args) for @docs; }
    elsif ($mode == MODE_UPSERT) { $batch->upsert($_, $args) for @docs; }
    elsif ($mode == MODE_REMOVE) { $batch->remove($_, $args) for @docs; }

    $log->debug("Waiting for " . @docs . " docs") if $log;
    $batch->wait_all;
}   

Thank you! I'm closing this bug now.

ivulfson commented 10 years ago

That was premature... It starts off fine, but after it inserts about 1M keys, the timeouts start mis-behaving. Here's the output with the relevant log lines, with time diffs in seconds as the first columns in the output:

$ egrep 'Inserting|Checking|WARN' 1|td
      Fri Oct 17 10:12:41 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
 164  Fri Oct 17 10:15:25 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
  10  Fri Oct 17 10:15:35 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
 167  Fri Oct 17 10:18:22 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
   9  Fri Oct 17 10:18:31 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
 166  Fri Oct 17 10:21:17 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
  12  Fri Oct 17 10:21:29 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
 161  Fri Oct 17 10:24:10 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
   8  Fri Oct 17 10:24:18 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
 153  Fri Oct 17 10:26:51 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
   8  Fri Oct 17 10:26:59 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
 156  Fri Oct 17 10:29:35 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
   8  Fri Oct 17 10:29:43 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
 153  Fri Oct 17 10:32:16 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
   8  Fri Oct 17 10:32:24 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
 192  Fri Oct 17 10:35:36 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
   9  Fri Oct 17 10:35:45 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
 176  Fri Oct 17 10:38:41 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
   8  Fri Oct 17 10:38:49 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
 104  Fri Oct 17 10:40:33 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
# So, only 104 seconds after next batch of 100k insert()s
   0  Fri Oct 17 10:40:33 2014 WARNING: Temporary failure inserting key103, will redo (other similar errors are silenced): Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout (23)
   0  Fri Oct 17 10:40:33 2014 WARNING: Found 17234 errors: Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout (23)
   0  Fri Oct 17 10:40:33 2014 WARNING: Temporary failure inserting 17234 docs, will redo
   5  Fri Oct 17 10:40:38 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 17234 docs
   1  Fri Oct 17 10:40:39 2014 DEBUG: Mine::Couchbase::cb_write: Checking 17234 docs
# Tried to re-insert() the same 17234 docs, this time got a collision
   0  Fri Oct 17 10:40:39 2014 WARNING: Collision inserting key103, need to redo (other similar errors are silenced): The key already exists in the server. If you have supplied a CAS then the key exists with a CAS value different than specified (12)
   0  Fri Oct 17 10:40:39 2014 WARNING: Found 17234 errors: The key already exists in the server. If you have supplied a CAS then the key exists with a CAS value different than specified (12)
  15  Fri Oct 17 10:40:54 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
   7  Fri Oct 17 10:41:01 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
# Inserting next batch of 100k docs, got a timeout only 7 seconds later
   0  Fri Oct 17 10:41:01 2014 WARNING: Temporary failure inserting key201, will redo (other similar errors are silenced): Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout (23)
   0  Fri Oct 17 10:41:01 2014 WARNING: Found 100000 errors: Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout (23)
   0  Fri Oct 17 10:41:01 2014 WARNING: Temporary failure inserting 100000 docs, will redo
   5  Fri Oct 17 10:41:06 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
   3  Fri Oct 17 10:41:09 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
# And now all 100k got a collision
   0  Fri Oct 17 10:41:09 2014 WARNING: Collision inserting key201, need to redo (other similar errors are silenced): The key already exists in the server. If you have supplied a CAS then the key exists with a CAS value different than specified (12)
   1  Fri Oct 17 10:41:10 2014 WARNING: Found 100000 errors: The key already exists in the server. If you have supplied a CAS then the key exists with a CAS value different than specified (12)
  15  Fri Oct 17 10:41:25 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
   6  Fri Oct 17 10:41:31 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
# and again, 100k insert()s timeout only 6 seconds later
   0  Fri Oct 17 10:41:31 2014 WARNING: Temporary failure inserting key117, will redo (other similar errors are silenced): Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout (23)
   0  Fri Oct 17 10:41:31 2014 WARNING: Found 100000 errors: Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout (23)
   0  Fri Oct 17 10:41:31 2014 WARNING: Temporary failure inserting 100000 docs, will redo
   5  Fri Oct 17 10:41:36 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
   3  Fri Oct 17 10:41:39 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
   0  Fri Oct 17 10:41:39 2014 WARNING: Collision inserting key117, need to redo (other similar errors are silenced): The key already exists in the server. If you have supplied a CAS then the key exists with a CAS value different than specified (12)
   1  Fri Oct 17 10:41:40 2014 WARNING: Found 100000 errors: The key already exists in the server. If you have supplied a CAS then the key exists with a CAS value different than specified (12)
  14  Fri Oct 17 10:41:54 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
   7  Fri Oct 17 10:42:01 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
   0  Fri Oct 17 10:42:01 2014 WARNING: Temporary failure inserting key146, will redo (other similar errors are silenced): Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout (23)
   0  Fri Oct 17 10:42:01 2014 WARNING: Found 100000 errors: Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout (23)
   0  Fri Oct 17 10:42:01 2014 WARNING: Temporary failure inserting 100000 docs, will redo
   5  Fri Oct 17 10:42:06 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
   4  Fri Oct 17 10:42:10 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
   0  Fri Oct 17 10:42:10 2014 WARNING: Collision inserting key146, need to redo (other similar errors are silenced): The key already exists in the server. If you have supplied a CAS then the key exists with a CAS value different than specified (12)
   0  Fri Oct 17 10:42:10 2014 WARNING: Found 100000 errors: The key already exists in the server. If you have supplied a CAS then the key exists with a CAS value different than specified (12)
ivulfson commented 10 years ago

This is running against 3-node cluster, with 2 replicas, persist_to=>3, replicate_to=>2, and both operation_timeout and durability_timeout being set to 600 every time before the insert()s are performed in batch.

ivulfson commented 10 years ago

I'm working on a self-contained test for you.

mnunberg commented 10 years ago

Would be interesting to see the client logs on this (might want to filter out any 'STALE COMMAND' or similar message).

Please set the detailed_errcodes=1 option in the connection string as well (I should probably expose it via ->settings as well). This may also help shed some light.

For your use case, I should probably add an endure() method (the persist_to and replicate_to parameters are quite inefficient for large bulk operations, as they generate quite a bit of traffic when polling).

ivulfson commented 10 years ago

OK, I have a self-contained test which reproduces this. I opened a new issue #25 with full code and output.