Hm. Also looks like changing the operation_timeout setting doesn't always work. I have this code when connecting to the bucket:
my $CB = Couchbase::Bucket->new("couchbase://$cfg->{cb_server}/$cfg->{cb_bucket}", { password => $cfg->{cb_password} });
$CB->settings->{http_timeout} = $cfg->{cb_http_timeout} if $cfg->{cb_http_timeout};
$CB->settings->{kv_timeout} = $cfg->{cb_kv_timeout} if $cfg->{cb_kv_timeout};
$CB->settings->{operation_timeout} = $cfg->{cb_op_timeout} if $cfg->{cb_op_timeout};
$CB->settings->{view_timeout} = $cfg->{cb_view_timeout} if $cfg->{cb_view_timeout};
Here, cb_op_timeout is 30 seconds. Later on, I do a batched insert() on 100k documents, setting operation_timeout (using the work-around above) to 300 seconds. However, I'm getting Client-Side timeout (errnum = 23) after 30 seconds anyway on a part of the documents. But when I sleep and retry the insert() on the failed documents, I get The key already exists in the server (errnum = 12).
The rest is a bug. I can answer for errnum=12, though.
The timeout is a client-side timeout, meaning the client failed to receive a reply from the server in due time. Of course, this does not mean the item was not set; it means the local network was not quick enough to deliver the server's acknowledgement.
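For what it's worth, here's a minimal sketch of how a caller can cope with that (assuming the error constants and Couchbase::Document predicates used elsewhere in this thread; this is not a prescribed pattern from the library): after a client-side timeout, an insert() retry that fails with "key exists" usually means the original write did reach the server.

sub insert_with_retry {
    my ($cb, $doc) = @_;
    $cb->insert($doc);
    return 1 if $doc->is_ok;
    if ($doc->errnum == COUCHBASE_ETIMEDOUT) {
        sleep 1;    # give the pending write time to land
        $cb->insert($doc);
        # "key exists" here usually means the first attempt reached the
        # server before the client gave up waiting for the acknowledgement
        return 1 if $doc->is_ok || $doc->is_already_exists;
    }
    return 0;
}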
I'll continue investigating the rest. I could not repro on my current env (OS X), so I will check in EL6 and see what gives
Hrm, I could still not replicate this, even on CentOS with perl v5.10.1. I would imagine there may be a bug, but this bug would reside elsewhere in the code.
Can you provide a self contained example to replicate this?
According to the XS code:

if (!SvOK(value)) {
    die("Passed empty value");
}
And I got something to replicate it, so nvm :)
Now I feel really stupid:
From http://search.cpan.org/~rjbs/perl-5.18.4/pod/perltie.pod:
BUGS

The bucket usage information provided by scalar(%hash) is not available. What this means is that using %tied_hash in boolean context doesn't work right (currently this always tests false, regardless of whether the hash is empty or has elements).

Localizing tied arrays or hashes does not work. After exiting the scope the arrays or the hashes are not restored.
Well, that's fairly obscure... News to me as well. A built-in library that doesn't work right. Heh.
I think I found the issue. It's a bit obscure and specific to older Perls. It has to do with local-izing values. Apparently older versions do something like:

1. Save the old value of the element
2. STORE undef into the element
3. STORE the new value

While newer versions seem to skip step 2.
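A quick way to observe this (a minimal sketch using a hypothetical TraceHash class, not part of the Couchbase module) is to tie a hash, localize an element, and watch which STORE calls fire:

use strict;
use warnings;

package TraceHash;   # hypothetical demo class
sub TIEHASH { bless {}, shift }
sub FETCH   { $_[0]{$_[1]} }
sub STORE   {
    my ($self, $k, $v) = @_;
    warn "STORE($k => " . (defined $v ? $v : 'undef') . ")\n";
    $self->{$k} = $v;
}
sub EXISTS  { exists $_[0]{$_[1]} }
sub DELETE  { delete $_[0]{$_[1]} }

package main;
tie my %h, 'TraceHash';
$h{operation_timeout} = 2.5;
{
    # On older perls this may emit STORE(operation_timeout => undef)
    # before STORE(operation_timeout => 300); newer perls go straight
    # to storing the new value.
    local $h{operation_timeout} = 300;
}
# leaving the scope restores the old value with another STORE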
That doesn't explain why I'm getting client-side timeouts after 25 seconds (that's what it's been during the last few tests I ran) when I'm setting operation_timeout to 300 seconds.
my $op_timeout = $CB->settings->{operation_timeout};
$CB->settings->{operation_timeout} = 300;
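# raised for this batch only; restored from $op_timeout after wait_all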
my $batch = $CB->batch;
if ($mode == MODE_REPLACE) { $batch->replace($_, $args) for @docs; }
elsif ($mode == MODE_INSERT) { $batch->insert($_, $args) for @docs; }
elsif ($mode == MODE_UPSERT) { $batch->upsert($_, $args) for @docs; }
elsif ($mode == MODE_REMOVE) { $batch->remove($_, $args) for @docs; }
$batch->wait_all;
$CB->settings->{operation_timeout} = $op_timeout;
my %errors = ();
my @tmpfail = ();
my @redo_ids = ();
foreach my $doc (@docs) {
if ($doc->is_ok) {
# NOOP
} elsif ($doc->is_not_found || $doc->is_cas_mismatch || $doc->is_already_exists) {
# collision, need to redo
warn "Collision $doing " . $doc->id . ", need to redo: " . $doc->errstr . " (" . $doc->errnum . ")\n";
push @redo_ids, $doc->id;
} elsif ($doc->errnum == COUCHBASE_ETMPFAIL
|| $doc->errnum == COUCHBASE_CLIENT_ETMPFAIL
|| $doc->errnum == COUCHBASE_NETWORK_ERROR
|| $doc->errnum == COUCHBASE_ETIMEDOUT)
{
# tmpfail, will redo
my $error = $doc->errstr . " (" . $doc->errnum . ")";
warn "Temporary failure $doing " . $doc->id . ", will redo (other similar errors are silenced): $error\n" if !$errors{$error}++;
push @tmpfail, $doc;
} else {
die "Error $doing " . $doc->id . ": " . $doc->errstr . " (" . $doc->errnum . ")\n";
}
}
What I get is about a 30-second delay during wait_all, and then about 90% of the documents get a client-side failure. I then sleep 5 and run through the @tmpfail docs again, only to get Collision warnings for all the documents. So, somehow, Couchbase wrote all 100k documents (that's the batch size).
One thing that might address your issue: keep in mind that timeouts are set from the moment the command is created, e.g. $cb->upsert(), and not from the time $batch->wait_all is called. If you are indeed loading 100k documents, mind the actual wall time it takes for your script to load them all into memory.
You'd probably do better partitioning your batches (maybe via splice) so that it amounts to 1k per batch. It's better on the network that way anyway.
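A minimal sketch of that partitioning (reusing @docs, $args, and the $CB handle from the snippets above; the chunk size is just a suggestion):

my @queue = @docs;
while (my @chunk = splice(@queue, 0, 1000)) {
    my $batch = $CB->batch;
    $batch->insert($_, $args) for @chunk;
    $batch->wait_all;
    # check $_->is_ok / $_->errnum for each document in @chunk here,
    # while its timeout clock is still close to the wall clock
}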
See if the above patch fixes your issue with localized hashes.
Apparently, it's only tied hashes themselves that cannot be localized; their elements are OK.
Yes, local now works.
This is still weird:
Thu Oct 16 21:41:10 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
Thu Oct 16 21:41:11 2014 DEBUG: Mine::Couchbase::cb_write: Waiting for 100000 docs
Thu Oct 16 21:41:35 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
Thu Oct 16 21:41:35 2014 WARNING: Temporary failure inserting /KEY/, will redo (other similar errors are silenced): Client-Side timeout exceeded for operation. Inspect
Thu Oct 16 21:41:36 2014 WARNING: Found 77866 errors: Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout (23)
Thu Oct 16 21:41:36 2014 WARNING: Temporary failure inserting 77866 docs, will redo
Thu Oct 16 21:41:36 2014 DEBUG: Mine::Couchbase::cb_write: Sleeping 5 seconds before redo
Thu Oct 16 21:41:41 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 77866 docs
Thu Oct 16 21:41:42 2014 DEBUG: Mine::Couchbase::cb_write: Waiting for 77866 docs
Thu Oct 16 21:41:42 2014 DEBUG: Mine::Couchbase::cb_write: Checking 77866 docs
Thu Oct 16 21:41:42 2014 WARNING: Collision inserting /KEY/, need to redo (other similar errors are silenced): The key already exists in the server. If you have supplie
Thu Oct 16 21:41:42 2014 WARNING: Found 77866 errors: The key already exists in the server. If you have supplied a CAS then the key exists with a CAS value different th
Thu Oct 16 21:41:42 2014 DEBUG: Mine::Couchbase::cb_write: Made changes to 22134 docs, need to redo 77866 docs
Watch the time differences. The $batch->insert() calls happen after the "Inserting" log message. So, 25 seconds pass between "Inserting" and "Checking".
log("inserting...");
$batch->insert($_) for @docs;
log("waiting...");
$batch->wait_all;
log("checking..."): # this is 25 seconds after insert() is done
for() loop finds 77k client-side errors out of 100k docs
sleep 5
log("inserting...");
$batch->insert($_) for @docs;
log("waiting...");
$batch->wait_all;
log("checking..."): # this is 25 seconds after insert() is done
for() loop finds that all 77k docs are already in the bucket
I'm specifically setting operation_timeout = 300 before $batch->insert() is called.
I'm not sure if it's a related issue, but I can't affect the code behavior by setting different timeout values.
You aren't using any of the persist_to or replicate_to options here, are you? Unfortunately those are governed by a different timeout setting.

I'd also be curious to see what happens if you set LCB_LOGLEVEL=5 in your environment to get the client logs.
Yes, I am indeed using persist_to=1 and replicate_to=0. Doh! OK, set durability_timeout to 300 as well now. That took care of the client-side timeout issue! Now 100k insert()s finish in around 150-170 seconds.
For some reason I started seeing timeouts after only a few seconds the first few times I tried the updated code, but this went away after I flushed the bucket a few times and retried. I can't replicate it now, so I'm assuming it works until proven otherwise.
LCB_LOGLEVEL=5 is a neat trick.
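A minimal sketch, assuming libcouchbase reads LCB_LOGLEVEL when the instance is created, so it can also be set from inside the script before the bucket object is constructed (logs go to stderr; 5 is the most verbose level):

$ENV{LCB_LOGLEVEL} = 5;
my $cb = Couchbase::Bucket->new(
    "couchbase://$cfg->{cb_server}/$cfg->{cb_bucket}",
    { password => $cfg->{cb_password} },
);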
{
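# 'local' scopes these overrides to the enclosing block; the previous
# settings are restored automatically when the block exits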
$log->debug("Setting operation timeout to $cfg->{cb_batch_timeout}") if $log && $cfg && $cfg->{cb_batch_timeout};
local $CB->settings->{operation_timeout} = $cfg->{cb_batch_timeout} if $cfg && $cfg->{cb_batch_timeout};
$log->debug("Setting durability timeout to $cfg->{cb_batch_timeout}")
if $log && $cfg && $cfg->{cb_batch_timeout} && ($args->{persist_to} || $args->{replicate_to});
local $CB->settings->{durability_timeout} = $cfg->{cb_batch_timeout}
if $cfg && $cfg->{cb_batch_timeout} && ($args->{persist_to} || $args->{replicate_to});
my $batch = $CB->batch;
if ($mode == MODE_REPLACE) { $batch->replace($_, $args) for @docs; }
elsif ($mode == MODE_INSERT) { $batch->insert($_, $args) for @docs; }
elsif ($mode == MODE_UPSERT) { $batch->upsert($_, $args) for @docs; }
elsif ($mode == MODE_REMOVE) { $batch->remove($_, $args) for @docs; }
$log->debug("Waiting for " . @docs . " docs") if $log;
$batch->wait_all;
}
Thank you! I'm closing this bug now.
That was premature... It starts off fine, but after it inserts about 1M keys, the timeouts start misbehaving. Here's the output with the relevant log lines, with time diffs in seconds as the first column:
$ egrep 'Inserting|Checking|WARN' 1|td
Fri Oct 17 10:12:41 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
164 Fri Oct 17 10:15:25 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
10 Fri Oct 17 10:15:35 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
167 Fri Oct 17 10:18:22 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
9 Fri Oct 17 10:18:31 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
166 Fri Oct 17 10:21:17 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
12 Fri Oct 17 10:21:29 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
161 Fri Oct 17 10:24:10 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
8 Fri Oct 17 10:24:18 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
153 Fri Oct 17 10:26:51 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
8 Fri Oct 17 10:26:59 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
156 Fri Oct 17 10:29:35 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
8 Fri Oct 17 10:29:43 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
153 Fri Oct 17 10:32:16 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
8 Fri Oct 17 10:32:24 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
192 Fri Oct 17 10:35:36 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
9 Fri Oct 17 10:35:45 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
176 Fri Oct 17 10:38:41 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
8 Fri Oct 17 10:38:49 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
104 Fri Oct 17 10:40:33 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
# So, only 104 seconds after next batch of 100k insert()s
0 Fri Oct 17 10:40:33 2014 WARNING: Temporary failure inserting key103, will redo (other similar errors are silenced): Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout (23)
0 Fri Oct 17 10:40:33 2014 WARNING: Found 17234 errors: Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout (23)
0 Fri Oct 17 10:40:33 2014 WARNING: Temporary failure inserting 17234 docs, will redo
5 Fri Oct 17 10:40:38 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 17234 docs
1 Fri Oct 17 10:40:39 2014 DEBUG: Mine::Couchbase::cb_write: Checking 17234 docs
# Tried to re-insert() the same 17234 docs, this time got a collision
0 Fri Oct 17 10:40:39 2014 WARNING: Collision inserting key103, need to redo (other similar errors are silenced): The key already exists in the server. If you have supplied a CAS then the key exists with a CAS value different than specified (12)
0 Fri Oct 17 10:40:39 2014 WARNING: Found 17234 errors: The key already exists in the server. If you have supplied a CAS then the key exists with a CAS value different than specified (12)
15 Fri Oct 17 10:40:54 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
7 Fri Oct 17 10:41:01 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
# Inserting next batch of 100k docs, got a timeout only 7 seconds later
0 Fri Oct 17 10:41:01 2014 WARNING: Temporary failure inserting key201, will redo (other similar errors are silenced): Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout (23)
0 Fri Oct 17 10:41:01 2014 WARNING: Found 100000 errors: Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout (23)
0 Fri Oct 17 10:41:01 2014 WARNING: Temporary failure inserting 100000 docs, will redo
5 Fri Oct 17 10:41:06 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
3 Fri Oct 17 10:41:09 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
# And now all 100k got a collision
0 Fri Oct 17 10:41:09 2014 WARNING: Collision inserting key201, need to redo (other similar errors are silenced): The key already exists in the server. If you have supplied a CAS then the key exists with a CAS value different than specified (12)
1 Fri Oct 17 10:41:10 2014 WARNING: Found 100000 errors: The key already exists in the server. If you have supplied a CAS then the key exists with a CAS value different than specified (12)
15 Fri Oct 17 10:41:25 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
6 Fri Oct 17 10:41:31 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
# and again, 100k insert()s timeout only 6 seconds later
0 Fri Oct 17 10:41:31 2014 WARNING: Temporary failure inserting key117, will redo (other similar errors are silenced): Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout (23)
0 Fri Oct 17 10:41:31 2014 WARNING: Found 100000 errors: Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout (23)
0 Fri Oct 17 10:41:31 2014 WARNING: Temporary failure inserting 100000 docs, will redo
5 Fri Oct 17 10:41:36 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
3 Fri Oct 17 10:41:39 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
0 Fri Oct 17 10:41:39 2014 WARNING: Collision inserting key117, need to redo (other similar errors are silenced): The key already exists in the server. If you have supplied a CAS then the key exists with a CAS value different than specified (12)
1 Fri Oct 17 10:41:40 2014 WARNING: Found 100000 errors: The key already exists in the server. If you have supplied a CAS then the key exists with a CAS value different than specified (12)
14 Fri Oct 17 10:41:54 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
7 Fri Oct 17 10:42:01 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
0 Fri Oct 17 10:42:01 2014 WARNING: Temporary failure inserting key146, will redo (other similar errors are silenced): Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout (23)
0 Fri Oct 17 10:42:01 2014 WARNING: Found 100000 errors: Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout (23)
0 Fri Oct 17 10:42:01 2014 WARNING: Temporary failure inserting 100000 docs, will redo
5 Fri Oct 17 10:42:06 2014 DEBUG: Mine::Couchbase::cb_write: Inserting 100000 docs
4 Fri Oct 17 10:42:10 2014 DEBUG: Mine::Couchbase::cb_write: Checking 100000 docs
0 Fri Oct 17 10:42:10 2014 WARNING: Collision inserting key146, need to redo (other similar errors are silenced): The key already exists in the server. If you have supplied a CAS then the key exists with a CAS value different than specified (12)
0 Fri Oct 17 10:42:10 2014 WARNING: Found 100000 errors: The key already exists in the server. If you have supplied a CAS then the key exists with a CAS value different than specified (12)
This is running against a 3-node cluster with 2 replicas, persist_to=>3, replicate_to=>2, and both operation_timeout and durability_timeout set to 600 every time before the insert()s are performed in batch.
I'm working on a self-contained test for you.
Would be interesting to see the client logs on this (might want to filter out any 'STALE COMMAND' or similar message).
Please set the detailed_errcodes=1 option in the connection string as well (I should probably expose it via ->settings too). This may also help shed some light.
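For example (a minimal sketch; the option is assumed to be accepted as a connection-string query parameter, per the comment above):

my $cb = Couchbase::Bucket->new(
    "couchbase://$cfg->{cb_server}/$cfg->{cb_bucket}?detailed_errcodes=1",
    { password => $cfg->{cb_password} },
);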
For your use case, I should probably add an endure() method (the persist_to and replicate_to parameters are quite inefficient for large bulk operations, as they generate quite a bit of traffic when polling).
OK, I have a self-contained test which reproduces this. I opened a new issue #25 with full code and output.
Test script, with relevant line numbers: (not captured here; the full code is in issue #25)

Output with local: (not captured)

Output without local: (not captured)

One can work around this by saving the current value of operation_timeout, changing it, and restoring it to the previous value after get(), but... :)
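A minimal sketch of that work-around (Couchbase::Document usage as in the module's synopsis; "some_key" is just a placeholder):

my $saved = $CB->settings->{operation_timeout};
$CB->settings->{operation_timeout} = 300;
my $doc = Couchbase::Document->new("some_key");
$CB->get($doc);
$CB->settings->{operation_timeout} = $saved;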