mnunberg / perl-Couchbase-Client

Perl Client for Couchbase
http://www.couchbase.com

Bad things [still] happen during server rebalance #18

Closed ivulfson closed 10 years ago

ivulfson commented 10 years ago

I have a script that does a batch get() of 100k keys by id from the bucket. If you start rebalancing the cluster when the script is running, this happens:

Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
encountered object 'Couchbase::Document=ARRAY(0x3fba26f0)', but neither allow_blessed nor convert_blessed settings are enabled at ./dump.pl line 51.
Attempt to free unreferenced scalar: SV 0x43fd0fc0, Perl interpreter: 0x10c0010 at ./dump.pl line 51.
Attempt to free unreferenced scalar: SV 0x43f83d10, Perl interpreter: 0x10c0010 during global destruction.
Attempt to free unreferenced scalar: SV 0x43f86788, Perl interpreter: 0x10c0010 during global destruction.
Attempt to free unreferenced scalar: SV 0x43f86d88, Perl interpreter: 0x10c0010 during global destruction.
Attempt to free unreferenced scalar: SV 0x43f87388, Perl interpreter: 0x10c0010 during global destruction.
Attempt to free unreferenced scalar: SV 0x43f89e00, Perl interpreter: 0x10c0010 during global destruction.
Attempt to free unreferenced scalar: SV 0x43f8a400, Perl interpreter: 0x10c0010 during global destruction.
Attempt to free unreferenced scalar: SV 0x43f8aa00, Perl interpreter: 0x10c0010 during global destruction.
Attempt to free unreferenced scalar: SV 0x43f8c548, Perl interpreter: 0x10c0010 during global destruction.

and then it dies. I remember there was a similar issue in libcouchbase 2.0.7 and 2.1.3, but I'm using libcouchbase 2.4.2, perl-Couchbase 2.0.0_2, and hitting Couchbase server 3.0.0.

Here's the [ugly] yet relevant Perl code with line numbers:

    36      my $i = 0;
    37      while (my @batch = splice(@keys, 0, $cfg->{batch_size})) {
    38          $i += @batch;
    39          print "$i\n";
    40  
    41          my @docs = map { Couchbase::Document->new($_) } @batch;
    42  
    43          while (@docs) {
    44              my $batch = $CB->batch;
    45              $batch->get($_) for @docs;
    46  
    47              my @redo_docs = ();
    48              my %errors    = ();
    49              while (my $doc = $batch->wait_one) {
    50                  if ($doc->is_ok) {
    51                      print $fh $doc->id . "\t" . $JSON->encode($doc->value) . "\n";
    52                  } else {
    53                      my $error = $doc->errnum;
    54                      warn "Error getting " . $doc->id . ", will redo (other similar errors are silenced): $error\n" if !$errors{$error}++;
    55                      push @redo_docs, $doc;
    56                  }
    57              }
    58  
    59              @docs = @redo_docs;
    60              if (@redo_docs) {
    61                  foreach my $error (sort keys %errors) {
    62                      warn "Found $errors{$error} errors: $error\n";
    63                  }
    64                  print "sleeping...\n";
    65                  sleep 3;
    66              }
    67          }
    68      }

mnunberg commented 10 years ago

encountered object 'Couchbase::Document=ARRAY(0x3fba26f0)', but neither allow_blessed nor convert_blessed settings are enabled at ./dump.pl line 51.

This text (the allow_blessed and convert_blessed) is part of the older client. I am not sure how it managed to get into what you're using now. The only thing I can think of is that you are somehow loading both versions of the library and one symbol is overriding the other (if so, I should probably rename the function prefix...).

ivulfson commented 10 years ago

I'll try to remove all installed client RPMs and double check that there are no remnants of the old files anywhere, reinstall and retry. I'll let you know either way by Monday.

ivulfson commented 10 years ago

Alright, I've removed these RPMs:

libcouchbase-devel
libcouchbase2-core
perl-Couchbase (built from Couchbase::Client 2.0.0_2)

Ran updatedb, located anything matching couchbase or Couchbase and removed it, reinstalled the libcouchbase RPMs, recompiled the Couchbase::Client 2.0.0_2 lib into an RPM, and installed it.

I start the script, wait a little, then hit the Rebalance button after adding a new node to the cluster.

dumping to dump.out
100000
200000
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
encountered object 'Couchbase::Document=ARRAY(0x43c35d20)', but neither allow_blessed nor convert_blessed settings are enabled at ./dump.pl line 51.
Attempt to free unreferenced scalar: SV 0x4a081418, Perl interpreter: 0x129f010 at ./dump.pl line 51.
Attempt to free unreferenced scalar: SV 0x4a081490, Perl interpreter: 0x129f010 during global destruction.
Attempt to free unreferenced scalar: SV 0x4a081460, Perl interpreter: 0x129f010 during global destruction.
Attempt to free unreferenced scalar: SV 0x4a081448, Perl interpreter: 0x129f010 during global destruction.
Attempt to free unreferenced scalar: SV 0x4a07eef8, Perl interpreter: 0x129f010 during global destruction.
Attempt to free unreferenced scalar: SV 0x4a081430, Perl interpreter: 0x129f010 during global destruction.
Attempt to free unreferenced scalar: SV 0x4a081670, Perl interpreter: 0x129f010 during global destruction.
Attempt to free unreferenced scalar: SV 0x42143238, Perl interpreter: 0x129f010 during global destruction.
Attempt to free unreferenced scalar: SV 0x42a0ac70, Perl interpreter: 0x129f010 during global destruction.
Attempt to free unreferenced scalar: SV 0x4a07f3a8, Perl interpreter: 0x129f010 during global destruction.

I even tried it a couple of times, to make sure I didn't miss anything, removing the ~/rpm (local cpan2rpm dir) and the perl-Couchbase-Client-master dir (this is based on your fork of the client, not the one in couchbaselabs).

RHEL6.4 64-bit, libcouchbase 2.4.2, Couchbase::Client 2.0.0_2, Couchbase server 3.0.0.

ivulfson commented 10 years ago

Oh, well, this is interesting. I waited for the rebalance to finish, so now I have a cluster with 2 nodes. Started the same script, and got the same error. So, it's not the rebalance itself, it's having 2 nodes in the cluster. Stranger yet, it doesn't happen right away; it actually gets some data out of the cluster before dying - ~3k documents (the batch size is 100k).

If it makes any difference, one node is on the same box that's running the script; the other node is on a different physical box on a different subnet.

mnunberg commented 10 years ago

This says the original flags are 0x02, which means compressed. Did you use compression with your existing dataset? (https://github.com/mnunberg/perl-Couchbase-Client/blob/2.0.0_0/xs/plcb-convert.h#L9).

Also note, the allow_blessed message is still only in an older version, so whatever you're using may somehow still be an older version (please be aware the newer module is Couchbase and not Couchbase::Client).

In any event, I've identified a bug in the new client where unrecognized flags fail to increment a refcount, which may be the cause of your crashes.

mnunberg commented 10 years ago

Regarding the allow_blessed message, I've determined this is actually from the JSON module. I do recall we had a similar message in the older client; my bad.

ivulfson commented 10 years ago

I didn't specifically enable compression. The bucket is created with default settings (except for a per-bucket password and flushable turned on). The data is inserted in batches:

push @batch, Couchbase::Document->new($key, $value); # $value is just a hashref

and

my $batch = $CB->batch;
$batch->upsert($_) for @batch;
$batch->wait_all;

So, no special flags are being set. I'm assuming that makes it default to "json".

I did clean the box of libcouchbase, Couchbase::Client, and anything else couchbase-related before running yum install libcouchbase-devel and compiling perl-Couchbase-Client-master against it.

mnunberg commented 10 years ago

Can you try with the latest master? I think this might have caused the issue. Still writing more tests and investigating.

ivulfson commented 10 years ago

Speaking of the JSON module: perl-Couchbase's Makefile.PL requires JSON 2.53; however, we're using JSON 2.50 everywhere, so I changed the Makefile to require that instead. Is there any particular reason you're requiring 2.53?

ivulfson commented 10 years ago

Yes, will try the new master. :)

ivulfson commented 10 years ago

OK, progress. With just-pulled master:

dumping to 1
100000
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
hash- or arrayref expected (not a simple scalar, use allow_nonref to allow this) at ./dump.pl line 51.
/dies/

This is running against an idle cluster with 2 nodes, no rebalance or reindexing or compaction in progress.

mnunberg commented 10 years ago

In this case it's likely expected behavior. The warning message basically says that a different client stored the objects with a value of 0x02 for the flags -- the flags indicate the value format.

The warning message tells you that since it cannot understand the format of the value, it will return it to you in the ->value field as a byte string (rather than trying to interpret it as JSON). Then when you try to encode that value as JSON again, you get a failure because it is a plain scalar string rather than a hash or array reference.
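
To illustrate that failure mode in isolation, here is a minimal sketch using only the JSON module (independent of Couchbase):

    use JSON;

    my $json = JSON->new->utf8;

    # A plain scalar is rejected by default...
    eval { $json->encode("raw bytes") };
    print "encode failed: $@" if $@;   # hash- or arrayref expected (not a simple scalar, ...)

    # ...but is accepted once allow_nonref is enabled.
    print $json->allow_nonref->encode("raw bytes"), "\n";   # "raw bytes"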

Are you using any other clients in your environment (the older Couchbase::Client, and possibly others)?

ivulfson commented 10 years ago

I added a Data::Dumper call in the while loop:

        while (my $doc = $batch->wait_one) {
            if ($doc->is_ok) {
                use Data::Dumper;
                $Data::Dumper::Sortkeys = 1;
                print Dumper($doc);
                print $fh $doc->id . "\t" . $JSON->encode($doc->value) . "\n";
            } else {
                my $error = $doc->errnum;
                warn "Error getting " . $doc->id . ", will redo (other similar errors are silenced): $error\n" if !$errors{$error}++;
                push @redo_docs, $doc;
            }   
        }   

This is the last document where it dies:

Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
Unrecognized flags 0x2. Assuming raw at ./dump.pl line 49.
$VAR1 = bless( [
                 bless( do{\(my $o = 1107330528)}, 'Couchbase::OpContext' ),
                 'KEY',   <------ actual key removed
                 'STRING UNDECODED JSON BLOB',   <------- Actual content removed, it's just an undecoded JSON string
                 0,
                 '8684169919969428480',
                 undef,
                 undef,
                 '33554432'
               ], 'Couchbase::Document' );
hash- or arrayref expected (not a simple scalar, use allow_nonref to allow this) at ./dump.pl line 54.

ivulfson commented 10 years ago

No other clients on this box, it's my dev VM. I did clean out all the older perl Couchbase::Client code.

ivulfson commented 10 years ago

So, since it's assuming raw, the JSON blob doesn't get decoded, $doc->value returns a plain scalar, and JSON::encode() complains that it doesn't do scalars. OK, now I understand. What's left is to figure out why it's getting the unrecognized flags 0x02.
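
A minimal workaround sketch for the dump loop, under the assumption that the raw bytes really are UTF-8 JSON text ($JSON, $doc, and $fh are the variables from the script above):

    # If ->value came back as a plain scalar because the flags were not
    # recognized, try to decode it ourselves before writing the dump line;
    # if that fails too, write the raw bytes as-is.
    my $value = $doc->value;
    if (ref $value) {
        print $fh $doc->id . "\t" . $JSON->encode($value) . "\n";
    } else {
        my $decoded = eval { $JSON->decode($value) };
        print $fh $doc->id . "\t"
                . (defined $decoded ? $JSON->encode($decoded) : $value) . "\n";
    }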

mnunberg commented 10 years ago

It's probably coming from an older version of some client library. Were you using the "Stable" Couchbase::Client (1.x) or the <= 2.0.0_1 version?

ivulfson commented 10 years ago

No, I'm using the fresh pull of this library. Both to populate the bucket and to dump data out of it.

Note that this happens only if there are two nodes in the cluster. It does not happen when there's only a single node.

mnunberg commented 10 years ago

The number of nodes should not make any difference. The population part is interesting, though. Are you specifying any specific flags (or format values) for the population phase?

ivulfson commented 10 years ago

Nope, no format or other optional fields.

push @batch, Couchbase::Document->new($key, $value); # $value is just a hashref

and

my $batch = $CB->batch;
$batch->upsert($_) for @batch;
$batch->wait_all;

Here, $value is a JSON-compatible hashref.

ivulfson commented 10 years ago

I will check that the bucket indeed contains a JSON blob for the failing doc. In about 30 min.

ivulfson commented 10 years ago

Just checked. The failing document - and it's always the same document over the few runs of the dumper script - has a proper JSON value in the bucket, not a string, but an actual JSON hash.

mnunberg commented 10 years ago

What does the cbc tool display as its flags? When storing JSON with the perl client, the value should be something like:

mnunberg@mbp15 ~/Source/perl-Couchbase-Client $ cbc cat foo
foo                  CAS=0xd13407801f010000, Flags=0x2000000, Datatype=0x0

{"name":"FOO","email":"foo@bar.com","friends":["bar","baz","blah","gah","Barrack H. Obama","George W. Bush"]}mnunberg@mbp15 ~/Source/perl-Couchbase-Client $ 

ivulfson commented 10 years ago

I can get just that one single document by key:

    my $doc = Couchbase::Document->new($key);
    $CB->get($doc);

and get the "Unrecognized flags 0x2. Assuming raw" warning, and the value is an undecoded JSON string.

Again, that was with two nodes, although I know you said it makes no difference.

So, then I tried this: I removed one of the nodes and rebalanced the cluster, resulting in a single node. After rebalance was finished, get() works just fine.

ivulfson commented 10 years ago

cbc cat output for a single-node cluster:

/KEY/ CAS=0x78845e8810e90400, Flags=0x2000000, Datatype=0x0

cbc cat output for a 2-node cluster, after rebalance, is the same exactly.

mnunberg commented 10 years ago

How are you initializing the client object? Are you passing all the nodes of the cluster, or just a single node? -- I still think you might actually be connecting to the wrong cluster (!). What is the output when you have two nodes? (You can use the -U option to cbc and pass the connection string you give to Couchbase::Bucket->new().)

ivulfson commented 10 years ago

Oh. Just a single node. However, adding the second node to the list doesn't make a difference.

my $CB = Couchbase::Bucket->new("couchbase://$cfg->{cb_server}/$cfg->{cb_bucket}", { password => $cfg->{cb_password} });

$cfg->{cb_server} is set to:

host1.domain.com,host2.domain.com

ivulfson commented 10 years ago

$ cbc cat -U couchbase://host1.domain.com,host2.domain.com/bucket -P password KEY
KEY      CAS=0x1627ec63f3ed0400, Flags=0x2, Datatype=0x0
{json string}
$ cbc cat -U couchbase://host1.domain.com/bucket -P password KEY
KEY      CAS=0x1627ec63f3ed0400, Flags=0x2, Datatype=0x0
{json string}
$ cbc cat -U couchbase://host2.domain.com/bucket -P password KEY
KEY      CAS=0x1627ec63f3ed0400, Flags=0x2, Datatype=0x0
{json string}

So, same output, regardless of which server is in the connection string. The output above where it had Flags=0x2000000 was when I ran cbc cat with just:

$ cbc cat -b bucket -P password KEY
KEY      CAS=0x78845e8810e90400, Flags=0x2000000, Datatype=0x0
{json string}

(so, connecting to just the server on the localhost)

mnunberg commented 10 years ago

It's quite clear that your cluster on localhost is not the same as your cluster on those other two nodes :)

ivulfson commented 10 years ago

The node on localhost is one of the 2 nodes in the cluster - it's the host1.domain.com == localhost. The other node is on another box - that's host2.domain.com

mnunberg commented 10 years ago

So let's try to eliminate some issues:

ivulfson commented 10 years ago

Weird, now both of these commands have [almost] the same output:

$ cbc cat -U couchbase://host1.domain.com/bucket -P password KEY
KEY      CAS=0x1627ec63f3ed0400, Flags=0x2, Datatype=0x0
{json}
$ cbc cat -b bucket -P password KEY
The -b/--bucket option is deprecated. Use connection string instead
  e.g. couchbase://HOSTS/bucket
KEY      CAS=0x1627ec63f3ed0400, Flags=0x2, Datatype=0x0
{json}

I don't see Flags=0x2000000 anymore. But get() still fails with "Unrecognized flags 0x2. Assuming raw", regardless of whether I'm connecting to host1, host2, or "host1,host2".

ivulfson commented 10 years ago

OK, good. I modified the item using the editor in Couchbase's web UI, by adding "aaa":"1" to the hash. Now:

$ cbc cat -U couchbase://host1.domain.com/bucket -P password KEY
KEY      CAS=0x4deeb7ed207d0500, Flags=0x0, Datatype=0x0
{"aaa":"1",/rest of json/}
$ cbc cat -U couchbase://host2.domain.com/bucket -P password KEY
KEY      CAS=0x4deeb7ed207d0500, Flags=0x0, Datatype=0x0
{"aaa":"1",/rest of json/}
$ cbc cat -U couchbase://localhost/bucket -P password KEY
KEY      CAS=0x4deeb7ed207d0500, Flags=0x0, Datatype=0x0
{"aaa":"1",/rest of json/}

Note that the flags are now 0x0.

mnunberg commented 10 years ago

Modifying the value from the UI will reset the flags (which is actually a bug in itself). But I guess we can confirm they are all indeed part of the same cluster.

ivulfson commented 10 years ago

The environment isn't production, it's my test boxes, and only I'm running stuff on there. No other scripts are modifying the buckets.

When I added a get() right after upsert(), the warning is gone, so I can't replicate this right now. I just flushed the bucket and am reloading the whole data set back, the way it was. While it's loading, I can tell that the KEY in question got loaded with Flags=0x2000000 - checked on localhost, host1 (which is the FQDN of localhost), and host2. A simple individual get() of a single key works as well.
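
For reference, a sketch of that immediate round-trip check (assuming Couchbase::Bucket exposes a single-document upsert() alongside the get() used earlier in this thread):

    # Store one document and read it straight back on the same connection,
    # before any rebalance, to confirm the value still decodes as JSON
    # (i.e. ->value is a reference rather than a raw byte string).
    my $doc = Couchbase::Document->new($key, $value);   # $value is a plain hashref
    $CB->upsert($doc);

    my $check = Couchbase::Document->new($key);
    $CB->get($check);
    warn "flags lost: value came back as a raw string\n" unless ref $check->value;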

ivulfson commented 10 years ago

Finally. I figured out a way to replicate this. Starting with a single-node cluster.

  1. load a few documents into the cluster. I used 16 documents.
  2. run cbc cat on all loaded documents to get their flags - all flags show up as Flags=0x2000000.
key48      CAS=0x2c710b7766f10400, Flags=0x2000000, Datatype=0x0
key49      CAS=0xa78b0f7766f10400, Flags=0x2000000, Datatype=0x0
key50      CAS=0x990127766f10400, Flags=0x2000000, Datatype=0x0
key51      CAS=0xeaee147766f10400, Flags=0x2000000, Datatype=0x0
key52      CAS=0xca35177766f10400, Flags=0x2000000, Datatype=0x0
key53      CAS=0x18b0217766f10400, Flags=0x2000000, Datatype=0x0
key54      CAS=0xce9d367766f10400, Flags=0x2000000, Datatype=0x0
key55      CAS=0x6ab3397766f10400, Flags=0x2000000, Datatype=0x0
key56      CAS=0x44f03b7766f10400, Flags=0x2000000, Datatype=0x0
key57      CAS=0xd8133e7766f10400, Flags=0x2000000, Datatype=0x0
key58      CAS=0xbd3c407766f10400, Flags=0x2000000, Datatype=0x0
key59      CAS=0x369e427766f10400, Flags=0x2000000, Datatype=0x0
key60      CAS=0xabdc457766f10400, Flags=0x2000000, Datatype=0x0
key61      CAS=0x10e54f7766f10400, Flags=0x2000000, Datatype=0x0
key62      CAS=0x317527766f10400, Flags=0x2000000, Datatype=0x0
key63      CAS=0xea74547766f10400, Flags=0x2000000, Datatype=0x0
  3. get() all documents to make sure they all populated correctly - they did.
  4. add another node to the cluster, rebalance.
  5. repeat cbc cat on all loaded documents to get their flags. This is where it gets sticky. On the original node (localhost, which is the same as host1):
key48      CAS=0x2c710b7766f10400, Flags=0x2000000, Datatype=0x0
key49      CAS=0xa78b0f7766f10400, Flags=0x2, Datatype=0x0
key50      CAS=0x990127766f10400, Flags=0x2, Datatype=0x0
key51      CAS=0xeaee147766f10400, Flags=0x2000000, Datatype=0x0
key52      CAS=0xca35177766f10400, Flags=0x2000000, Datatype=0x0
key53      CAS=0x18b0217766f10400, Flags=0x2, Datatype=0x0
key54      CAS=0xce9d367766f10400, Flags=0x2000000, Datatype=0x0
key55      CAS=0x6ab3397766f10400, Flags=0x2, Datatype=0x0
key56      CAS=0x44f03b7766f10400, Flags=0x2, Datatype=0x0
key57      CAS=0xd8133e7766f10400, Flags=0x2000000, Datatype=0x0
key58      CAS=0xbd3c407766f10400, Flags=0x2000000, Datatype=0x0
key59      CAS=0x369e427766f10400, Flags=0x2, Datatype=0x0
key60      CAS=0xabdc457766f10400, Flags=0x2000000, Datatype=0x0
key61      CAS=0x10e54f7766f10400, Flags=0x2, Datatype=0x0
key62      CAS=0x317527766f10400, Flags=0x2, Datatype=0x0
key63      CAS=0xea74547766f10400, Flags=0x2000000, Datatype=0x0

On the other node (host2):

key48      CAS=0x2c710b7766f10400, Flags=0x2000000, Datatype=0x0
key49      CAS=0xa78b0f7766f10400, Flags=0x2, Datatype=0x0
key50      CAS=0x990127766f10400, Flags=0x2, Datatype=0x0
key51      CAS=0xeaee147766f10400, Flags=0x2000000, Datatype=0x0
key52      CAS=0xca35177766f10400, Flags=0x2000000, Datatype=0x0
key53      CAS=0x18b0217766f10400, Flags=0x2, Datatype=0x0
key54      CAS=0xce9d367766f10400, Flags=0x2000000, Datatype=0x0
key55      CAS=0x6ab3397766f10400, Flags=0x2, Datatype=0x0
key56      CAS=0x44f03b7766f10400, Flags=0x2, Datatype=0x0
key57      CAS=0xd8133e7766f10400, Flags=0x2000000, Datatype=0x0
key58      CAS=0xbd3c407766f10400, Flags=0x2000000, Datatype=0x0
key59      CAS=0x369e427766f10400, Flags=0x2, Datatype=0x0
key60      CAS=0xabdc457766f10400, Flags=0x2000000, Datatype=0x0
key61      CAS=0x10e54f7766f10400, Flags=0x2, Datatype=0x0
key62      CAS=0x317527766f10400, Flags=0x2, Datatype=0x0
key63      CAS=0xea74547766f10400, Flags=0x2000000, Datatype=0x0

Getting any of the documents which were converted from Flags=0x2000000 to Flags=0x2 as part of the rebalance results in the "Unrecognized flags 0x2. Assuming raw" warning and undecoded JSON.

mnunberg commented 10 years ago

Oh dear. Can you verify the version/build of the 3.0 cluster you're using? This clearly indicates the flags being mangled while the CAS remains intact!

mnunberg commented 10 years ago

It seems like something somewhere is swapping the byte order incorrectly: htonl(0x2000000) is 2.
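
That relationship is easy to check from Perl with core pack/unpack (a minimal sketch):

    # 0x2000000 and 0x2 are the same 32-bit value with its byte order reversed:
    # pack as big-endian (network order), then unpack as little-endian.
    my $flags   = 0x2000000;
    my $swapped = unpack('V', pack('N', $flags));
    printf "0x%x byte-swapped -> 0x%x\n", $flags, $swapped;   # 0x2000000 byte-swapped -> 0x2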

ivulfson commented 10 years ago

Just for fun, I removed the original single node (host1) from the cluster and rebalanced. After rebalance, cbc cat on host2 shows:

key48      CAS=0x2c710b7766f10400, Flags=0x2, Datatype=0x0
key49      CAS=0xa78b0f7766f10400, Flags=0x2, Datatype=0x0
key50      CAS=0x990127766f10400, Flags=0x2, Datatype=0x0
key51      CAS=0xeaee147766f10400, Flags=0x2, Datatype=0x0
key52      CAS=0xca35177766f10400, Flags=0x2, Datatype=0x0
key53      CAS=0x18b0217766f10400, Flags=0x2, Datatype=0x0
key54      CAS=0xce9d367766f10400, Flags=0x2, Datatype=0x0
key55      CAS=0x6ab3397766f10400, Flags=0x2, Datatype=0x0
key56      CAS=0x44f03b7766f10400, Flags=0x2, Datatype=0x0
key57      CAS=0xd8133e7766f10400, Flags=0x2, Datatype=0x0
key58      CAS=0xbd3c407766f10400, Flags=0x2, Datatype=0x0
key59      CAS=0x369e427766f10400, Flags=0x2, Datatype=0x0
key60      CAS=0xabdc457766f10400, Flags=0x2, Datatype=0x0
key61      CAS=0x10e54f7766f10400, Flags=0x2, Datatype=0x0
key62      CAS=0x317527766f10400, Flags=0x2, Datatype=0x0
key63      CAS=0xea74547766f10400, Flags=0x2, Datatype=0x0

And for more fun, I then added host1 back (wondering if 0x2 would turn back to 0x2000000), and rebalanced. Now on host1:

key48      CAS=0x2c710b7766f10400, Flags=0x2, Datatype=0x0
key49      CAS=0xa78b0f7766f10400, Flags=0x2000000, Datatype=0x0
key50      CAS=0x990127766f10400, Flags=0x2000000, Datatype=0x0
key51      CAS=0xeaee147766f10400, Flags=0x2, Datatype=0x0
key52      CAS=0xca35177766f10400, Flags=0x2, Datatype=0x0
key53      CAS=0x18b0217766f10400, Flags=0x2000000, Datatype=0x0
key54      CAS=0xce9d367766f10400, Flags=0x2, Datatype=0x0
key55      CAS=0x6ab3397766f10400, Flags=0x2000000, Datatype=0x0
key56      CAS=0x44f03b7766f10400, Flags=0x2000000, Datatype=0x0
key57      CAS=0xd8133e7766f10400, Flags=0x2, Datatype=0x0
key58      CAS=0xbd3c407766f10400, Flags=0x2, Datatype=0x0
key59      CAS=0x369e427766f10400, Flags=0x2000000, Datatype=0x0
key60      CAS=0xabdc457766f10400, Flags=0x2, Datatype=0x0
key61      CAS=0x10e54f7766f10400, Flags=0x2000000, Datatype=0x0
key62      CAS=0x317527766f10400, Flags=0x2000000, Datatype=0x0
key63      CAS=0xea74547766f10400, Flags=0x2, Datatype=0x0

and on host2:

key48      CAS=0x2c710b7766f10400, Flags=0x2, Datatype=0x0
key49      CAS=0xa78b0f7766f10400, Flags=0x2000000, Datatype=0x0
key50      CAS=0x990127766f10400, Flags=0x2000000, Datatype=0x0
key51      CAS=0xeaee147766f10400, Flags=0x2, Datatype=0x0
key52      CAS=0xca35177766f10400, Flags=0x2, Datatype=0x0
key53      CAS=0x18b0217766f10400, Flags=0x2000000, Datatype=0x0
key54      CAS=0xce9d367766f10400, Flags=0x2, Datatype=0x0
key55      CAS=0x6ab3397766f10400, Flags=0x2000000, Datatype=0x0
key56      CAS=0x44f03b7766f10400, Flags=0x2000000, Datatype=0x0
key57      CAS=0xd8133e7766f10400, Flags=0x2, Datatype=0x0
key58      CAS=0xbd3c407766f10400, Flags=0x2, Datatype=0x0
key59      CAS=0x369e427766f10400, Flags=0x2000000, Datatype=0x0
key60      CAS=0xabdc457766f10400, Flags=0x2, Datatype=0x0
key61      CAS=0x10e54f7766f10400, Flags=0x2000000, Datatype=0x0
key62      CAS=0x317527766f10400, Flags=0x2000000, Datatype=0x0
key63      CAS=0xea74547766f10400, Flags=0x2, Datatype=0x0

So, yes, 0x2 did turn back into 0x2000000 - except for the opposite set of documents from before. /facepalm

ivulfson commented 10 years ago

Running this on both boxes: Version: 3.0.0 Community Edition (build-1209)

mnunberg commented 10 years ago

One more question. Are all the cluster nodes running the same architecture/OS?

ivulfson commented 10 years ago

RHEL6.4 64-bit. They're actually both VMs. host1 is a VM running on my desktop, and host2 is a VM running in our VMware cluster. So, same OS, but different underlying physical hardware for sure.

mnunberg commented 10 years ago

I'm just trying to rule out one machine not properly doing htonl or similar. Hrm. This would obviously be an issue with the internal replication protocol.

ivulfson commented 10 years ago

Let me change the layout to a single node host1 XDCR'ing to a remote host2, and check the flags.

ivulfson commented 10 years ago

OK, with host1 XDCR'ing to host2, both host1 and host2 show this:

key48      CAS=0x2c710b7766f10400, Flags=0x2000000, Datatype=0x0
key49      CAS=0xa78b0f7766f10400, Flags=0x2000000, Datatype=0x0
key50      CAS=0x990127766f10400, Flags=0x2000000, Datatype=0x0
key51      CAS=0xeaee147766f10400, Flags=0x2000000, Datatype=0x0
key52      CAS=0xca35177766f10400, Flags=0x2000000, Datatype=0x0
key53      CAS=0x18b0217766f10400, Flags=0x2000000, Datatype=0x0
key54      CAS=0xce9d367766f10400, Flags=0x2000000, Datatype=0x0
key55      CAS=0x6ab3397766f10400, Flags=0x2000000, Datatype=0x0
key56      CAS=0x44f03b7766f10400, Flags=0x2000000, Datatype=0x0
key57      CAS=0xd8133e7766f10400, Flags=0x2000000, Datatype=0x0
key58      CAS=0xbd3c407766f10400, Flags=0x2000000, Datatype=0x0
key59      CAS=0x369e427766f10400, Flags=0x2000000, Datatype=0x0
key60      CAS=0xabdc457766f10400, Flags=0x2000000, Datatype=0x0
key61      CAS=0x10e54f7766f10400, Flags=0x2000000, Datatype=0x0
key62      CAS=0x317527766f10400, Flags=0x2000000, Datatype=0x0
key63      CAS=0xea74547766f10400, Flags=0x2000000, Datatype=0x0

So, XDCR works.

mnunberg commented 10 years ago

According to my knowledge, 1209 is the latest build (both community and enterprise) for 3.0.0. Probably best to open an MB issue here (https://www.couchbase.com/issues/browse/MB). Let me know if you need help with that.

ivulfson commented 10 years ago

Ugh. Created a forums.couchbase.com account, but can't log in to couchbase.org. I can't deal with this right now; gotta go before my wife kills me for working on the weekend. :)

mnunberg commented 10 years ago

Alrighty, I'll file something for you - you can chime in at your convenience. Should stop working myself :)

mnunberg commented 10 years ago

https://www.couchbase.com/issues/browse/MB-12328

ivulfson commented 10 years ago

I just set up 3 VMs in the VMware cluster: host1, host2, host3. Started with a single host1 node, added host2. After rebalance, half of the flags switched from 0x2000000 to 0x2. Removed host2, rebalanced, and the flags recovered to 0x2000000. Added host2 and host3, rebalanced, and about two thirds of the flags are 0x2.

Definitely sounds like a problem with DCP. I can reproduce this reliably, and the Perl Couchbase client isn't involved at all.

I can flush the bucket and reload the documents, and the flags are restored to 0x2000000. That lasts until one of the nodes is added or removed and the cluster is rebalanced. Then whichever documents are moved from one node to another as a result of the rebalance get their flags flipped to 0x2.

mnunberg commented 10 years ago

I'm gonna close this issue (not ignore it, but this isn't an outstanding client issue).