vmware-archive / xenon

Xenon - Decentralized Control Plane Framework

service is gone after a new host joins the node group #9

Closed chaurwu closed 8 years ago

chaurwu commented 8 years ago

I have two nodes, NodeA and NodeC, and both host the CityService. CityService has ServiceOption.REPLICATION and ServiceOption.OWNER_SELECTION. I send a POST to NodeA to create city1; city1 exists on NodeA only. I then join NodeC with NodeA, and city1 is gone. /samples/cities on NodeA still shows an entry for city1, but /samples/cities/city1 returns 404. That looks like a bug.
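
(For reference, city1 was created with a POST along these lines. The exact body is an assumption based on the CityServiceState in the service code further down; the documentSelfLink just pins the child to /samples/cities/city1:)

curl -H "Content-Type: application/json" -X POST http://localhost:9912/samples/cities --data '{"cityname": "city1", "documentSelfLink": "city1"}'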

georgechrysanthakopoulos commented 8 years ago

Thanks for the issue. A few questions:

1) How do you join the nodes?
2) What is the quorum setting on each node?
3) Is the service marked with the proper service options (OWNER_SELECTION, REPLICATION, etc.)?
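
To see what quorum each node currently has, a GET on the node group on each node works:

curl http://localhost:<port>/core/node-groups/default

Each NodeState in the response includes its membershipQuorum.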

chaurwu commented 8 years ago

I use this command to join the two nodes:

curl -H "Content-Type: application/json" -X POST http://localhost:9914/core/node-groups/default --data @peer-join-default.json

The file peer-join-default.json has the following content:

{
  "kind": "com:vmware:xenon:services:common:NodeGroupService:JoinPeerRequest",
  "memberGroupReference": "http://localhost:9912/core/node-groups/default",
  "memberhsipQuorum": 1,
  "localNodeOptions": [ "PEER" ]
}

The code of the service is this:

import com.vmware.xenon.common.FactoryService;
import com.vmware.xenon.common.ServiceDocument;
import com.vmware.xenon.common.StatefulService;
import com.vmware.xenon.services.common.ServiceUriPaths;

public class CityService extends StatefulService {

    public static final String FACTORY_LINK = ServiceUriPaths.SAMPLES + "/cities";

    public static FactoryService createFactory() {
        return FactoryService.create(CityService.class);
    }

    public static class CityServiceState extends ServiceDocument {
        public String cityname;
    }

    public CityService() {
        super(CityServiceState.class);
        super.toggleOption(ServiceOption.INSTRUMENTATION, true);
        super.toggleOption(ServiceOption.REPLICATION, true);
        super.toggleOption(ServiceOption.OWNER_SELECTION, true);
    }
}
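
(For context, a minimal host that starts this factory would look roughly like the sketch below. The class name CityHost is just a placeholder, and it assumes the stock ServiceHost APIs: initialize, start, startDefaultCoreServicesSynchronously, startFactory.)

import com.vmware.xenon.common.ServiceHost;

public class CityHost extends ServiceHost {

    @Override
    public ServiceHost start() throws Throwable {
        super.start();
        // start the core services (node groups, node selectors, document index, ...)
        startDefaultCoreServicesSynchronously();
        // start the /samples/cities factory using the custom createFactory()
        super.startFactory(CityService.class, CityService::createFactory);
        return this;
    }

    public static void main(String[] args) throws Throwable {
        CityHost h = new CityHost();
        h.initialize(args); // parses --port, --sandbox, --peerNodes, etc.
        h.start();
    }
}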

georgechrysanthakopoulos commented 8 years ago

Great. What version of Xenon are you using? The latest? Please attach the process logs from each node (do a GET on /core/management/process-log).

Also, can you please set the quorum to 2 when you start the nodes? With a quorum of 1, which is the default when you dynamically join nodes, synchronization might run before the node group is stable.

You can set the node quorum when you join, or, better, when you start each node: -Dxenon.NodeState.membershipQuorum=2

See the multi-node tutorials.
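
For example, starting both nodes statically with quorum 2 looks roughly like this (the jar name and sandbox paths are placeholders; --port, --sandbox and --peerNodes are the stock ServiceHost.Arguments flags):

java -Dxenon.NodeState.membershipQuorum=2 -jar your-host-jar-with-dependencies.jar --port=9912 --sandbox=/tmp/xenon/nodeA --peerNodes=http://localhost:9912,http://localhost:9914

java -Dxenon.NodeState.membershipQuorum=2 -jar your-host-jar-with-dependencies.jar --port=9914 --sandbox=/tmp/xenon/nodeC --peerNodes=http://localhost:9912,http://localhost:9914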

chaurwu commented 8 years ago

The version of Xenon I use is 0.9.6, downloaded from the public Maven repository.

The logs of NodeA (listening on port 9912):

$ curl http://localhost:9912/core/management/process-log
{
  "items": [
    "[0][I][1475772393630][9912][startImpl][ServiceHost/2034680e listening on http://127.0.0.1:9912]",
    "[1][I][1475772394683][9912/core/node-groups/default][mergeRemoteAndLocalMembership][State updated, merge with node9912, self node9912, 1475772394682005]",
    "[3][I][1475772396717][9912/core/node-selectors/default-3x][checkAndScheduleSynchronization][Scheduling synchronization (1 nodes)]",
    "[2][I][1475772396717][9912/core/node-selectors/default][checkAndScheduleSynchronization][Scheduling synchronization (1 nodes)]",
    "[4][I][1475772396733][9912][scheduleNodeGroupChangeMaintenance][/core/node-selectors/default-3x 1475772396733001]",
    "[5][I][1475772396733][9912][scheduleNodeGroupChangeMaintenance][/core/node-selectors/default 1475772396733003]",
    "[6][I][1475772464084][9912/core/node-selectors/default][updateCachedNodeGroupState][Node count: 2]",
    "[7][I][1475772464084][9912/core/node-selectors/default-3x][updateCachedNodeGroupState][Node count: 2]",
    "[9][I][1475772466779][9912/core/node-selectors/default-3x][checkAndScheduleSynchronization][Scheduling synchronization (2 nodes)]",
    "[8][I][1475772466779][9912/core/node-selectors/default][checkAndScheduleSynchronization][Scheduling synchronization (2 nodes)]",
    "[10][I][1475772466808][9912][scheduleNodeGroupChangeMaintenance][/core/node-selectors/default-3x 1475772466808001]",
    "[11][I][1475772466808][9912][scheduleNodeGroupChangeMaintenance][/core/node-selectors/default 1475772466808003]",
    "[12][I][1475772469527][9912/core/query-tasks][lambda$verifyFactoryOwnership$21][node9912 elected as owner for factory /core/query-tasks. Starting synch ...]",
    "[13][I][1475772469528][9912/core/graph-queries][lambda$verifyFactoryOwnership$21][node9912 elected as owner for factory /core/graph-queries. Starting synch ...]",
    "[14][I][1475772469583][9912/core/tenants][lambda$verifyFactoryOwnership$21][node9912 elected as owner for factory /core/tenants. Starting synch ...]",
    "[15][I][1475772469584][9912/core/authz/resource-groups][lambda$verifyFactoryOwnership$21][node9912 elected as owner for factory /core/authz/resource-groups. Starting synch ...]",
    "[16][I][1475772469607][9912/core/auth/credentials][lambda$verifyFactoryOwnership$21][node9912 elected as owner for factory /core/auth/credentials. Starting synch ...]"
  ],
  "documentVersion": 0,
  "documentKind": "com:vmware:xenon:services:common:ServiceHostLogService:LogServiceState",
  "documentSelfLink": "/core/management/process-log",
  "documentUpdateTimeMicros": 0,
  "documentExpirationTimeMicros": 0
}

The logs of NodeC (listening on port 9914):

$ curl http://localhost:9914/core/management/process-log
{
  "items": [
    "[0][I][1475772390249][9914][startImpl][ServiceHost/2034680e listening on http://127.0.0.1:9914]",
    "[1][I][1475772391300][9914/core/node-groups/default][mergeRemoteAndLocalMembership][State updated, merge with node9914, self node9914, 1475772391299005]",
    "[2][I][1475772393320][9914/core/node-selectors/default-3x][checkAndScheduleSynchronization][Scheduling synchronization (1 nodes)]",
    "[3][I][1475772393320][9914/core/node-selectors/default][checkAndScheduleSynchronization][Scheduling synchronization (1 nodes)]",
    "[4][I][1475772393328][9914][scheduleNodeGroupChangeMaintenance][/core/node-selectors/default-3x 1475772393328001]",
    "[5][I][1475772393328][9914][scheduleNodeGroupChangeMaintenance][/core/node-selectors/default 1475772393328003]",
    "[6][I][1475772464050][9914/core/node-groups/default][handleJoinPost][Sending POST to http://localhost:9912/core/node-groups/default to insert self: {\"groupReference\":\"http://127.0.0.1:9914/core/node-groups/default\",\"status\":\"AVAILABLE\",\"options\":[\"PEER\"],\"id\":\"node9914\",\"membershipQuorum\":1,\"customProperties\":{},\"documentVersion\":2,\"documentKind\":\"com:vmware:xenon:services:common:NodeState\",\"documentSelfLink\":\"/core/node-groups/default/node9914\",\"documentUpdateTimeMicros\":1475772464026000,\"documentExpirationTimeMicros\":0}]",
    "[7][I][1475772464050][9914/core/node-groups/default][mergeRemoteAndLocalMembership][Adding new peer node9912 (http://127.0.0.1:9912/core/node-groups/default), status AVAILABLE]",
    "[8][I][1475772464073][9914/core/node-groups/default][mergeRemoteAndLocalMembership][State updated, merge with node9912, self node9914, 1475772464050004]",
    "[9][I][1475772464079][9914/core/node-selectors/default][updateCachedNodeGroupState][Node count: 2]",
    "[10][I][1475772464079][9914/core/node-selectors/default-3x][updateCachedNodeGroupState][Node count: 2]",
    "[11][W][1475772464464][9914/core/node-selectors/default-3x][lambda$null$5][Failed convergence check, will retry: Membership times not converged: {http://127.0.0.1:9912/core/node-groups/default=1475772394682005, http://127.0.0.1:9914/core/node-groups/default=1475772464050004}]",
    "[12][I][1475772469520][9914/core/node-selectors/default-3x][checkAndScheduleSynchronization][Scheduling synchronization (2 nodes)]",
    "[13][I][1475772469526][9914/core/node-selectors/default][checkAndScheduleSynchronization][Scheduling synchronization (2 nodes)]",
    "[14][I][1475772469545][9914][scheduleNodeGroupChangeMaintenance][/core/node-selectors/default-3x 1475772469545001]",
    "[15][I][1475772469569][9914][scheduleNodeGroupChangeMaintenance][/core/node-selectors/default 1475772469568001]",
    "[16][I][1475772470540][9914/core/authz/user-groups][lambda$verifyFactoryOwnership$21][node9914 elected as owner for factory /core/authz/user-groups. Starting synch ...]",
    "[17][I][1475772470540][9914/samples/cities][lambda$verifyFactoryOwnership$21][node9914 elected as owner for factory /samples/cities. Starting synch ...]",
    "[18][I][1475772470552][9914/core/authz/roles][lambda$verifyFactoryOwnership$21][node9914 elected as owner for factory /core/authz/roles. Starting synch ...]",
    "[19][I][1475772470563][9914/core/transactions][lambda$verifyFactoryOwnership$21][node9914 elected as owner for factory /core/transactions. Starting synch ...]",
    "[20][I][1475772470569][9914/core/authz/users][lambda$verifyFactoryOwnership$21][node9914 elected as owner for factory /core/authz/users. Starting synch ...]"
  ],
  "documentVersion": 0,
  "documentKind": "com:vmware:xenon:services:common:ServiceHostLogService:LogServiceState",
  "documentSelfLink": "/core/management/process-log",
  "documentUpdateTimeMicros": 0,
  "documentExpirationTimeMicros": 0
}

I tried setting the membershipQuorum when starting NodeA. After NodeA started and before NodeC joined, I could not create city1 on NodeA: the curl POST I sent to create city1 never came back with a response, even after waiting more than 3 minutes, and it didn't time out either.

I changed the "memberhsipQuorum" field to 2 in the file peer-join-default.json and used that file to join the nodes. Same result.

chaurwu commented 8 years ago

Some more information. If I comment out the following line in CityService.java:

super.toggleOption(ServiceOption.OWNER_SELECTION, true);

Then city1 is not deleted from NodeA after NodeC joins, and city1 is replicated to NodeC (listening on port 9914); I can verify that with "curl http://localhost:9914/samples/cities/city1". However, "curl http://localhost:9914/samples/cities" does not show a documentLink for city1.

georgechrysanthakopoulos commented 8 years ago

Let's create a tracker issue if a Slack discussion reveals this to be a true problem: https://www.pivotaltracker.com/n/projects/1471320