Closed: authwork closed this issue 4 years ago
I also encountered this issue: when reading the value of the same key, the client can sometimes read the value and sometimes cannot.
None of these issues appear in Anna's local mode, so I guess they may be caused by my configuration of Anna's cluster mode. Would you mind giving a correct example of a cluster configuration, e.g. 4 nodes with 3 in-memory replicas? @vsreekanti Many thanks
Hm, this is a surprising error. It isn't something I've seen before. Is this error occurring for disk nodes or memory nodes? It's unclear what the file descriptors are being created for, unless you're using a very large number of clients.
I am running a cluster of 8 server nodes with 64 clients accessing those replicas. I only use memory nodes. The number of replicas is set to 8.
@vsreekanti Hello, I have evaluated the system on a cluster of 1 server node and 16 clients on different nodes (with the replication factor set to 1). It works normally. A cluster of 1 server node and 32 clients on different nodes also works normally. Given that, I think 8 server nodes should be enough to serve 64 clients, yet the issue remains.
How should we configure the `seed_ip`? Currently, I configure all server nodes to share the same `seed_ip`.
Sharing a single seed IP is okay. It sounds like the `ulimit` for open FDs on the machine you are running on is too low. Can you try increasing the `ulimit`? My hunch is that all 64 clients are talking to all nodes, and as a result the process is running into the limit on the number of allowed open sockets. That's why using only 16 clients works -- the aggregate number of clients is lower.
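For reference, this is how you can inspect and raise the open-file limit on Linux. The value 65536 and the `limits.conf` entries are illustrative; pick numbers appropriate for your machine:

```shell
# Current soft limit on open file descriptors for this shell
ulimit -n
# Hard limit: the ceiling a non-root user can raise the soft limit to
ulimit -Hn
# Raise the soft limit for this shell and the processes it launches
# (e.g. before starting the Anna server or the benchmark clients)
ulimit -n 65536 || echo "requested limit exceeds the hard limit; raise it as root first"
# To persist across logins, add entries to /etc/security/limits.conf:
#   *  soft  nofile  65536
#   *  hard  nofile  65536
```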
@vsreekanti Many thanks for your help.
Hello! Increasing the `ulimit` (see here) solved the "Too many files" issue.
Now, the cluster of 1 server node and 64 clients works normally.
However, the cluster of 8 server nodes and 64 clients still has some issues, and I am a little confused:
Is it necessary to run the monitors in Anna's cluster mode? In my previous experiments (the cluster of 1 server node and 64 clients), we did not run the monitors and everything worked normally. I am not sure whether the monitors can also be skipped in the cluster of 8 server nodes and 64 clients.
What is the difference between the `mgmt_ip` of the monitoring node ("127.0.0.1" by default) and the `mgmt_ip` of the server ("NULL" by default)?
What is the difference between public IPs and private IPs?
Currently, I guess that it is not necessary to run the monitors, and I set all monitor-related fields to "127.0.0.1". As a first step, I configure 8 replicas on 8 server nodes.
My configuration files are shown below (`10.2.x.x` is the public IP and `10.4.x.x` is the private IP):
Would you please give some suggestions?
Update: when I reduce the number of replicas to 4, the cluster of 8 server nodes and 64 clients seems to work normally.
================================================== When I run the benchmark on it, I find a new issue. Assume we have four replicas R1, R2, R3, R4. At the very beginning, a client PUTs K1 to R1 and reads from R2 right away. Since the key-value pair has not yet been transferred from R1 to R2, this client may not be able to read it.
================================================== When I run the benchmark, I find that every operation looks like:
```cpp
client.put_async(key, serialize(val), LatticeType::LWW);
receive(&client);
client.get_async(key);
receive(&client);
```
Do we need to call `receive(&client)` after each PUT/GET operation? (Just want to be sure)

Regarding running the monitoring node, it should not be necessary. However, the system will then not increase/decrease the number of replicas of each key in response to load changes, if that is a feature you would like.
You can ignore the `mgmt_ip` as well -- it's used for Kubernetes autoscaling.
Public IPs and private IPs are used when running in VPCs (e.g., for EC2). All internal communication is done on private IPs, and request handling is done on public IPs. If you are not running in a VPC or don't need KVS access outside the VPC, you can just use the same IP address for both.
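To make this split concrete, a hypothetical server-section excerpt for the setup described above (shared seed, separate public/private addresses, RF=4) might look like the following. The key names mirror the structure of `conf/anna-config.yml` in the Anna repository, but verify them against your checkout; the addresses are placeholders:

```yaml
server:
  seed_ip: 10.2.0.1      # every server node points at the same seed
  public_ip: 10.2.0.5    # client-facing address (request handling)
  private_ip: 10.4.0.5   # internal address (node-to-node communication)
  mgmt_ip: "NULL"        # only needed for Kubernetes autoscaling
replication:
  memory: 4              # in-memory replication factor
```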
Regarding R2 not being able to read the KV pair, this is expected behavior. That is the nature of the coordination-freeness of the system: you may read stale values (including `NULL` as a value). If you want more deterministic behavior, you can try making a client sticky to a single replica for a particular key by changing the KVS client, but that is not something we currently support out of the box.
You can make multiple requests and then call `receive` if that workflow is more convenient for you.
What do you mean by the speed of a batch of operations? Are you seeing that batches of operations are particularly slow?
`receive` seems to work like a pipeline. Many thanks, I will experiment more.
@vsreekanti In my current cluster configuration (shown above), both `PUT` and `GET` operations cause long delays. (RF=4 seems to work normally, while long delays appear with RF=8.)
I encountered the following issue: