leo-project / leofs

The LeoFS Storage System
https://leo-project.net/leofs/
Apache License 2.0

Fixing uneven distribution of data between nodes #846

Open vstax opened 7 years ago

vstax commented 7 years ago

When loading nodes with data (N=3, 6 nodes), the distribution of data between nodes is uneven:

[vm@bodies-master ~]$ leofs-adm du bodies05@stor05.selectel.cloud.lan
 active number of objects: 3052544
  total number of objects: 3134300
   active size of objects: 1956316031577
    total size of objects: 1956340394865
     ratio of active size: 100.0%
    last compaction start: 2017-09-24 19:42:01 +0300
      last compaction end: 2017-09-24 19:42:55 +0300

[vm@bodies-master ~]$ leofs-adm du bodies06@stor06.selectel.cloud.lan
 active number of objects: 2622980
  total number of objects: 2693670
   active size of objects: 1679815326349
    total size of objects: 1679836391969
     ratio of active size: 100.0%
    last compaction start: 2017-09-24 19:08:15 +0300
      last compaction end: 2017-09-24 19:09:02 +0300

The difference is stable: it doesn't change no matter how many objects are stored. The difference in the number of objects (or size, which amounts to the same thing here) between the 5th and 6th nodes is 16% and always remains the same.

Taking the number of objects (or size) on the 6th node as 100% usage, the six nodes sit at 113% / 111% / 103% / 112% / 116% / 100%.
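
For reference, the 16% figure is simply the ratio of object counts (or sizes) on the 5th and 6th nodes from the du output above:

[vm@bodies-master ~]$ echo "scale=3; 3052544/2622980" | bc
1.163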

The nodes are exactly the same. There are no problems with distribution between the different AVS directories:

[vm@stor05 ~]$ LANG=C df -m
Filesystem                 1M-blocks   Used Available Use% Mounted on
...
/dev/sdc2                    5671127 467079   5204049   9% /mnt/avs3
/dev/sdb4                    5670127 466677   5203451   9% /mnt/avs2
/dev/sdd2                    5671127 467052   5204076   9% /mnt/avs4
/dev/sda4                    5670127 467011   5203117   9% /mnt/avs1

Is it possible to fix this difference manually? (I understand that this will recalculate the RING and require a rebalance / compaction to be performed; that's not a problem.) I'd like to apply such settings, wipe the data and load it again. Because the difference in distribution is always stable no matter how many objects are loaded, I believe such a change will keep working in the future regardless of how many objects end up on the nodes. While a 16% difference isn't that big, it still amounts to up to ~3.6 TB of difference in data size between nodes, so I'd like to make it more even - ideally down to a few percent.

mocchira commented 7 years ago

WIP

mocchira commented 7 years ago

@vstax After discussing this within the team, we've come to the conclusions below.

  1. If you need to make the distribution more even, we'd like to propose tweaking num_of_vnodes on each leo_storage according to the trend you have observed (there are, however, some considerations to take into account beforehand; please follow the instructions below, and see the configuration sketch after this list).
  2. Changing num_of_vnodes can cause LeoFS to generate a new RING in which some records have replication destinations that don't overlap with the previous RING at all; leofs-adm recover-node can't do its job in such cases, so we have to check beforehand that this won't happen.
  3. In order to check the above and choose the optimal num_of_vnodes for your case, we'd like to test on an environment as similar to yours as possible, so could you give us the following information?
    • configuration files on all nodes
    • the total number of objects
    • a histogram of object sizes across the cluster
    • the number of objects and the disk usage stored on each leo_storage
  4. We will tell you the optimal num_of_vnodes for each leo_storage after finishing the tests in our environment.
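
As a rough sketch of what the tweak itself looks like (the numbers below are placeholders for illustration only, not the values we will recommend, and this assumes num_of_vnodes is set in leo_storage.conf alongside the other conf.d options quoted in this issue): lowering the value on a node that currently receives too much data, or raising it on an underloaded one, shrinks or grows that node's share of the hash space.

# leo_storage.conf on an overloaded node (e.g. bodies05) - placeholder value
num_of_vnodes = 150

# leo_storage.conf on an underloaded node (e.g. bodies06) - placeholder value
num_of_vnodes = 180

A new RING has to be generated and a rebalance performed after such a change, which is why the overlap check in item 2 matters.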

P.S. With the erasure coding (EC) that will ship in the 1.4 release, you can expect the uneven distribution problem to be mitigated if you set the fragment size used by EC to something smaller than 5MB (the current default for large objects).

vstax commented 7 years ago

@mocchira Thank you for looking at this.

Actually, I'm always getting somewhat uneven distributions. E.g. the experimental cluster (filled with older "dev" data twice, 2M objects):

[vm@leo-m0 ~]$ leofs-adm du storage_0@192.168.3.53
 active number of objects: 1348208
  total number of objects: 1354898
   active size of objects: 191655130794
    total size of objects: 191667623380
     ratio of active size: 99.99%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

[vm@leo-m0 ~]$ leofs-adm du storage_1@192.168.3.54
 active number of objects: 1502466
  total number of objects: 1509955
   active size of objects: 213408666322
    total size of objects: 213421617475
     ratio of active size: 99.99%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

(11% difference)

Dev cluster (same setup as production, but for the dev environment; the object size distribution is a bit different from production, leaning towards bigger objects; 1.1M objects):

[vm@bodies-m0 ~]$ leofs-adm du storage_1@192.168.3.84
 active number of objects: 703652
  total number of objects: 707144
   active size of objects: 103146068234
    total size of objects: 103147108850
     ratio of active size: 100.0%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

[vm@bodies-m0 ~]$ leofs-adm du storage_2@192.168.3.85
 active number of objects: 759210
  total number of objects: 763061
   active size of objects: 110776419412
    total size of objects: 110777567010
     ratio of active size: 100.0%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

Around 7% difference. Both of these clusters are N=2 with 3 storage nodes. It doesn't matter how many objects are uploaded - for each cluster the distribution is already fixed when only 10% of the objects are there and doesn't change from that point on.

The production cluster has N=3 and 6 storage nodes. The number of objects was 5.5M at the point when I gave you these numbers (it's about twice that now), but the difference in disk usage between nodes was the same whether it was 1M objects, 5M or 10M. With this many objects the average object size on each node is nearly the same in all cases (i.e. it doesn't matter whether nodes are compared by number of objects or by their size).

Configuration for the experimental cluster (in conf.d format, only the changes from the default config; I removed the changed directory paths as well), manager:

nodename = manager_0@192.168.3.50
manager.partner = manager_1@192.168.3.51

consistency.num_of_replicas = 2
consistency.write = 2
consistency.read = 1
consistency.delete = 1

Storage (3 nodes):

nodename = storage_0@192.168.3.53
managers = [manager_0@192.168.3.50, manager_1@192.168.3.51]

obj_containers.path = [/mnt/avs]
obj_containers.num_of_containers = [8]

object_storage.is_strict_check = true
watchdog.error.is_enabled = true
object_storage.threshold_of_slow_processing = 5000

Configuration for dev cluster, manager nodes - same as experimental cluster. Storage (3 nodes):

nodename = storage_0@192.168.3.83
managers = [manager_0@192.168.3.80, manager_1@192.168.3.81]

obj_containers.path = [/mnt/avs]
obj_containers.num_of_containers = [16]

object_storage.is_strict_check = true
watchdog.error.is_enabled = true
object_storage.threshold_of_slow_processing = 5000

Histogram for the dev cluster (the experimental cluster will be the same, just with scaled numbers). Out of 1095117 objects, 12% are less than 10K in size, 28% between 10K and 50K, 33% between 50K and 100K, 18% between 100K and 1M, 6.5% between 1M and 5M, and 2% above 5M.

Production cluster, manager nodes:

nodename = manager_0@bodies-master.selectel.cloud.lan
manager.partner = manager_1@bodies-slave.selectel.cloud.lan

consistency.num_of_replicas = 3
consistency.write = 2
consistency.read = 1
consistency.delete = 1

Storage (6 nodes):

nodename = bodies02@stor02.selectel.cloud.lan
managers = [manager_0@bodies-master.selectel.cloud.lan, manager_1@bodies-slave.selectel.cloud.lan]

obj_containers.path = [/mnt/avs1/bodies,/mnt/avs2/bodies,/mnt/avs3/bodies,/mnt/avs4/bodies]
obj_containers.num_of_containers = [64,64,64,64]

object_storage.is_strict_check = true
watchdog.error.is_enabled = true
object_storage.threshold_of_slow_processing = 5000

As for the histogram: for this upload a certain subset of production data was picked, one which heavily favors large files. So right now the histogram is: 2% of objects are less than 100K in size, 83% from 100K to 1M, 13% from 1M to 5M, and less than 2% above 5M.

For real production data the histogram will be: 47% less than 10K in size, 11% between 10K and 50K, 7% between 50K and 100K, 32% between 100K and 1M, 2% between 1M and 5M, and 0.5% above 5M. That will change somewhat once we get LeoFS fully running, as we plan to raise the limits on the object sizes we store (right now we don't store objects >20M except in special circumstances).

About tweaking num_of_vnodes, there is one thing that bothers me. Suppose the default RING for 6 nodes works out so that node A gets 15% more objects than node B for my data. Knowing that, I tweak num_of_vnodes to compensate, upload the data, and everything is even. At some point more nodes are needed, say nodes G and H are added. Now we have 8 nodes; however, if I had started from 8 nodes I would certainly have had a different distribution between nodes A and B (it could be that B had 15% more objects than A, for example). So how will it be with 8 nodes when num_of_vnodes was tweaked between A and B for the 6-node case? Will it still work as before, or do more harm than good? Also, if I tweak the values for all 6 nodes, I'll have absolutely no idea how to set them for nodes G and H other than the defaults, since I'll only be able to see the new distribution after rebalance - at which point it will be too late, since I obviously can't safely change num_of_vnodes on a working cluster with data. (That said, even if it is a problem, I still would like to tweak num_of_vnodes now.)

Regarding the EC feature, it sounds really nice, but we are definitely launching production as soon as possible (i.e. after this and some more minor issues are resolved). Also, it will probably change a lot, so it will require a lot of testing first anyway.

mocchira commented 7 years ago

@vstax Thanks for sharing the detailed info. We've started to vet it further.

Regarding the EC feature, it sounds really nice, but we are definitely launching production as soon as possible (i.e. after this and some more minor issues are resolved). Also, it will probably change a lot, so it will require a lot of testing first anyway.

Got it. Anyway, please let us know what those "some more minor issues" actually are. We will prioritize those issues and may consider shipping the next stable release if possible (as 1.4 will be a rather big release with lots of changes included, which means you'd have to do a lot of testing).

vstax commented 7 years ago

@mocchira Sorry for the wrong wording; these "minor issues" either aren't in LeoFS itself or can be ignored / worked around. E.g. I filed a ticket for #858, but it's more of a feature request (this needs to be done to provide a proper example of LB configuration in the documentation for other users), as I have a workaround for it. It would be nice to get recover-file from #845 working as well, but that can be done later too. Interestingly, I'm able to greatly reduce the errors in the first place (i.e. I got just two such errors per 5M uploaded objects, of which 100K are large files) by using 3 gateways behind a load balancer - I had around 25 errors while uploading the same amount of data through a single gateway. These errors seem to appear more often when uploading data in parallel, but less often if multiple gateways are used so each is less loaded. Or maybe there is some third factor here.

Personally, since I'm building and using my own packages, I don't care much about a release right now, more about there being no bugs (I understand that releases get more testing than the develop branch, but since I've reviewed the changes in the repo since the last release, it's all the same to me).

Currently the only blocking issues are this one and #859, which was totally unexpected and made me stop the testing process. I know I could clear those queues, but some tests I wanted to do right now (while LeoFS is working in production as a secondary source of data, not the primary one) concern how the system behaves during a plain recover-node operation, during recover-node under load after losing a drive, and during a double failure (a drive is lost, recover-node is launched, and during recovery another node is completely gone, installed anew, and a second recovery is launched as well). So if this is caused by some bug, it's pointless to do the second and third tests until it's fixed...

mocchira commented 7 years ago

@vstax Thanks for the reply. We will re-prioritize those issues based on your feedback.

mocchira commented 7 years ago

WIP

vstax commented 7 years ago

@mocchira Please take a look at this:

[vm@bodies-master ~]$ for a in `seq 10000`; do (echo -n "whereis "; head -c 56 < /dev/urandom | xxd -p -c 100) | nc -C localhost 10010 | grep stor >> whereis.txt; done
[vm@bodies-master ~]$ grep stor01 whereis.txt |wc -l
5285
[vm@bodies-master ~]$ grep stor02 whereis.txt |wc -l
5148
[vm@bodies-master ~]$ grep stor03 whereis.txt |wc -l
4605
[vm@bodies-master ~]$ grep stor04 whereis.txt |wc -l
5089
[vm@bodies-master ~]$ grep stor05 whereis.txt |wc -l
5321
[vm@bodies-master ~]$ grep stor06 whereis.txt |wc -l
4552
[vm@bodies-master ~]$ bc
bc 1.06.95
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'. 
scale=3
5321/4552
1.168

Yes, it's extremely inefficient and takes a long time to execute (maybe you can rewrite it directly around leo_manager_api:whereis?), but it shows the problem precisely. The distribution is always like this when the test is run again (with different random strings) or when the number of names is increased. If you could integrate this code into a function (input parameters: the length of each random name and the number of random names; result: the number of times each storage node appears in the whereis output), it could be used to quickly evaluate how well balanced the current RING is. A parameterized shell version of the test is sketched below.
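
To make it easier to rerun, here is a rough shell sketch of such a helper (it still goes through the manager console on localhost:10010 with nc like the test above, and it assumes the storage hostnames match stor[0-9]*; adjust the grep pattern otherwise):

#!/bin/bash
# Sketch: sample the RING via the manager console and tally hits per storage node.
# Usage: ./whereis_sample.sh <name_length_in_bytes> <number_of_names>
LEN=${1:-56}
NUM=${2:-10000}

for _ in $(seq "$NUM"); do
    # one random hex name of 2*LEN characters, queried through "whereis"
    (printf 'whereis '; head -c "$LEN" < /dev/urandom | xxd -p -c 256) \
        | nc -C localhost 10010
done | grep -o 'stor[0-9]*' | sort | uniq -c | sort -rn

Running it as ./whereis_sample.sh 56 10000 should give counts in the same ballpark as above, and the max/min ratio can then be computed with bc as before.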

I've tried with 20-byte random strings as well (exactly the same distribution) and with more realistic names (to check whether the problem only exists for longer random strings), like

[vm@bodies-master ~]$ for a in `sort -R /usr/share/dict/words | head -n 10000`; do echo "whereis $a"| nc -C localhost 10010 | grep stor >> whereis4.txt; done
[vm@bodies-master ~]$ grep stor01 whereis4.txt |wc -l
5134
[vm@bodies-master ~]$ grep stor02 whereis4.txt |wc -l
5205
[vm@bodies-master ~]$ grep stor03 whereis4.txt |wc -l
4675
[vm@bodies-master ~]$ grep stor04 whereis4.txt |wc -l
5157
[vm@bodies-master ~]$ grep stor05 whereis4.txt |wc -l
5323
[vm@bodies-master ~]$ grep stor06 whereis4.txt |wc -l
4506

but the distribution is always the same. The fact that I used only alphanumeric characters for these tests shouldn't matter for a correct hash function - and since there seems to be no difference in distribution between short real words, longer random strings and even very long random strings, that part definitely works correctly.

mocchira commented 7 years ago

@vstax Thanks for sharing your experiment. Actually, I have been doing the same kind of tests and now suspect there may be something wrong in the RING distribution. I'm still not sure, but I've found a few things that may affect the distribution, so now I'm writing a script to check the actual range (in the 128-bit hash space) assigned to each node and detect which part may be wrong.

mocchira commented 7 years ago

@vstax Can you share the ring_cur_worker.log.xxxx from the cluster where you ran the experiments? If my guess is correct, the first record of vnodeid_nodes stores the 3 nodes below.

The reason this causes uneven distribution is that there is a special range where ID < the first vnode_id or the last vnode_id < ID; if the computed hash falls into this range, the corresponding object is regarded as belonging to the first record of vnodeid_nodes, so those nodes are inclined to store more records than the others.
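
As a toy illustration of that lookup rule (nothing LeoFS-specific, just a successor-with-wrap-around search over small integer vnode IDs), note that a key hashing below the first vnode ID and a key hashing above the last one both land on the first vnode, i.e. on the first record's nodes:

# sorted vnode IDs: 5 12 27 40; try a key below the first ID and one above the last
for key in 3 45; do
  echo "5 12 27 40" | awk -v k="$key" '{
    n = split($0, v, " ")
    for (i = 1; i <= n; i++) if (k <= v[i]) { print "key", k, "-> vnode", v[i]; exit }
    print "key", k, "-> vnode", v[1], "(wrapped around)"
  }'
done
# prints: key 3 -> vnode 5
#         key 45 -> vnode 5 (wrapped around)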

vstax commented 7 years ago

@mocchira The first record stores the following nodes:

    [{vnodeid_nodes,1,0,192100458540557104887353513768385232,
         [{redundant_node,'bodies02@stor02.selectel.cloud.lan',true,
              true,undefined},
          {redundant_node,'bodies03@stor03.selectel.cloud.lan',true,
              true,undefined},
          {redundant_node,'bodies06@stor06.selectel.cloud.lan',true,
              true,undefined}]},

Full version attached: ring_cur_worker.log.63674417934.gz. Dump: ring_cur.dump.63674418994.txt

yosukehara commented 6 years ago

LeoFS' redundant-manager randomly assigns virtual nodes on the RING (2^128). Therefore, some bias is inevitable. I have been considering designing a virtual-node rebalance feature to fix this issue in the near future, in v1.4 or later.

mocchira commented 6 years ago

As the virtual-node rebalance feature will depend heavily on leo_konsul (the leo_redundant_manager replacement), and leo_konsul will involve lots of code changes that may impact LeoFS quality, we will postpone this issue to 2.0.0.

vstax commented 6 years ago

This kind of turned into a discussion about how to fix uneven distribution after it has happened - but how about just making sure it doesn't happen in the first place?

For example, when creating a new cluster, the first RING generated had a really uneven distribution (I checked by supplying ~2000 strings to the "whereis" command). So I generated it again and again, and the third one was much more even than the first. Maybe - if changing the generation algorithm to fix the problem is too complicated - it's possible to just generate the RING multiple times? Unfortunately, the external "whereis" command is very slow and what I did took minutes, but maybe it would be acceptable performance-wise if done internally.

Like, pass a number of tries to RING generation, generate it that many times and pick the most even one? :) Yes, it seems pretty hackish, but this is a one-time problem anyway (or a "few times" problem, counting rebalances), so maybe it would be the simplest solution. Some fix for generation is needed in any case: even if virtual-node rebalance is implemented, it would be no good if the distribution is still uneven after the first generation and before a rebalance. A rough sketch of the scoring step I used is below.
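
For reference, this is roughly how each candidate RING can be scored from a whereis sample file in the same format as whereis.txt above (the grep pattern again assumes hostnames matching stor[0-9]*):

# count hits per node and report the max/min spread of the sample
grep -o 'stor[0-9]*' whereis.txt | sort | uniq -c | sort -n \
  | awk '{ print; if (NR == 1) min = $1; max = $1 }
         END { printf "max/min spread: %.3f\n", max / min }'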

mocchira commented 6 years ago

@vstax Thanks for pointing out this issue again. WIP; however, we are going to provide some kind of solution for 1.4.x as well, and will also conduct some experiments to validate whether or not the degree of the current uneven distribution is expected, given the various factors that can affect it (file name, num_of_vnodes, number of storage nodes, replication factor).