Tests do not accurately benchmark targets

jhp612 commented 9 years ago

It is strongly recommended that some logic be added to further enhance the test data being generated and put into the object store. My requests for enhancements are as follows, with reasoning provided:

1) Object names should be randomized in their entirety. -> S3, for example, uses the object name to hash where the object will be placed internally. If all objects start with "Object", they all hash to the same place and you are limiting the performance of the service to a fraction of what it can do. As such, it is an AWS best practice to make sure the beginning of the object name is randomized or serialized to force

2) A path should be randomized and provided for the object (e.g. /somedir/somedir/objectname). -> Other object stores use the path or the "key" of the object to generate a hash, similar to how S3 uses the key name to generate the hash. Because COSBench places all objects in the root directory of the bucket, we are again limiting the performance available in the system.

3) A parameter should also be added to choose how many objects should exist in a path. -> Combining points 1 and 2, we would need this parameter to adequately size the distribution and to most closely simulate the use case.

ywang19 commented 9 years ago

Hi jhp612,

Thanks for your requests. I'm not sure I completely understood them; my comments are below:

1) Object names should be randomized in their entirety. -> S3, for example, uses the object name to hash where the object will be placed internally. If all objects start with "Object", they all hash to the same place and you are limiting the performance of the service to a fraction of what it can do. As such, it is an AWS best practice to make sure the beginning of the object name is randomized or serialized to force

--> I don't know S3 internals, so I'm not sure whether S3 has object affinity when hashing. What I know from Swift is that similar object names like Object1 and Object2 are not hashed to the same place. Generally, one effect of hashing is that a minor difference in input produces a major difference in output.

2) A path should be randomized and provided for the object (e.g. /somedir/somedir/objectname). -> Other object stores use the path or the "key" of the object to generate a hash, similar to how S3 uses the key name to generate the hash. Because COSBench places all objects in the root directory of the bucket, we are again limiting the performance available in the system.

--> Normally, the naming convention of the path or object shouldn't limit performance if the number of objects is large enough. Previously, I received a question saying that inserting sequential numeric strings versus random strings gives different performance on SQLite, but my test with 1M records didn't show a relevant difference.

3) A parameter should also be added to choose how many objects should exist in a path. -> Combining points 1 and 2, we would need this parameter to adequately size the distribution and to most closely simulate the use case.

--> So far, COSBench uses a stage to pre-fill the storage cluster, where the user can define how many objects will be created in advance.
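
The hashing point in the reply to 1) can be checked with a short sketch. It assumes MD5 as a stand-in hash and a hypothetical `part_power` of 10 (1024 partitions); neither value comes from S3 or from this thread, and a real store may hash more than the bare name, so this only illustrates why near-identical names such as Object1 and Object2 need not land in the same place.

```python
import hashlib


def md5_int(name: str) -> int:
    """Return the 128-bit MD5 digest of `name` as an integer."""
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest(), "big")


def differing_bits(a: str, b: str) -> int:
    """Count the bit positions in which the two digests differ."""
    return bin(md5_int(a) ^ md5_int(b)).count("1")


def partition(name: str, part_power: int = 10) -> int:
    """Map a name to one of 2**part_power partitions (assumed scheme:
    take the top `part_power` bits of the digest)."""
    return md5_int(name) >> (128 - part_power)


if __name__ == "__main__":
    for i in range(1, 6):
        name = f"Object{i}"
        print(name, "-> partition", partition(name))
    print("bits changed between Object1 and Object2:",
          differing_bits("Object1", "Object2"))
```

Whether a given store behaves this way depends on what it hashes and how it maps the hash to partitions, which is exactly the open question in this thread.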

jhp612 commented 9 years ago

1) In regard to S3, please see https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/

You can see that for best S3 performance you must start your key or path names with different characters so they hash to different partitions.

2) Naming convention can limit performance. Take Hitachi Content Platform, for example. The hash of the file path determines which region an object is hashed to. As such, if all objects are put on the root path of the bucket or namespace, they all hash to the same region and you lose the benefits of the distributed architecture.

3) Basically I think COSBench should be able to generate objects with the following key values:

\RandomString + DirectoryName + RandomString\ RandomString + ObjectName + RandomString

I know there are currently options for the object to have a suffix and prefix appended to it. I think they should have the ability to be randomized. Also, there is currently no support for "directories" in the key name, which I believe should be there.
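
A minimal sketch of the key layout requested in 3), in plain Python rather than COSBench code. The function names, the `/` separator, the default `dir` and `object` labels, and the 4-character random length are all assumptions made for illustration.

```python
import random
import string

ALPHABET = string.ascii_lowercase + string.digits


def random_string(rng: random.Random, length: int = 4) -> str:
    """Return `length` random characters drawn from ALPHABET."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def make_key(rng: random.Random, directory: str = "dir",
             name: str = "object", rand_len: int = 4) -> str:
    """Build RandomString + DirectoryName + RandomString, a separator,
    then RandomString + ObjectName + RandomString."""
    dir_part = (random_string(rng, rand_len) + directory
                + random_string(rng, rand_len))
    obj_part = (random_string(rng, rand_len) + name
                + random_string(rng, rand_len))
    return f"{dir_part}/{obj_part}"  # "/" separator is an assumption


if __name__ == "__main__":
    rng = random.Random()  # pass a seed for reproducible key sets
    for _ in range(3):
        print(make_key(rng))
```

Because both the directory part and the object part begin with random characters, the leading characters of the key vary from object to object, which is the property the S3 guidance above asks for.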

jhp612 commented 9 years ago

I forgot to add that, in relation to endpoints, the code currently assumes that nodes are behind a load balancer, which may not always be the case. In the case of HCP, all nodes may simply be behind DNS and rely on round-robin DNS, or expect the application to connect to each node explicitly.

ywang19 commented 9 years ago

Understood. About randomizing the naming scheme, one quick solution that comes to mind is to hash container names and object names.
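
As a rough illustration of that idea (not how COSBench implements or would implement it), the sketch below prepends part of an MD5 digest to each name. The helper name `hashed_name`, the 8-character prefix, and the `-` separator are assumptions; keeping the original name after the prefix is a design choice so the object stays recognizable.

```python
import hashlib


def hashed_name(name: str, prefix_len: int = 8) -> str:
    """Prefix `name` with the first `prefix_len` hex characters of its
    MD5 digest so that sequential names start with varied characters."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{name}"


if __name__ == "__main__":
    for i in range(1, 4):
        print(hashed_name(f"MyObject{i}"))
```

The prefix is deterministic, so a reader that knows the original name can recompute the stored key without a lookup table.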

About "directories": yes, so far it's a two-level hierarchy (container\object). Some storage systems support directories, but normally directories are treated as part of object names, so the current "oprefix" should support it.

About endpoints: COSBench can be used without a load balancer. I tested a Swift cluster with 3 proxy nodes and launched 3 COSBench drivers, each interacting with one proxy server separately. Actually, it's still feasible to use just one COSBench driver to test 3 proxy servers: define 3 sections, each using a different endpoint URL for one of the 3 proxy servers. Of course, it's not supported if you want one worker to talk to those 3 servers in round-robin mode.
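
Both layouts amount to a fixed mapping from drivers (or sections) to endpoints. The sketch below shows that mapping in plain Python; it is not COSBench configuration, and the section names and URLs are hypothetical.

```python
from itertools import cycle
from typing import Dict, List


def assign_endpoints(sections: List[str], endpoints: List[str]) -> Dict[str, str]:
    """Pin each section to one endpoint, wrapping around if there are
    more sections than endpoints. The assignment is static: a section
    never switches endpoints while it runs."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    return dict(zip(sections, cycle(endpoints)))


if __name__ == "__main__":
    mapping = assign_endpoints(
        ["section1", "section2", "section3"],          # hypothetical names
        ["http://proxy1.example:8080",                 # hypothetical URLs
         "http://proxy2.example:8080",
         "http://proxy3.example:8080"],
    )
    for section, url in mapping.items():
        print(section, "->", url)
```

This differs from per-request round robin, where a single worker would rotate through all endpoints; that is the mode described above as unsupported.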

jhp612 commented 9 years ago

1) Hashing container and object names may work (depending on the hash algorithm applied).

2) oprefix will not suffice since it is static. Every object would be put in the same "directory" and therefore end up in the same region/partition. I have tried this already :) oprefix would need an option to generate a random string before the "/" character.

3) I will try defining a driver to hit 3 nodes in my testing with HCP. If I understand you correctly, this should be a viable solution :)

jhp612 commented 9 years ago

I also forgot to mention that it would be good to be able to define in the config how many random objects to put in each randomized directory created.
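
A minimal sketch of what such a setting could control, assuming a hypothetical `objects_per_dir` parameter (no such COSBench option is confirmed in this thread). A new random directory name is drawn each time the current one is full.

```python
import random
import string
from typing import List, Optional

ALPHABET = string.ascii_lowercase + string.digits


def keys_with_directories(total_objects: int, objects_per_dir: int,
                          seed: Optional[int] = None,
                          dir_len: int = 8) -> List[str]:
    """Return `total_objects` keys of the form <random dir>/object<N>,
    placing at most `objects_per_dir` objects in each directory."""
    if objects_per_dir < 1:
        raise ValueError("objects_per_dir must be at least 1")
    rng = random.Random(seed)
    keys = []
    directory = ""
    for i in range(total_objects):
        if i % objects_per_dir == 0:
            # Current directory is full (or this is the first object).
            directory = "".join(rng.choice(ALPHABET) for _ in range(dir_len))
        keys.append(f"{directory}/object{i + 1}")
    return keys


if __name__ == "__main__":
    for key in keys_with_directories(total_objects=6, objects_per_dir=2):
        print(key)
```

With `total_objects=6` and `objects_per_dir=2` this yields three directories holding two objects each; the last directory is simply left partly filled when the total is not a multiple of the per-directory count.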

ywang19 commented 9 years ago

For 2), since hashing will be applied to the object name, and object name = oprefix + "MyObject" + sequence, it's possible to randomize the directory as well. For 3), conf/ampli-config-sample.xml shows how to talk to different servers from one driver, for your reference :-).

As mentioned, pre-filling objects is done in a stage, as follows:

Here you can define how many objects are to be created in each container with the "objects" parameter.

jhp612 commented 9 years ago

If you specify multiple drivers (one per node), will they run in parallel or in serial?

ywang19 commented 9 years ago

In parallel.

intel-cloud / cosbench

Tests do not accurately benchmark targets #237