Because everybody loves test data.
es_test_data.py
lets you generate and upload randomized test data to
your ES cluster so you can start running queries, see what performance
is like, and verify your cluster is able to handle the load.
You can easily configure what the test documents look like: which data types they include and what the fields are named.
Let's assume you have an Elasticsearch cluster running.
The script requires Python and Tornado. Run
pip install tornado
to install Tornado if you don't have it already.
It's as simple as this:
$ python es_test_data.py --es_url=http://localhost:9200
[I 150604 15:43:19 es_test_data:42] Trying to create index http://localhost:9200/test_data
[I 150604 15:43:19 es_test_data:47] Guess the index exists already
[I 150604 15:43:19 es_test_data:184] Generating 10000 docs, upload batch size is 1000
[I 150604 15:43:19 es_test_data:62] Upload: OK - upload took: 25ms, total docs uploaded: 1000
[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took: 25ms, total docs uploaded: 2000
[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took: 19ms, total docs uploaded: 3000
[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took: 18ms, total docs uploaded: 4000
[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took: 27ms, total docs uploaded: 5000
[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took: 19ms, total docs uploaded: 6000
[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took: 15ms, total docs uploaded: 7000
[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took: 24ms, total docs uploaded: 8000
[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took: 32ms, total docs uploaded: 9000
[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took: 31ms, total docs uploaded: 10000
[I 150604 15:43:20 es_test_data:216] Done - total docs uploaded: 10000, took 1 seconds
[I 150604 15:43:20 es_test_data:217] Bulk upload average: 23 ms
[I 150604 15:43:20 es_test_data:218] Bulk upload median: 24 ms
[I 150604 15:43:20 es_test_data:219] Bulk upload 95th percentile: 31 ms
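As a rough illustration of what a batched bulk upload like the one in the log involves, here is a minimal sketch that splits documents into batches and serializes each batch into the newline-delimited JSON accepted by the Elasticsearch `_bulk` endpoint. The helper names `chunked` and `build_bulk_body` are hypothetical and not taken from es_test_data.py, and the script's real request body may differ (older Elasticsearch versions, for example, may also expect a type in the action line).

```python
import json


def chunked(docs, batch_size):
    """Yield successive slices of at most batch_size documents."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size]


def build_bulk_body(docs, index_name):
    """Serialize docs as newline-delimited JSON: one action line, then one source line per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk body has to end with a trailing newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample documents, split into batches of two.
docs = [{"name": "doc%d" % i, "age": i} for i in range(5)]
bodies = [build_bulk_body(batch, "test_data") for batch in chunked(docs, 2)]
```

Each entry in `bodies` would be sent as one POST request to `<es_url>/_bulk`, which is why a larger batch size means fewer but bigger requests.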
Without any command line options, it will generate and upload 1000 documents of the format
{
"name":<<str>>,
"age":<<int>>,
"last_updated":<<ts>>
}
to an Elasticsearch cluster at http://localhost:9200, into an index called test_data.
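To make the default format more concrete, here is a minimal sketch of how one such document could be generated. It uses the defaults listed for the str, int and ts field types in this README (3 to 10 characters, 0 to 100000, now +/- 30 days in milliseconds); the function names are hypothetical and the script's own generators may work differently.

```python
import random
import string
import time


def random_str(min_len=3, max_len=10):
    """Random word of upper/lowercase letters and digits."""
    alphabet = string.ascii_letters + string.digits
    length = random.randint(min_len, max_len)
    return "".join(random.choice(alphabet) for _ in range(length))


def random_int(min_val=0, max_val=100000):
    """Random integer between min_val and max_val (inclusive)."""
    return random.randint(min_val, max_val)


def random_ts(days=30):
    """Timestamp in milliseconds, picked between now +/- the given number of days."""
    now_ms = int(time.time() * 1000)
    delta_ms = days * 24 * 60 * 60 * 1000
    return now_ms + random.randint(-delta_ms, delta_ms)


def default_doc():
    """One document in the default name:str,age:int,last_updated:ts format."""
    return {
        "name": random_str(),
        "age": random_int(),
        "last_updated": random_ts(),
    }


print(default_doc())
```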
Requires Docker for running the app and Docker Compose for running a single Elasticsearch domain with two nodes (es1 and es2).
Set vm.max_map_count to at least 262144, otherwise the Elasticsearch instances will crash (see the docs):
$ sudo sysctl -w vm.max_map_count=262144
$ git clone https://github.com/oliver006/elasticsearch-test-data.git
$ cd elasticsearch-test-data
$ docker-compose up --detached
$ docker run --rm -it --network host oliver006/es-test-data \
--es_url=http://localhost:9200 \
--batch_size=10000 \
--username=elastic \
--password="esbackup-password"
$ docker-compose down --volumes
python es_test_data.py --help
gives you the full set of command line
options; here are the most important ones:
--es_url=http://localhost:9200
the base URL of your ES node, don't include the index name
--username=<username>
the username when basic auth is required
--password=<password>
the password when basic auth is required
--count=###
number of documents to generate and upload
--index_name=test_data
the name of the index to upload the data to. If it doesn't exist it'll be created with these options:
--num_of_shards=2
the number of shards for the index
--num_of_replicas=0
the number of replicas for the index
--batch_size=###
we use bulk upload to send the docs to ES, this option controls how many we send at a time
--force_init_index=False
if True it will delete and re-create the index
--dict_file=filename.dic
if provided the dict data type will use words from the dictionary file, format is one word per line. The entire file is loaded at start-up so be careful with (very) large files.
--data_file=filename.json|filename.csv
if provided all data in the filename will be inserted into ES. The file content has to be an array of JSON objects (the documents). If the file ends in .csv then the data is automatically converted into JSON and inserted as documents.
Glad you're asking, let's get to the doc format.
The doc format is configured via --format=<<FORMAT>>, with the default being
name:str,age:int,last_updated:ts
The general syntax looks like this:
<<field_name>>:<<field_type>>,<<field_name>>:<<field_type>>, ...
For every document, es_test_data.py
will generate random values for each of
the fields configured.
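As an illustration of the syntax above, here is a minimal sketch of how a format string could be split into field definitions. `parse_format` is a hypothetical helper written for this example, not the parser used by es_test_data.py; it treats everything after the field type as colon-separated arguments.

```python
def parse_format(format_str):
    """Split a format string into (field_name, field_type, args) tuples."""
    fields = []
    for item in format_str.split(","):
        item = item.strip()
        if not item:
            continue
        # The first colon separates the field name from its type definition.
        name, _, rest = item.partition(":")
        # Everything after the type (for example min and max) is kept as raw strings.
        field_type, *args = rest.split(":")
        fields.append((name, field_type, args))
    return fields


print(parse_format("name:str,age:int,last_updated:ts"))
# [('name', 'str', []), ('age', 'int', []), ('last_updated', 'ts', [])]
```

With arguments, `parse_format("age:int:18:99")` yields `[('age', 'int', ['18', '99'])]`; converting the arguments to numbers is left to whatever generates the values.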
Currently supported field types are:
bool
returns a random true or false
ts
a timestamp (in milliseconds), randomly picked between now +/- 30 days
ipv4
returns a random IPv4 address
tstxt
a timestamp in the "%Y-%m-%dT%H:%M:%S.000-0000" format, randomly picked between now +/- 30 days
int:min:max
a random integer between min and max. If min and max are not provided, they default to 0 and 100000
str:min:max
a word (as in, a string), made up of min to max random upper/lowercase and digit characters. min and max are optional, defaulting to 3 and 10
words:min:max
a random number of strs, separated by spaces. min and max are optional, defaulting to 2 and 10
dict:min:max
a random number of entries from the dictionary file, separated by spaces. min and max are optional, defaulting to 2 and 10
text:words:min:max
a random number of words, separated by spaces, picked from a given list of words separated by -. The words are optional, defaulting to text1, text2 and text3; min and max are optional, defaulting to 1 and 1
arr:[array_length_expression]:[single_element_format]
an array of entries with the format specified by single_element_format. array_length_expression can be either a single number or a pair of numbers separated by - (e.g. 3-7), defining the range of lengths from which a random length will be picked for each array (example: int_array:arr:1-5:int:1:250)
All suggestions, comments, ideas, pull requests are welcome!