manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0

Improving the script load_us_names_min_infix_len.php #2707

Open PavelShilin89 opened 1 week ago

PavelShilin89 commented 1 week ago

Proposal:

I need help modifying the load_us_names_min_infix_len.php script, which is used in several tests. The script is located in the master branch at ./test/clt-tests/scripts/load_us_names_min_infix_len.php. The script needs the following features:

  1. Arguments are passed by name on the command line in the form --argument-name=value, e.g. --batch-size=100000 --concurrency=4 --docs=1000000.
  2. If an argument is not specified on the command line, it should be disabled by default.
  3. The position or order of arguments on the command line must not affect behavior.
  4. The script must generate identical data every time it is run with the same arguments.
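
One way to satisfy requirement 4 is to derive each document from a fixed seed combined with its ID. The sketch below is in Python purely for illustration (the actual script is PHP and its generation logic is not shown in this issue). The name lists, the `seed` parameter, and the function name `generate_docs` are all hypothetical.

```python
import random

# Hypothetical sample data; the real script's name lists are not shown here.
FIRST_NAMES = ["James", "Mary", "John", "Patricia", "Robert"]
LAST_NAMES = ["Smith", "Johnson", "Williams", "Brown", "Jones"]


def generate_docs(count, start_id=1, seed=42):
    """Return `count` (id, name) pairs that depend only on the arguments."""
    docs = []
    for doc_id in range(start_id, start_id + count):
        # A private generator per document: the content depends only on
        # (seed, doc_id), not on batch size, concurrency, or run order.
        rng = random.Random(seed * 1_000_003 + doc_id)
        name = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
        docs.append((doc_id, name))
    return docs
```

With this design, running the generator twice with the same arguments yields the same documents, and a run that starts at a later ID produces exactly the tail of a longer run.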

Checklist:

To be completed by the assignee. Check off tasks that have been completed or are not applicable.

- [ ] Implementation completed
- [ ] Tests developed
- [ ] Documentation updated
- [ ] Documentation reviewed
- [ ] [Changelog](https://docs.google.com/spreadsheets/d/1mz_3dRWKs86FjRF7EIZUziUDK_2Hvhd97G0pLpxo05s/edit?pli=1&gid=1102439133#gid=1102439133) updated
- [x] OpenAPI YAML updated and issue created to rebuild clients

sanikolaev commented 6 days ago

I've committed the updated script in https://github.com/manticoresoftware/manticoresearch/pull/2718

How it works now:

➜  manticore_github git:(master) ✗ php ./test/clt-tests/scripts/load_us_names_min_infix_len.php --help
Usage: /Users/sn/manticore_github/test/clt-tests/scripts/load_us_names_min_infix_len.php [options]
Options:
  --batch-size=<number>      Number of records per batch (default: 1000)
  --concurrency=<number>     Number of concurrent connections (default: 4)
  --docs=<number>            Total number of documents to insert (default: 1000000)
  --min-infix-len=<number>   Optional minimum infix length for table (default: none)
  --start-id=<number>        Starting ID for document insertion (default: 1)
  --drop-table               Drop and create the table before inserting data (default: true)
  --no-drop-table            Prevent the table from being dropped and created
  --help                     Show this help message
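
For readers who want to see how such an option set behaves, here is a rough Python equivalent built with `argparse`. It is only a sketch: the option names and defaults are copied from the help output above, but the actual PHP script may parse its arguments differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options listed in the help output."""
    parser = argparse.ArgumentParser(prog="load_us_names_min_infix_len")
    parser.add_argument("--batch-size", type=int, default=1000)
    parser.add_argument("--concurrency", type=int, default=4)
    parser.add_argument("--docs", type=int, default=1_000_000)
    # No default value: the option stays disabled unless given explicitly.
    parser.add_argument("--min-infix-len", type=int, default=None)
    parser.add_argument("--start-id", type=int, default=1)
    # Both flags write to the same destination; the table is dropped
    # unless --no-drop-table is passed.
    parser.add_argument("--drop-table", dest="drop_table",
                        action="store_true", default=True)
    parser.add_argument("--no-drop-table", dest="drop_table",
                        action="store_false")
    return parser
```

Because every option is named, `--docs=1000 --start-id=1001` and `--start-id=1001 --docs=1000` parse to the same result.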

Example with 1M docs, showing that the same data is loaded when the script is run a second time:

➜  manticore_github git:(master) ✗ php ./test/clt-tests/scripts/load_us_names_min_infix_len.php
preparing...
found in cache
querying...
finished inserting
Total time: 4.6767749786377
213822 docs per sec
➜  manticore_github git:(master) ✗ mysqldump -P9306 -h0 -etc manticore name|grep INSERT|md5sum
-- Warning: version string returned by server is incorrect.
-- Warning: column statistics not supported by the server.
bd1aa58895d1759750e50fe55709949e  -
➜  manticore_github git:(master) ✗ php ./test/clt-tests/scripts/load_us_names_min_infix_len.php
preparing...
found in cache
querying...
finished inserting
Total time: 8.7348871231079
114483 docs per sec
➜  manticore_github git:(master) ✗ mysqldump -P9306 -h0 -etc manticore name|grep INSERT|md5sum
-- Warning: version string returned by server is incorrect.
-- Warning: column statistics not supported by the server.
bd1aa58895d1759750e50fe55709949e  -
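
The `grep INSERT | md5sum` check above can also be done in a test harness. The following Python sketch hashes only the lines of a dump that contain `INSERT`, so two dumps compare equal when their inserted data matches, regardless of comments or warnings. The function name `insert_checksum` is hypothetical, and the dump text is assumed to be already captured as a string.

```python
import hashlib


def insert_checksum(dump_text):
    """Return the MD5 hex digest of the dump lines that contain INSERT."""
    digest = hashlib.md5()
    for line in dump_text.split("\n"):
        if "INSERT" in line:
            # grep emits each matching line followed by a newline.
            digest.update(line.encode("utf-8") + b"\n")
    return digest.hexdigest()
```

Two runs of the loader produce identical data when `insert_checksum` returns the same value for both dumps.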

Another example, demonstrating inserting more data into an existing table:

➜  manticore_github git:(master) ✗ php ./test/clt-tests/scripts/load_us_names_min_infix_len.php --batch-size=100 --concurrency=1 --docs=1000 --min_infix_len=2 --start-id=1
Table 'name' dropped and recreated.
preparing...
100%       querying...
finished inserting
Total time: 0.007519006729126
132874 docs per sec
➜  manticore_github git:(master) ✗ mysql -P9306 -h0 -e "flush ramchunk name"
➜  manticore_github git:(master) ✗ php ./test/clt-tests/scripts/load_us_names_min_infix_len.php --batch-size=100 --concurrency=1 --docs=1000 --min_infix_len=2 --start-id=1001 --no-drop-table
preparing...
100%       querying...
finished inserting
Total time: 0.0079059600830078
126376 docs per sec
➜  manticore_github git:(master) ✗ mysql -P9306 -h0 -e "flush ramchunk name"
➜  manticore_github git:(master) ✗ mysql -P9306 -h0 -e "optimize table name option sync=1, cutoff=1"
➜  manticore_github git:(master) ✗ mysqldump -P9306 -h0 -etc manticore name|grep INSERT|md5sum
-- Warning: version string returned by server is incorrect.
-- Warning: column statistics not supported by the server.
df0a65236760c48cf1d54e83929a9bf2  -

➜  manticore_github git:(master) ✗ php ./test/clt-tests/scripts/load_us_names_min_infix_len.php --batch-size=100 --concurrency=1 --docs=1000 --min_infix_len=2 --start-id=1
Table 'name' dropped and recreated.
preparing...
100%       querying...
finished inserting
Total time: 0.049274921417236
20287 docs per sec
➜  manticore_github git:(master) ✗ mysql -P9306 -h0 -e "flush ramchunk name"
➜  manticore_github git:(master) ✗ php ./test/clt-tests/scripts/load_us_names_min_infix_len.php --batch-size=100 --concurrency=1 --docs=1000 --min_infix_len=2 --start-id=1001 --no-drop-table
preparing...
100%       querying...
finished inserting
Total time: 0.0061080455780029
163502 docs per sec
➜  manticore_github git:(master) ✗ mysql -P9306 -h0 -e "flush ramchunk name"
➜  manticore_github git:(master) ✗ mysql -P9306 -h0 -e "optimize table name option sync=1, cutoff=1"
➜  manticore_github git:(master) ✗ mysqldump -P9306 -h0 -etc manticore name|grep INSERT|md5sum
-- Warning: version string returned by server is incorrect.
-- Warning: column statistics not supported by the server.
df0a65236760c48cf1d54e83929a9bf2  -
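
To make the interplay of `--docs`, `--batch-size`, and `--start-id` in these runs concrete, here is a small Python sketch of how the ID range might be split into batches. It assumes IDs are assigned sequentially from the starting ID, which the issue does not state explicitly; `batch_ranges` is a hypothetical helper, not part of the script.

```python
def batch_ranges(docs, batch_size, start_id=1):
    """Yield (first_id, last_id) for each batch, both ends inclusive."""
    end_id = start_id + docs  # exclusive upper bound
    for first in range(start_id, end_id, batch_size):
        yield first, min(first + batch_size, end_id) - 1


# First run: IDs 1..1000; second run appends IDs 1001..2000.
first_run = list(batch_ranges(docs=1000, batch_size=100, start_id=1))
second_run = list(batch_ranges(docs=1000, batch_size=100, start_id=1001))
```

Under that assumption, the second run with `--start-id=1001 --no-drop-table` continues exactly where the first run ended, without overlapping IDs.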

Note that I've removed everything related to:

Please test the updated script and let me know if there's any issue with it.