manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0
9.07k stars 509 forks source link

Significant JSON overhead compared to SQL #1794

Closed sanikolaev closed 8 months ago

sanikolaev commented 9 months ago

In some case the same query which takes about 80ms via SQL can take 250ms via JSON.

MRE (dev2):

wget https://raw.githubusercontent.com/zeraye/names-surnames-list/master/male-names-list.txt -O names.txt
wget https://raw.githubusercontent.com/zeraye/names-surnames-list/master/female-names-list.txt -O fnames.txt
cat fnames.txt >> names.txt
wget https://raw.githubusercontent.com/zeraye/names-surnames-list/master/surnames-list.txt -O surnames.txt
wget https://gist.githubusercontent.com/sanikolaev/5f4cea361b6869ee9ca7eb815be4fb76/raw/15655adc2dcb3a0c7680d07127fa50a3528a8d73/load_us_names.php -O load_us_names.php
php load_us_names.php 10000 10 10000000 1

snikolaev@dev2:~$ time curl -sX POST http://localhost:9308/search -d '{"index" : "name", "query" : {"match" : {"_all": "on"} }, "limit": 0}'
{"took":245,"timed_out":false,"hits":{"total":1719457,"total_relation":"eq","hits":[]}}
real    0m0.250s
user    0m0.003s
sys 0m0.003s

/* Mon Feb 05 10:48:48.253 2024 conn 2087997 real 0.244 wall 0.245 found 1719457 */  /*{"index" : "name", "query" : {"match" : {"_all": "on"} }, "limit": 0} */

snikolaev@dev2:~$ time mysql -P9306 -h0 -e "select * from name where match('on'); show meta;"
+---------+-------------------+
| id      | username          |
+---------+-------------------+
| 9750002 | LAVONIA HAIRSTON  |
| 9160015 | CLAYTON STOCKTON  |
| 9930016 | WILSON EVERSON    |
| 9930026 | WINSTON CINTRON   |
| 9880033 | DESMOND HARDISON  |
| 9660043 | DIONE JOHNSTON    |
| 9080055 | EDISON MCKINNON   |
| 9810061 | SON NIELSON       |
| 9750062 | SIMONNE LONG      |
| 9880068 | SONA ALSTON       |
| 9840085 | LEONOR DRAYTON    |
| 9260098 | LONA LONGORIA     |
| 9420101 | MARION CLIFTON    |
| 9580103 | RONDA DRUMMOND    |
| 9390104 | ASUNCION XIONG    |
| 9420113 | CLINTON PATTERSON |
| 9340117 | LAVONA DESIMONE   |
| 9530126 | ALISON SON        |
| 9390130 | JUDSON CONNELLY   |
| 9080134 | QUINTON ARAGON    |
+---------+-------------------+
+----------------+---------+
| Variable_name  | Value   |
+----------------+---------+
| total          | 20      |
| total_found    | 1719457 |
| total_relation | eq      |
| time           | 0.075   |
| keyword[0]     | *on*    |
| docs[0]        | 1794910 |
| hits[0]        | 1794951 |
| keyword[1]     | on      |
| docs[1]        | 0       |
| hits[1]        | 0       |
+----------------+---------+

real    0m0.082s
user    0m0.000s
sys 0m0.005s

/* Mon Feb 05 10:48:05.657 2024 conn 2087618 real 0.074 wall 0.075 found 1719457 */ SELECT * FROM name WHERE MATCH('on');

When there's only one disk chunk the difference is even larger (the likely related issue is https://github.com/manticoresoftware/manticoresearch/issues/1767):

snikolaev@dev2:~$ mysql -P9306 -h0 -e "optimize table name option sync=1, cutoff=1;"

snikolaev@dev2:~$ time curl -sX POST http://localhost:9308/search -d '{"index" : "name", "query" : {"match" : {"_all": "on"} }, "limit": 0}'
{"took":583,"timed_out":false,"hits":{"total":1717748,"total_relation":"eq","hits":[]}}
real    0m0.589s
user    0m0.000s
sys 0m0.006s

snikolaev@dev2:~$ time mysql -P9306 -h0 -e "select * from name where match('on') limit 0; show meta;"
+----------------+---------+
| Variable_name  | Value   |
+----------------+---------+
| total          | 0       |
| total_found    | 1717748 |
| total_relation | eq      |
| time           | 0.088   |
| keyword[0]     | *on*    |
| docs[0]        | 1793344 |
| hits[0]        | 1793383 |
| keyword[1]     | on      |
| docs[1]        | 0       |
| hits[1]        | 0       |
+----------------+---------+

real    0m0.094s
user    0m0.000s
sys 0m0.006s

Version:

snikolaev@dev2:~$ searchd -v
Manticore 6.2.13 4950f3d02@24020213 dev (columnar 2.2.5 214ce90@240115) (secondary 2.2.5 214ce90@240115) (knn 2.2.5 214ce90@240115)
sanikolaev commented 9 months ago

Same in manticoresearch/manticore:4.0.2, i.e. not a new regression. Didn't check earlier versions.

sanikolaev commented 9 months ago

Interesting that if I disable auto-expansion, then in case of the on* query, the response time is the same:

snikolaev@dev2:~$ time curl -sX POST http://localhost:9308/search -d '{"index" : "name", "query" : {"match" : {"_all": "on*"} }}'
{"took":2,"timed_out":false,"hits":{"total":17482,"total_relation":"eq","hits":[{"_id":"1559568","_score":1670,"_source":{"username":"ONIE ONEAL"}},{"_id":"8901204","_score":1660,"_source":{"username":"ONEIDA ONG"}},{"_id":"9558471","_score":1660,"_source":{"username":"ONA ONEAL"}},{"_id":"860282","_score":1624,"_source":{"username":"DENAE ONEIL"}},{"_id":"860385","_score":1624,"_source":{"username":"ONITA TIERNEY"}},{"_id":"60418","_score":1624,"_source":{"username":"ONIE GUFFEY"}},{"_id":"860591","_score":1624,"_source":{"username":"MICHA ONEIL"}},{"_id":"860642","_score":1624,"_source":{"username":"ONITA CLEGG"}},{"_id":"860969","_score":1624,"_source":{"username":"KAROLE ONEILL"}},{"_id":"861828","_score":1624,"_source":{"username":"PHILLIP ONEAL"}},{"_id":"62159","_score":1624,"_source":{"username":"ONEIDA BRILL"}},{"_id":"862854","_score":1624,"_source":{"username":"ONITA OLMSTEAD"}},{"_id":"63206","_score":1624,"_source":{"username":"EULAH ONTIVEROS"}},{"_id":"63252","_score":1624,"_source":{"username":"RAUL ONEILL"}},{"_id":"863311","_score":1624,"_source":{"username":"ONA COBB"}},{"_id":"63396","_score":1624,"_source":{"username":"JESUS ONG"}},{"_id":"63751","_score":1624,"_source":{"username":"LUCIANA ONEAL"}},{"_id":"863775","_score":1624,"_source":{"username":"ONEIDA ROARK"}},{"_id":"864149","_score":1624,"_source":{"username":"KEELEY ONEAL"}},{"_id":"864572","_score":1624,"_source":{"username":"ONIE PUCKETT"}}]}}
real    0m0.008s
user    0m0.000s
sys 0m0.006s

snikolaev@dev2:~$ time mysql -P9306 -h0 -e "select * from name where match('on*'); show meta;"
+---------+-----------------+
| id      | username        |
+---------+-----------------+
| 1559568 | ONIE ONEAL      |
| 8901204 | ONEIDA ONG      |
| 9558471 | ONA ONEAL       |
|  860282 | DENAE ONEIL     |
|  860385 | ONITA TIERNEY   |
|   60418 | ONIE GUFFEY     |
|  860591 | MICHA ONEIL     |
|  860642 | ONITA CLEGG     |
|  860969 | KAROLE ONEILL   |
|  861828 | PHILLIP ONEAL   |
|   62159 | ONEIDA BRILL    |
|  862854 | ONITA OLMSTEAD  |
|   63206 | EULAH ONTIVEROS |
|   63252 | RAUL ONEILL     |
|  863311 | ONA COBB        |
|   63396 | JESUS ONG       |
|   63751 | LUCIANA ONEAL   |
|  863775 | ONEIDA ROARK    |
|  864149 | KEELEY ONEAL    |
|  864572 | ONIE PUCKETT    |
+---------+-----------------+
+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| total          | 20    |
| total_found    | 17482 |
| total_relation | eq    |
| time           | 0.002 |
| keyword[0]     | on*   |
| docs[0]        | 17493 |
| hits[0]        | 17493 |
+----------------+-------+

real    0m0.008s
user    0m0.006s
sys 0m0.000s

but *on* is still slow.

So it looks like the issue has smth do with infix search.

tomatolog commented 9 months ago

could be different default values for query via SQL and JSON endpoints

sanikolaev commented 9 months ago

Another example with a different keyword which makes the difference much bigger: 49 sec vs 671 ms

wget https://raw.githubusercontent.com/Raven-SL/ru-pnames-list/master/lists/male_names_rus.txt
wget https://raw.githubusercontent.com/Raven-SL/ru-pnames-list/master/lists/male_surnames_rus.txt
wget https://gist.githubusercontent.com/sanikolaev/5f4cea361b6869ee9ca7eb815be4fb76/raw/c203ad550ede2a0e1b5dab19a780e833b2367827/load_names.php -O load_names.php
php load_names.php 10000 10 10000000 1
mysql -P9306 -h0 -e "optimize table name option sync=1, cutoff=1;"
echo 'eyJpbmRleCIgOiAibmFtZSIsICJxdWVyeSIgOiB7Im1hdGNoIiA6IHsiX2FsbCI6ICLQvtCyIn19
fQo='|base64 -d > /tmp/query

snikolaev@dev2:~$ curl -vsX POST http://localhost:9308/search -d @/tmp/query
*   Trying 127.0.0.1:9308...
* Connected to localhost (127.0.0.1) port 9308 (#0)
> POST /search HTTP/1.1
> Host: localhost:9308
> User-Agent: curl/7.81.0
> Accept: */*
> Content-Length: 58
> Content-Type: application/x-www-form-urlencoded
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: 6.2.13 4950f3d02@24020213 dev (columnar 2.2.5 214ce90@240115) (secondary 2.2.5 214ce90@240115) (knn 2.2.5 214ce90@240115) (buddy v2.1.3)
< Content-Type: application/json; charset=UTF-8
< Content-Length: 1844
<
{"took":49781,"timed_out":false,"hits":{"total":4054337,"total_relation":"eq","hits":[{"_id":"9740003","_score":1539,"_source":{"username":"Джованни Галаганов"}},{"_id":"9960072","_score":1539,"_source":{"username":"Овадья Циолковский"}},{"_id":"9610093","_score":1539,"_source":{"username":"Овадья Недобитов"}},{"_id":"9700096","_score":1539,"_source":{"username":"Джованни Вислоухов"}},{"_id":"9900125","_score":1539,"_source":{"username":"Джованни Басунов"}},{"_id":"9880131","_score":1539,"_source":{"username":"Иов Хухряков"}},{"_id":"9980148","_score":1539,"_source":{"username":"Иов Сафьянов"}},{"_id":"9900166","_score":1539,"_source":{"username":"Людовик Калиновский"}},{"_id":"9610185","_score":1539,"_source":{"username":"Яков Сабашников"}},{"_id":"9930195","_score":1539,"_source":{"username":"Овадья Осонов"}},{"_id":"9660208","_score":1539,"_source":{"username":"* Connection #0 to host localhost left intact
Иов Безруков"}},{"_id":"9580212","_score":1539,"_source":{"username":"Джованни Аббакумов"}},{"_id":"9750238","_score":1539,"_source":{"username":"Людовик Пешников"}},{"_id":"9780246","_score":1539,"_source":{"username":"Яков Флегентов"}},{"_id":"9930279","_score":1539,"_source":{"username":"Джованни Филипович"}},{"_id":"9900294","_score":1539,"_source":{"username":"Яков Грабаров"}},{"_id":"9930296","_score":1539,"_source":{"username":"Яков Кованько"}},{"_id":"9750298","_score":1539,"_source":{"username":"Иов Годовалов"}},{"_id":"9600301","_score":1539,"_source":{"username":"Овадья Пестряков"}},{"_id":"9880310","_score":1539,"_source":{"username":"Джованни Анцифиров"}}]}}

snikolaev@dev2:~$ time mysql -P9306 -h0 -e "select * from name where match('ов'); show meta;"
+---------+---------------------------------------+
| id      | username                              |
+---------+---------------------------------------+
| 9740003 | Джованни Галаганов                    |
| 9960072 | Овадья Циолковский                    |
| 9610093 | Овадья Недобитов                      |
| 9700096 | Джованни Вислоухов                    |
| 9900125 | Джованни Басунов                      |
| 9880131 | Иов Хухряков                          |
| 9980148 | Иов Сафьянов                          |
| 9900166 | Людовик Калиновский                   |
| 9610185 | Яков Сабашников                       |
| 9930195 | Овадья Осонов                         |
| 9660208 | Иов Безруков                          |
| 9580212 | Джованни Аббакумов                    |
| 9750238 | Людовик Пешников                      |
| 9780246 | Яков Флегентов                        |
| 9930279 | Джованни Филипович                    |
| 9900294 | Яков Грабаров                         |
| 9930296 | Яков Кованько                         |
| 9750298 | Иов Годовалов                         |
| 9600301 | Овадья Пестряков                      |
| 9880310 | Джованни Анцифиров                    |
+---------+---------------------------------------+
+----------------+---------+
| Variable_name  | Value   |
+----------------+---------+
| total          | 20      |
| total_found    | 4054337 |
| total_relation | eq      |
| time           | 0.671   |
| keyword[0]     | *ов*    |
| docs[0]        | 4081690 |
| hits[0]        | 4081690 |
| keyword[1]     | ов      |
| docs[1]        | 0       |
| hits[1]        | 0       |
+----------------+---------+

real    0m0.689s
user    0m0.000s
sys 0m0.009s
klirichek commented 8 months ago

That is dupe of https://sphinxsearch.com/bugs/view.php?id=1253 (appliable to http queries)

sanikolaev commented 8 months ago

I confirm it looks ok now:

snikolaev@dev2:~$ time (for n in `seq 1 100`; do mysql -P9306 -h0 -e "select * from name where match('on'); show meta;"; done > /dev/null)

real    0m8.881s
user    0m0.261s
sys 0m0.289s
snikolaev@dev2:~$ time (for n in `seq 1 100`; do curl -sX POST http://localhost:9308/search -d '{"index" : "name", "query" : {"match" : {"_all": "on"} }, "limit": 0}'; done > /dev/null)

real    0m8.713s
user    0m0.184s
sys 0m0.379s

@PavelShilin89 let's try to cover this with a small CLT test and see how stable it is in the github runners environment. Notes:

  1. You can use the issue's body as the way to populate Manticore with data for the test
  2. It's better to copy the script and the names/surnames from my gist directly to the test dir
  3. Call it github-manticoresearch-1794-curl-vs-mysql-performance, so we know what issue it relates to and what it is about.
  4. To check the performance metrics you can do this:
# Capture the current time in nanoseconds before starting the curl operations
start=$(date +%s%N)

# Execute a curl POST request 100 times in a loop, discard the output
for n in `seq 1 100`; do curl -sX POST http://localhost:9308/search -d '{"index" : "name", "query" : {"match" : {"_all": "on"} }, "limit": 0}'; done > /dev/null

# Capture the current time in nanoseconds after completing the curl operations
end=$(date +%s%N)

# Calculate the duration of the curl operations in milliseconds
curl=$(( (end - start) / 1000000 ))

# Capture the current time in nanoseconds before starting the MySQL operations
start=$(date +%s%N)

# Execute a MySQL query 100 times in a loop, discard the output
for n in `seq 1 100`; do mysql -P9306 -h0 -e "select * from name where match('on'); show meta;"; done > /dev/null

# Capture the current time in nanoseconds after completing the MySQL operations
end=$(date +%s%N)

# Calculate the duration of the MySQL operations in milliseconds
mysql=$(( (end - start) / 1000000 ))

# Print the durations of MySQL and curl operations in milliseconds
echo "Duration: ${mysql} ${curl} ms"

# Use PHP to calculate the percentage difference between the curl and MySQL durations and check if it's >= 1%
# Exits with 0 if the condition is true, otherwise exits with 1
php -r "exit(round(abs($curl - $mysql) / ($curl + $mysql)/2 * 100, 0) >= 1);"

# Print the exit status of the PHP script (0 if percentage difference >= 1%, 1 otherwise)
echo $?

Result on dev2:

Duration: 7847 7782 ms
0

Please create a separate task for yourself about it and put a link here.

PavelShilin89 commented 8 months ago

I'm doing it in https://github.com/orgs/manticoresoftware/projects/3/views/15?pane=issue&itemId=55991220