project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic

Investigate slow load into SBX on 2 Jul 24 #218

Open · brent-hartwig opened this issue 2 weeks ago

brent-hartwig commented 2 weeks ago

Some data loads have started to experience a rate slowdown. At first the slowdown appeared to begin 2 to 2.5 hours into the load, possibly only in environments with forest-level replication enabled. More recently it has happened after four hours (91% to 93% complete) in SBX, which does not have forest-level replication. I created this ticket as a dumping ground for data collected from SBX's 2 Jul 24 load.

| Characteristic | Value |
| --- | --- |
| Start Time (UTC) | 7/2/2024 13:23 |
| End Time (UTC) | 7/2/2024 18:50 |
| Total Execution Time (seconds) | 19857 |
| Target Env | SBX |
| ML Version | 11.2.0 |
| Forests Replicated? | No |
| Source Machine | Pipeline |
| Java Version | 11.0.23 |
| MLCP Version | 11.2.0 |
| Thread Count | 64 |
| Batch Size | 50 |
| Slowdown Percent | 91 |
| Slowdown Time | 7/2/2024 18:08 |
| Duration before Slowdown | 4:44:57 |
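
For reference, a load with the parameters above corresponds to an MLCP invocation along these lines. This is a minimal sketch, not the pipeline's actual command: the host, credentials, and input settings are placeholders, and the permissions are inferred from the perm= parameters visible in the port 8000 access logs below.

```
# Hypothetical MLCP import reflecting the thread count and batch size above.
# Host, credentials, and input path/type are placeholders; the permissions
# mirror the perm= parameters seen in the access logs.
mlcp.sh import \
  -host sbx-ml-lb.example.edu \
  -port 8000 \
  -username lux_deployer_sbx \
  -password '<redacted>' \
  -mode local \
  -input_file_path /data/lux/export \
  -input_file_type documents \
  -document_type json \
  -output_permissions lux-endpoint-consumer,read,lux-writer,update \
  -thread_count 64 \
  -batch_size 50
```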
brent-hartwig commented 2 weeks ago

ML Load Balancer Request Rate

Requests dropped off around 18:00 UTC, implying, but not confirming, that the slowdown is on the requesting side.

[Attached image: ML load balancer request rate]

brent-hartwig commented 2 weeks ago

ML Logs

Documents were loaded via port 8000.

Request logs are not enabled for port 8000. They were of interest because they capture request durations, and so could have shown whether ML was taking longer to process requests after the slowdown.

Access logs are available but too large to attach here (1.1 GB compressed / 14.8 GB uncompressed). Stats from the access logs, trimmed to 13:15 through 19:00, follow; a sketch of shell commands that can produce counts like these appears after the table. Each PUT /insert is a document insert. Not sure what the POST /eval requests are. Access logs do not include durations. Not sure how informative 41m timestamps are.

| Node | All Requests | All % | PUTs | PUTs % | POSTs | POSTs % |
| --- | --- | --- | --- | --- | --- | --- |
| 103 | 13,997,065 | 32.92% | 13,722,522 | 32.92% | 274,543 | 32.92% |
| 16 | 14,575,613 | 34.29% | 14,289,727 | 34.29% | 285,886 | 34.28% |
| 57 | 13,940,089 | 32.79% | 13,666,665 | 32.79% | 273,424 | 32.79% |
| Total | 42,512,767 | 100.00% | 41,678,914 | 100.00% | 833,853 | 100.00% |
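
The sketch referenced above: it assumes files named `<node>-access-trimmed.txt`, as in the tail output further down, and the request-line patterns are my guess at the log contents, so they may need adjusting.

```
# Count total, PUT /insert, and POST /eval requests per trimmed access log.
for f in *-access-trimmed.txt; do
  total=$(wc -l < "$f")
  puts=$(grep -c '"PUT /insert' "$f")
  posts=$(grep -c '"POST /eval' "$f")
  printf '%s  all=%s  put_insert=%s  post_eval=%s\n' "$f" "$total" "$puts" "$posts"
done
```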

I'm not sure why node 16 processed an additional ~500K requests but all nodes finished at the same time:

```
$ tail -n 2 *
==> 103-access-trimmed.txt <==
10.5.156.59 - lux_deployer_sbx [02/Jul/2024:18:54:14 +0000] "PUT /insert?uri=https%3A%2F%2Flux.collections.yale.edu%2Fdata%2Fobject%2Fffffe2f6-5fd9-4956-918c-f05528329303&errok&format=json&perm=Rlux-endpoint-consumer&perm=Ulux-writer&locale=en&tzoffset=0&dbid=11710160318236374257&txnid=c81d6ae9c453dc10&txnmode=update HTTP/1.1" 200 0 - "Java/11.0.23 MarkLogicXCC/11.2-0"
10.5.156.59 - lux_deployer_sbx [02/Jul/2024:18:54:14 +0000] "POST /eval HTTP/1.1" 200 0 - "Java/11.0.23 MarkLogicXCC/11.2-0"

==> 16-access-trimmed.txt <==
10.5.156.59 - lux_deployer_sbx [02/Jul/2024:18:54:13 +0000] "PUT /insert?uri=https%3A%2F%2Flux.collections.yale.edu%2Fdata%2Fobject%2Ffffc6786-6f60-4d05-934d-be4ce2236ecb&errok&format=json&perm=Rlux-endpoint-consumer&perm=Ulux-writer&locale=en&tzoffset=0&dbid=4546125665420467413&txnid=9aaaa555e5b07eb9&txnmode=update HTTP/1.1" 200 0 - "Java/11.0.23 MarkLogicXCC/11.2-0"
10.5.156.59 - lux_deployer_sbx [02/Jul/2024:18:54:13 +0000] "POST /eval HTTP/1.1" 200 0 - "Java/11.0.23 MarkLogicXCC/11.2-0"

==> 57-access-trimmed.txt <==
10.5.156.59 - lux_deployer_sbx [02/Jul/2024:18:54:14 +0000] "PUT /insert?uri=https%3A%2F%2Flux.collections.yale.edu%2Fdata%2Fobject%2Fffff5597-e4e5-400c-b58e-122a4cdcc599&errok&format=json&perm=Rlux-endpoint-consumer&perm=Ulux-writer&locale=en&tzoffset=0&dbid=16860244288764155937&txnid=707aaff233ecb99a&txnmode=update HTTP/1.1" 200 0 - "Java/11.0.23 MarkLogicXCC/11.2-0"
10.5.156.59 - lux_deployer_sbx [02/Jul/2024:18:54:14 +0000] "POST /eval HTTP/1.1" 200 0 - "Java/11.0.23 MarkLogicXCC/11.2-0"

Port 8000's error logs are empty.

The main error logs are large and filled with debug- and info-level messages, likely mostly related to merging.
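
To confirm nothing noteworthy is buried in that volume, a severity filter along these lines can be run against each node's main ErrorLog (a sketch only; file names and paths may differ per environment):

```
# The main ErrorLog is dominated by info/debug merge and save messages.
# This surfaces any entries at Warning severity or above.
grep -E ' (Warning|Error|Critical|Alert|Emergency): ' ErrorLog.txt
```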

brent-hartwig commented 2 weeks ago

MLCP

Rob questioned whether MLCP is having to retry requests and whether we could compare the load balancer's request count to the requests processed by ML. As noted above, there were 42,512,767 requests to port 8000 between 13:15 and 19:00. Someone with AWS Console access would need to pull the ML load balancer's request count for the same window.
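
If CLI access is handy, something along these lines should pull that number from CloudWatch. This is a sketch only: it assumes the ML load balancer is an Application Load Balancer, and the LoadBalancer dimension value is a placeholder.

```
# Sum of the RequestCount metric over the load window (add up the returned
# hourly datapoints). The LoadBalancer dimension value is a placeholder; a
# Network Load Balancer would use a different namespace and metrics.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name RequestCount \
  --dimensions Name=LoadBalancer,Value=app/sbx-ml-lb/0123456789abcdef \
  --start-time 2024-07-02T13:15:00Z \
  --end-time 2024-07-02T19:00:00Z \
  --period 3600 \
  --statistics Sum
```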

Per ML monitoring, port 8000's request queue was empty throughout the load. ML monitoring captures metrics every minute.

For future runs, we can also enable debug-level messages: https://docs.marklogic.com/11.0/guide/mlcp-guide/en/troubleshooting/enabling-debug-level-messages.html

brent-hartwig commented 2 weeks ago

ML Monitoring History

13:15 - 19:00

Screenshots of the following monitoring charts are attached:

- CPU: 01-cpu
- IO: 02-io
- Memory: 03-memory
- XDQP (intra-cluster activity): 04-xdqp
- HTTP Servers: 05-servers
- Network (more intra-cluster activity): 06-network
- Data node characteristics: 07-database

Exports:

01-cpu.xls 02-io.xls 03-memory.xls 04-xdqp.xls 05-servers.xls 06-network.xls 07-database.xls