project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic

Investigate slow load into SBX on 2 Jul 24 #218

Open · brent-hartwig opened this issue 2 weeks ago

brent-hartwig commented 2 weeks ago

Some data loads have started to experience a rate slowdown. At first the slowdown appeared to begin 2 to 2.5 hours into the load, possibly only in environments with forest-level replication enabled. More recently it has happened after four hours (91% to 93% complete) in SBX, which does not have forest-level replication. I created this ticket as a dumping ground for data collected from SBX's 2 Jul 24 load.

| Characteristic | Value |
| --- | --- |
| Start Time (UTC) | 7/2/2024 13:23 |
| End Time (UTC) | 7/2/2024 18:50 |
| Total Execution Time (seconds) | 19857 |
| Target Env | SBX |
| ML Version | 11.2.0 |
| Forests Replicated? | No |
| Source Machine | Pipeline |
| Java Version | 11.0.23 |
| MLCP Version | 11.2.0 |
| Thread Count | 64 |
| Batch Size | 50 |
| Slowdown Percent | 91 |
| Slowdown Time | 7/2/2024 18:08 |
| Duration before Slowdown | 4:44:57 |
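
For reference, a load with the parameters above corresponds to an MLCP invocation along these lines. This is a minimal sketch, not the pipeline's actual command: the host, credentials, and input settings are placeholders, and the permissions are inferred from the perm= parameters visible in the port 8000 access logs below.

```
# Hypothetical MLCP import reflecting the thread count and batch size above.
# Host, credentials, and input path/type are placeholders; the permissions
# mirror the perm= parameters seen in the access logs.
mlcp.sh import \
  -host sbx-ml-lb.example.edu \
  -port 8000 \
  -username lux_deployer_sbx \
  -password '<redacted>' \
  -mode local \
  -input_file_path /data/lux/export \
  -input_file_type documents \
  -document_type json \
  -output_permissions lux-endpoint-consumer,read,lux-writer,update \
  -thread_count 64 \
  -batch_size 50
```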
brent-hartwig commented 2 weeks ago

ML Load Balancer Request Rate

Requests dropped off around 18:00 UTC, implying, but not confirming, that the slowdown is on the requesting side.

[Attached image: ML load balancer request rate]

brent-hartwig commented 2 weeks ago

ML Logs

Documents were loaded via port 8000.

Request logs are not enabled for port 8000. They were of interest because they capture request durations, and so could have shown whether ML was taking longer to process requests after the slowdown.

Access logs are available but too large to attach here (1.1 GB compressed / 14.8 GB uncompressed). Stats from the access logs, trimmed to 13:15 through 19:00, follow; a sketch of shell commands that can produce counts like these appears after the table. Each PUT /insert is a document insert. Not sure what the POST /eval requests are. Access logs do not include durations. Not sure how informative 41m timestamps are.

| Node | All Requests | All % | PUTs | PUTs % | POSTs | POSTs % |
| --- | --- | --- | --- | --- | --- | --- |
| 103 | 13,997,065 | 32.92% | 13,722,522 | 32.92% | 274,543 | 32.92% |
| 16 | 14,575,613 | 34.29% | 14,289,727 | 34.29% | 285,886 | 34.28% |
| 57 | 13,940,089 | 32.79% | 13,666,665 | 32.79% | 273,424 | 32.79% |
| Total | 42,512,767 | 100.00% | 41,678,914 | 100.00% | 833,853 | 100.00% |
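
The sketch referenced above: it assumes files named `<node>-access-trimmed.txt`, as in the tail output further down, and the request-line patterns are my guess at the log contents, so they may need adjusting.

```
# Count total, PUT /insert, and POST /eval requests per trimmed access log.
for f in *-access-trimmed.txt; do
  total=$(wc -l < "$f")
  puts=$(grep -c '"PUT /insert' "$f")
  posts=$(grep -c '"POST /eval' "$f")
  printf '%s  all=%s  put_insert=%s  post_eval=%s\n' "$f" "$total" "$puts" "$posts"
done
```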

I'm not sure why node 16 processed an additional ~500K requests but all nodes finished at the same time:

```
$ tail -n 2 *
==> 103-access-trimmed.txt <==
10.5.156.59 - lux_deployer_sbx [02/Jul/2024:18:54:14 +0000] "PUT /insert?uri=https%3A%2F%2Flux.collections.yale.edu%2Fdata%2Fobject%2Fffffe2f6-5fd9-4956-918c-f05528329303&errok&format=json&perm=Rlux-endpoint-consumer&perm=Ulux-writer&locale=en&tzoffset=0&dbid=11710160318236374257&txnid=c81d6ae9c453dc10&txnmode=update HTTP/1.1" 200 0 - "Java/11.0.23 MarkLogicXCC/11.2-0"
10.5.156.59 - lux_deployer_sbx [02/Jul/2024:18:54:14 +0000] "POST /eval HTTP/1.1" 200 0 - "Java/11.0.23 MarkLogicXCC/11.2-0"

==> 16-access-trimmed.txt <==
10.5.156.59 - lux_deployer_sbx [02/Jul/2024:18:54:13 +0000] "PUT /insert?uri=https%3A%2F%2Flux.collections.yale.edu%2Fdata%2Fobject%2Ffffc6786-6f60-4d05-934d-be4ce2236ecb&errok&format=json&perm=Rlux-endpoint-consumer&perm=Ulux-writer&locale=en&tzoffset=0&dbid=4546125665420467413&txnid=9aaaa555e5b07eb9&txnmode=update HTTP/1.1" 200 0 - "Java/11.0.23 MarkLogicXCC/11.2-0"
10.5.156.59 - lux_deployer_sbx [02/Jul/2024:18:54:13 +0000] "POST /eval HTTP/1.1" 200 0 - "Java/11.0.23 MarkLogicXCC/11.2-0"

==> 57-access-trimmed.txt <==
10.5.156.59 - lux_deployer_sbx [02/Jul/2024:18:54:14 +0000] "PUT /insert?uri=https%3A%2F%2Flux.collections.yale.edu%2Fdata%2Fobject%2Fffff5597-e4e5-400c-b58e-122a4cdcc599&errok&format=json&perm=Rlux-endpoint-consumer&perm=Ulux-writer&locale=en&tzoffset=0&dbid=16860244288764155937&txnid=707aaff233ecb99a&txnmode=update HTTP/1.1" 200 0 - "Java/11.0.23 MarkLogicXCC/11.2-0"
10.5.156.59 - lux_deployer_sbx [02/Jul/2024:18:54:14 +0000] "POST /eval HTTP/1.1" 200 0 - "Java/11.0.23 MarkLogicXCC/11.2-0"

Port 8000's error logs are empty.

The main error logs are large and filled with debug- and info-level messages, likely mostly related to merging.
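
To confirm nothing noteworthy is buried in that volume, a severity filter along these lines can be run against each node's main ErrorLog (a sketch only; file names and paths may differ per environment):

```
# The main ErrorLog is dominated by info/debug merge and save messages.
# This surfaces any entries at Warning severity or above.
grep -E ' (Warning|Error|Critical|Alert|Emergency): ' ErrorLog.txt
```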

brent-hartwig commented 2 weeks ago

MLCP

Rob questioned whether MLCP is having to retry requests and whether we could compare the load balancer's request count to the requests processed by ML. As noted above, there were 42,512,767 requests to port 8000 between 13:15 and 19:00. Someone with AWS Console access would need to pull the ML load balancer's request count for the same window.
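
If CLI access is handy, something along these lines should pull that number from CloudWatch. This is a sketch only: it assumes the ML load balancer is an Application Load Balancer, and the LoadBalancer dimension value is a placeholder.

```
# Sum of the RequestCount metric over the load window (add up the returned
# hourly datapoints). The LoadBalancer dimension value is a placeholder; a
# Network Load Balancer would use a different namespace and metrics.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name RequestCount \
  --dimensions Name=LoadBalancer,Value=app/sbx-ml-lb/0123456789abcdef \
  --start-time 2024-07-02T13:15:00Z \
  --end-time 2024-07-02T19:00:00Z \
  --period 3600 \
  --statistics Sum
```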

Per ML monitoring, port 8000's request queue was empty throughout the load. ML monitoring captures metrics every minute.

For future runs, we can also enable debug-level messages: https://docs.marklogic.com/11.0/guide/mlcp-guide/en/troubleshooting/enabling-debug-level-messages.html

brent-hartwig commented 2 weeks ago

ML Monitoring History

13:15 - 19:00

Screenshots of the following monitoring charts are attached:

- CPU: 01-cpu
- IO: 02-io
- Memory: 03-memory
- XDQP (intra-cluster activity): 04-xdqp
- HTTP Servers: 05-servers
- Network (more intra-cluster activity): 06-network
- Data node characteristics: 07-database

Exports:

01-cpu.xls 02-io.xls 03-memory.xls 04-xdqp.xls 05-servers.xls 06-network.xls 07-database.xls