brent-hartwig opened 2 weeks ago

Some data loads have started to experience a rate slowdown. At first it seemed to hit 2 to 2.5 hours into the load, possibly in environments with forest-level replication enabled. More recently it has happened after four hours (91%-93% complete), in SBX without forest-level replication. I created this ticket as a dumping ground for data collected from SBX's 2 Jul 24 load.
Requests dropped off around 1800 UTC, suggesting but not confirming that the slowdown is on the requesting side.
Documents were loaded via port 8000.
Request logs are not enabled for port 8000. They would have been of interest as they capture request durations and might have shown whether ML was taking longer to process requests after the slowdown.
Access logs are available. They are too large to attach here (1.1 GB compressed / 14.8 GB uncompressed). Stats from the access logs, trimmed to 13:15 through 19:00, follow (a sketch of the trimming approach is directly below). Each `PUT /insert` is a document insert. Not sure what the `POST /eval` requests are. Access logs do not include durations. Not sure how informative ~41M timestamps are.
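For reference, a minimal sketch of how the logs could have been trimmed to that window, assuming the Common Log Format timestamps shown in the `tail` output further down (file names are illustrative):

```bash
# Keep lines whose timestamp falls in 13:15-19:00 on 02/Jul/2024.
# Splits on [ and ] to isolate the timestamp field, then compares HHMM as a string.
awk -F'[][]' '{
  split($2, t, ":")                      # t[1]=date, t[2]=HH, t[3]=MM
  hm = t[2] t[3]                         # e.g. "1854"
  if (t[1] == "02/Jul/2024" && hm >= "1315" && hm <= "1900") print
}' 103-access.txt > 103-access-trimmed.txt
```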
| Node | All Requests | All % | PUTs | PUTs % | POSTs | POSTs % |
|---|---|---|---|---|---|---|
| 103 | 13,997,065 | 32.92% | 13,722,522 | 32.92% | 274,543 | 32.92% |
| 16 | 14,575,613 | 34.29% | 14,289,727 | 34.29% | 285,886 | 34.28% |
| 57 | 13,940,089 | 32.79% | 13,666,665 | 32.79% | 273,424 | 32.79% |
| Total | 42,512,767 | 100.00% | 41,678,914 | 100.00% | 833,853 | 100.00% |
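For reproducibility, a minimal sketch of how counts like these can be derived from the trimmed files (names as in the `tail` output below); it assumes every request is either a `PUT /insert` or a `POST /eval`, which the per-node totals above bear out:

```bash
for f in 103-access-trimmed.txt 16-access-trimmed.txt 57-access-trimmed.txt; do
  puts=$(grep -c '"PUT /insert' "$f")    # document inserts
  posts=$(grep -c '"POST /eval' "$f")    # XCC eval calls
  echo "$f: all=$((puts + posts)) puts=$puts posts=$posts"
done
```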
I'm not sure why node 16 processed ~500K more requests, yet all three nodes finished within a second of each other:
```
$ tail -n 2 *
==> 103-access-trimmed.txt <==
10.5.156.59 - lux_deployer_sbx [02/Jul/2024:18:54:14 +0000] "PUT /insert?uri=https%3A%2F%2Flux.collections.yale.edu%2Fdata%2Fobject%2Fffffe2f6-5fd9-4956-918c-f05528329303&errok&format=json&perm=Rlux-endpoint-consumer&perm=Ulux-writer&locale=en&tzoffset=0&dbid=11710160318236374257&txnid=c81d6ae9c453dc10&txnmode=update HTTP/1.1" 200 0 - "Java/11.0.23 MarkLogicXCC/11.2-0"
10.5.156.59 - lux_deployer_sbx [02/Jul/2024:18:54:14 +0000] "POST /eval HTTP/1.1" 200 0 - "Java/11.0.23 MarkLogicXCC/11.2-0"

==> 16-access-trimmed.txt <==
10.5.156.59 - lux_deployer_sbx [02/Jul/2024:18:54:13 +0000] "PUT /insert?uri=https%3A%2F%2Flux.collections.yale.edu%2Fdata%2Fobject%2Ffffc6786-6f60-4d05-934d-be4ce2236ecb&errok&format=json&perm=Rlux-endpoint-consumer&perm=Ulux-writer&locale=en&tzoffset=0&dbid=4546125665420467413&txnid=9aaaa555e5b07eb9&txnmode=update HTTP/1.1" 200 0 - "Java/11.0.23 MarkLogicXCC/11.2-0"
10.5.156.59 - lux_deployer_sbx [02/Jul/2024:18:54:13 +0000] "POST /eval HTTP/1.1" 200 0 - "Java/11.0.23 MarkLogicXCC/11.2-0"

==> 57-access-trimmed.txt <==
10.5.156.59 - lux_deployer_sbx [02/Jul/2024:18:54:14 +0000] "PUT /insert?uri=https%3A%2F%2Flux.collections.yale.edu%2Fdata%2Fobject%2Fffff5597-e4e5-400c-b58e-122a4cdcc599&errok&format=json&perm=Rlux-endpoint-consumer&perm=Ulux-writer&locale=en&tzoffset=0&dbid=16860244288764155937&txnid=707aaff233ecb99a&txnmode=update HTTP/1.1" 200 0 - "Java/11.0.23 MarkLogicXCC/11.2-0"
10.5.156.59 - lux_deployer_sbx [02/Jul/2024:18:54:14 +0000] "POST /eval HTTP/1.1" 200 0 - "Java/11.0.23 MarkLogicXCC/11.2-0"
```
8000's error logs are empty.
The main error logs are large and filled with debug- and info-level messages, likely mostly related to merging.
Rob questioned whether MLCP is having to retry requests and whether we could compare the load balancer's request count to the requests processed by ML. As noted above, there were 42,512,767 requests to port 8000 between 13:15 and 19:00. Someone with AWS Console access would need to pull the ML load balancer's request count for the same window.
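If the cluster sits behind an Application Load Balancer, something like this could pull that count from CloudWatch (the load-balancer dimension value is a placeholder; an NLB exposes different metrics):

```bash
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name RequestCount \
  --dimensions Name=LoadBalancer,Value=app/ml-sbx-lb/0123456789abcdef \
  --start-time 2024-07-02T13:15:00Z \
  --end-time 2024-07-02T19:00:00Z \
  --period 300 --statistics Sum
```

Summing the returned datapoints and comparing against 42,512,767 would indicate whether MLCP retried a material number of requests.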
Per ML monitoring, 8000's request queue was empty during the load. ML monitoring captures metrics every minute.
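If we want the underlying numbers rather than dashboard screenshots, the Management API can return the same server metrics; a sketch, assuming the default Management API on port 8002 (host and credentials are placeholders):

```bash
# Raw (per-minute) HTTP server metrics, including queue sizes, as JSON.
curl --anyauth -u admin:password \
  "http://ml-sbx-node:8002/manage/v2/servers?view=metrics&period=raw&format=json"
```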
For future runs, we can also enable debug-level messages: https://docs.marklogic.com/11.0/guide/mlcp-guide/en/troubleshooting/enabling-debug-level-messages.html
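For older MLCP distributions this was a one-line log4j change; a sketch from memory (the linked page is authoritative for the exact file and logger names in MLCP 11, which moved to log4j2):

```
# conf/log4j.properties (bundled with older MLCP distributions)
log4j.logger.com.marklogic.contentpump=DEBUG
log4j.logger.com.marklogic.mapreduce=DEBUG
```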
Monitoring chart screenshots for 13:15 - 19:00 (images omitted here):

- CPU
- IO
- Memory
- XDQP (intra-cluster activity)
- HTTP Servers
- Network (more intra-cluster activity)
- Data node characteristics

Exports: 01-cpu.xls, 02-io.xls, 03-memory.xls, 04-xdqp.xls, 05-servers.xls, 06-network.xls, 07-database.xls