Closed by ndushay 10 months ago
Factors:
with softCommit max time of 10 seconds (sul-solr-configs)
2024-01-09 11:32:08 -0800 Indexed 500 documents in 226.001 (druid:bj567sc8086 (2023-11-18T00:03:44.088Z) - druid:bm260mf0815 (2023-11-18T00:10:45.454Z))
2024-01-09 11:35:54 -0800 Indexed 500 documents in 220.543 (druid:bj567sc8086 (2023-11-18T00:03:44.082Z) - druid:bm260mf0815 (2023-11-18T00:10:45.448Z))
2024-01-09 11:48:02 -0800 Indexed 500 documents in 224.298 (druid:bp677ms4016 (2023-11-18T00:10:45.454Z) - druid:sq526ps5590 (2023-11-18T00:17:48.878Z))
2024-01-09 11:51:51 -0800 Indexed 500 documents in 221.108 (druid:bp677ms4016 (2023-11-18T00:10:45.454Z) - druid:sq526ps5590 (2023-11-18T00:17:48.878Z))
In building the Solr documents from druids:
2024-01-09 11:48:10 -0800 Starting batch of 500 documents
Got oldest ids from Solr in 0.552
Built 500 Solr docs in 218.787
2024-01-09 11:51:51 -0800 Indexed 500 documents in 221.108 (druid:bp677ms4016 (2023-11-18T00:10:45.454Z) - druid:sq526ps5590 (2023-11-18T00:17:48.878Z))
2024-01-09 11:56:13 -0800 Starting batch of 500 documents
Got oldest ids from Solr in 0.548
Solr doc druid:bn979dz3952 requested from DSA and built in 0.307
Solr doc druid:bq235nz7676 requested from DSA and built in 0.234
...
Built 500 Solr docs in 215.901
2024-01-09 11:59:50 -0800 Indexed 500 documents in 216.472 (druid:bn979dz3952 (2023-11-18T00:17:48.873Z) - druid:bn221nw9016 (2023-11-18T00:24:51.093Z))
small retrieval times:
Solr doc druid:dn721xg0645 retrieved from DSA in 0.03
Solr doc druid:hc626cc6098 retrieved from DSA in 0.03
large retrieval times:
Solr doc druid:bn026fk3330 retrieved from DSA in 0.223
Solr doc druid:bp213pp4137 retrieved from DSA in 0.274
small build times:
Solr doc druid:dn720ys4629 requested from DSA and built in 0.154
Solr doc druid:bj650dj0853 requested from DSA and built in 0.159
Solr doc druid:cn258nd0182 doc built in 0.105
Solr doc druid:bp259tp8127 doc built in 0.106
large build times:
Solr doc druid:bp213pp4137 requested from DSA and built in 0.511
Solr doc druid:bq701fd6543 requested from DSA and built in 0.473
Solr doc druid:hc646zy6262 doc built in 0.377
Solr doc druid:sy440rk8925 doc built in 0.302
Solr doc druid:dn720ys4629 doc built in 0.295
Just spoke with JLitt about next steps. Since the bulk of the time for a batch is spent building the Solr doc (and, for large docs, retrieving the cocina), we are wondering how to speed that up. For the cocina retrieval, avoiding the network call to DSA would help. For indexing, there is also a call to the workflow service that we can gather timing data on.
But all of the above seems really small compared to it costing between 0.15 and 0.5 seconds to retrieve and build each object in a batch from DSA and WFS.
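As a rough check on that per-object range, the batch totals logged above can be divided by the batch size. This is only a sketch: the two totals are copied from the "Built 500 Solr docs in ..." log lines, and the helper name `seconds_per_doc` is made up for illustration.

```python
# Batch totals copied from the "Built 500 Solr docs in ..." log lines above:
# (documents in the batch, seconds spent building them).
batches = [(500, 218.787), (500, 215.901)]


def seconds_per_doc(doc_count: int, build_seconds: float) -> float:
    """Average time to retrieve and build one Solr doc in a batch."""
    return build_seconds / doc_count


for doc_count, build_seconds in batches:
    # Prints 0.438 and 0.432, both inside the 0.15 to 0.5 second range.
    print(f"{seconds_per_doc(doc_count, build_seconds):.3f} s per doc")
```

Both averages sit near the top of the 0.15 to 0.5 second range, which is consistent with the per-document build step dominating the batch.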
For example: batches take around 3.5 minutes to run. Let's round that up to 240 seconds (4 min). A 10-second pause between batches is about 4% of the total. Building each Solr doc in the batch takes up the bulk of the time.
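The same comparison as a small calculation, using the rounded 240-second batch from the paragraph above and the 218.787 / 221.108 second figures from the 11:48 batch log. The function name `share` is illustrative only.

```python
def share(part_seconds: float, total_seconds: float) -> float:
    """Return part as a percentage of total."""
    return 100.0 * part_seconds / total_seconds


# A 10 second pause against a batch rounded up to 240 seconds.
pause_share = share(10, 240)

# Time spent building Solr docs vs. total time for the same batch
# ("Built 500 Solr docs in 218.787" / "Indexed 500 documents in 221.108").
build_share = share(218.787, 221.108)

print(f"pause: {pause_share:.2f}% of a batch")
print(f"building docs: {build_share:.2f}% of a batch")
```

The pause comes out a little over 4%, while building the docs accounts for roughly 99% of the measured batch time.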
About a 50-second reduction in batch time (from roughly 220 seconds to roughly 170 seconds):
2024-01-09 14:35:43 -0800 Starting batch of 500 documents
Got oldest ids from Solr in 0.547
Built 500 Solr docs in 166.373
2024-01-09 14:38:32 -0800 Indexed 500 documents in 168.599 (druid:bm461jx4964 (2023-11-18T02:51:40.897Z) - druid:kk200jg8217 (2023-11-18T02:58:38.813Z))
2024-01-09 14:38:37 -0800 Starting batch of 500 documents
Got oldest ids from Solr in 0.568
Built 500 Solr docs in 169.322
2024-01-09 14:41:29 -0800 Indexed 500 documents in 171.635 (druid:bq288hd9882 (2023-11-18T02:58:38.813Z) - druid:kz460jq7046 (2023-11-18T03:05:41.246Z))
2024-01-09 14:41:34 -0800 Starting batch of 500 documents
Got oldest ids from Solr in 0.557
Built 500 Solr docs in 169.073
2024-01-09 14:44:25 -0800 Indexed 500 documents in 171.362 (druid:bm681rn1581 (2023-11-18T03:05:41.246Z) - druid:bq930db5475 (2023-11-18T03:12:44.681Z))
2024-01-09 14:44:30 -0800 Starting batch of 500 documents
Got oldest ids from Solr in 0.523
Built 500 Solr docs in 171.449
2024-01-09 14:47:24 -0800 Indexed 500 documents in 173.821 (druid:xw918yy2280 (2023-11-18T03:12:44.681Z) - druid:cp086pp3847 (2023-11-18T03:19:56.908Z))
2024-01-09 14:47:29 -0800 Starting batch of 500 documents
Got oldest ids from Solr in 0.558
Built 500 Solr docs in 173.307
2024-01-09 14:50:25 -0800 Indexed 500 documents in 175.646 (druid:cp088bg0757 (2023-11-18T03:19:56.913Z) - druid:hx719jn8800 (2023-11-18T03:27:03.263Z))
FAST. Hundredths of a second for retrieval and for building the Solr doc fields:
retrieving all_workflows for druid:bm280jb5000 took 0.016 seconds
building workflows for druid:bm280jb5000 took 0.048 seconds
retrieving all_workflows for druid:bm210yr2664 took 0.017 seconds
building workflows for druid:bm210yr2664 took 0.055 seconds
retrieving all_workflows for druid:fh718rb8877 took 0.011 seconds
building workflows for druid:fh718rb8877 took 0.047 seconds
One batch of 500 docs takes less than 3 min. There are 1440 minutes in a day.
1440 / 3 = 480 batches per day
480 * 500 = 240,000 docs per day
5,235,000 docs in SDR / 240,000 docs per day = about 22 days
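The estimate above, written out so the inputs can be changed. This is a sketch under the same assumptions as the comment (batches run back to back, around the clock, at a constant 3 minutes each); `days_to_reindex` is a hypothetical helper name.

```python
MINUTES_PER_DAY = 24 * 60  # 1440


def days_to_reindex(total_docs: int, batch_size: int, minutes_per_batch: float) -> float:
    """Days needed to index total_docs if batches run continuously."""
    batches_per_day = MINUTES_PER_DAY / minutes_per_batch
    docs_per_day = batches_per_day * batch_size
    return total_docs / docs_per_day


# 5,235,000 docs in SDR, 500 docs per batch, 3 minutes per batch.
days = days_to_reindex(5_235_000, 500, 3)
print(f"{days:.1f} days")  # 21.8 days, i.e. about 22
```

Halving `minutes_per_batch` halves the result, so per-batch time is the number to drive down.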
The time goes to creating Solr documents from cocina, and sometimes to retrieving the cocina documents.
JLitt did a spike on rolling index functionality in DSA and it more than halves the time per batch:
2024-01-10 10:26:03 -0800 Indexed 500 documents in 84.539
2024-01-10 10:27:25 -0800 Indexed 500 documents in 81.958
2024-01-10 10:28:48 -0800 Indexed 500 documents in 83.142
vs
2024-01-10 10:44:21 -0800 Indexed 500 documents in 188.683 (druid:st236kb2146 (2023-11-20T00:10:40.770Z) - druid:rd123yh6977 (2023-11-20T00:17:47.668Z))
2024-01-10 10:47:35 -0800 Indexed 500 documents in 188.691 (druid:vr803zc9077 (2023-11-20T00:17:47.668Z) - druid:sv636vf9983 (2023-11-20T00:24:56.168Z))
2024-01-10 10:50:45 -0800 Indexed 500 documents in 184.791 (druid:vs279py5646 (2023-11-20T00:24:56.168Z) - druid:vw256xg3924 (2023-11-20T00:32:05.945Z))
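To put a number on "more than halves", the durations can be pulled out of the two sets of log lines and averaged. A minimal sketch: the log text is pasted from above (druid ranges trimmed), and the regex assumes the "Indexed N documents in SECONDS" wording shown in these logs.

```python
import re
from statistics import mean

# Log lines from the spike (rolling index functionality in DSA).
SPIKE_LOG = """\
2024-01-10 10:26:03 -0800 Indexed 500 documents in 84.539
2024-01-10 10:27:25 -0800 Indexed 500 documents in 81.958
2024-01-10 10:28:48 -0800 Indexed 500 documents in 83.142
"""

# Log lines from the existing indexer (druid ranges trimmed for brevity).
BASELINE_LOG = """\
2024-01-10 10:44:21 -0800 Indexed 500 documents in 188.683
2024-01-10 10:47:35 -0800 Indexed 500 documents in 188.691
2024-01-10 10:50:45 -0800 Indexed 500 documents in 184.791
"""

PATTERN = re.compile(r"Indexed (\d+) documents in (\d+(?:\.\d+)?)")


def batch_seconds(log_text: str) -> list[float]:
    """Extract the per-batch duration in seconds from each matching log line."""
    return [float(match.group(2)) for match in PATTERN.finditer(log_text)]


spike_mean = mean(batch_seconds(SPIKE_LOG))
baseline_mean = mean(batch_seconds(BASELINE_LOG))
ratio = spike_mean / baseline_mean

# Prints: spike 83.2 s, baseline 187.4 s, ratio 0.44
print(f"spike {spike_mean:.1f} s, baseline {baseline_mean:.1f} s, ratio {ratio:.2f}")
```

With only three batches on each side this is an indication rather than a benchmark, but the spike batches take about 44% of the baseline time.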
Next steps will not be in this ticket
See https://docs.google.com/document/d/1B61r-E9v2WhYQ_ABP-RTX61xfbEQ25H8Sm0Wwu5zENA for analysis of the problem and first steps taken.
At this time, our settings are:
I think the biggest gains will be from:
The risks of experiments are low: