sul-dlss-deprecated / dor_indexing_app

An indexing API for Stanford's Digital Object Repository
https://sul-dlss-deprecated.github.io/dor_indexing_app/
Apache License 2.0

Rolling re-indexing: improve throughput #1082

Closed ndushay closed 10 months ago

ndushay commented 10 months ago

See https://docs.google.com/document/d/1B61r-E9v2WhYQ_ABP-RTX61xfbEQ25H8Sm0Wwu5zENA for analysis of the problem and first steps taken.

At this time, our settings are:

I think the biggest gains will be from:

The risks of experiments are low:

ndushay commented 10 months ago

Configuring pause between batches

Factors:

Data:

with softCommit max time of 10 seconds (sul-solr-configs)

ndushay commented 10 months ago

Where do we spend time:

In building the Solr documents from druids:

2024-01-09 11:48:10 -0800   Starting batch of 500 documents
    Got oldest ids from Solr in 0.552
    Built 500 Solr docs in 218.787
2024-01-09 11:51:51 -0800   Indexed 500 documents in 221.108 (druid:bp677ms4016 (2023-11-18T00:10:45.454Z) - druid:sq526ps5590 (2023-11-18T00:17:48.878Z))
2024-01-09 11:56:13 -0800       Starting batch of 500 documents
        Got oldest ids from Solr in 0.548
        Solr doc druid:bn979dz3952 requested from DSA and built in 0.307
        Solr doc druid:bq235nz7676 requested from DSA and built in 0.234
        ...
        Built 500 Solr docs in 215.901
2024-01-09 11:59:50 -0800       Indexed 500 documents in 216.472 (druid:bn979dz3952 (2023-11-18T00:17:48.873Z) - druid:bn221nw9016 (2023-11-18T00:24:51.093Z))
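
A minimal sketch, not part of the DIA code, of how the share of each batch spent building Solr docs could be pulled out of log output like the above. The regexes and the `build_share` helper are illustrative names that assume the line formats shown in these excerpts.

```python
import re

# Log lines copied from the two batches above (per-document lines omitted).
LOG = """\
2024-01-09 11:48:10 -0800   Starting batch of 500 documents
    Built 500 Solr docs in 218.787
2024-01-09 11:51:51 -0800   Indexed 500 documents in 221.108 (druid:bp677ms4016 (2023-11-18T00:10:45.454Z) - druid:sq526ps5590 (2023-11-18T00:17:48.878Z))
2024-01-09 11:56:13 -0800       Starting batch of 500 documents
        Built 500 Solr docs in 215.901
2024-01-09 11:59:50 -0800       Indexed 500 documents in 216.472 (druid:bn979dz3952 (2023-11-18T00:17:48.873Z) - druid:bn221nw9016 (2023-11-18T00:24:51.093Z))
"""

# Assumed line formats, taken from the excerpts above.
BUILT = re.compile(r"Built \d+ Solr docs in ([\d.]+)")
INDEXED = re.compile(r"Indexed \d+ documents in ([\d.]+)")


def build_share(log_text):
    """Return build_seconds / batch_seconds for each batch, in log order."""
    built = [float(m.group(1)) for m in BUILT.finditer(log_text)]
    indexed = [float(m.group(1)) for m in INDEXED.finditer(log_text)]
    return [b / i for b, i in zip(built, indexed)]


shares = build_share(LOG)
for share in shares:
    print(f"{share:.1%} of batch time spent building Solr docs")
```

For both batches the build step accounts for nearly all of the batch time, which is why the questions that follow focus on the per-document build.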

Is it from network time or processing time to build them?

small retrieval times:

    Solr doc druid:dn721xg0645 retrieved from DSA in 0.03
    Solr doc druid:hc626cc6098 retrieved from DSA in 0.03

large retrieval times:

    Solr doc druid:bn026fk3330 retrieved from DSA in 0.223
    Solr doc druid:bp213pp4137 retrieved from DSA in 0.274

small build times:

    Solr doc druid:dn720ys4629 requested from DSA and built in 0.154
    Solr doc druid:bj650dj0853 requested from DSA and built in 0.159
    Solr doc druid:cn258nd0182 doc built in 0.105
    Solr doc druid:bp259tp8127 doc built in 0.106

large build times:

    Solr doc druid:bp213pp4137 requested from DSA and built in 0.511
    Solr doc druid:bq701fd6543 requested from DSA and built in 0.473
    Solr doc druid:hc646zy6262 doc built in 0.377
    Solr doc druid:sy440rk8925 doc built in 0.302
    Solr doc druid:dn720ys4629 doc built in 0.295

ndushay commented 10 months ago

Just spoke with JLitt about next steps. Since the bulk of the time for a batch is spent building the Solr doc (and, for large docs, retrieving the cocina), we are wondering how to speed that up. For the cocina retrieval, avoiding the network call to DSA would help. Indexing also makes a call to the workflow service, which we can collect timing data on.

Next steps:

  1. get some data on the workflow retrieval times
  2. reduce the time between documents from 0.2 to 0.1 seconds (500-doc batch times are 50 seconds faster)
  3. decrease the softCommit maxTime to 5 seconds and the wait between batches to 5.1 seconds and see what happens
  4. (eventually) try increasing the batch size

But all of the above seems small compared to the 0.15 to 0.5 seconds it costs to retrieve and build each object in a batch from DSA and WFS.

For example: batches take around 3.5 minutes to run. Let's round that up to 240 seconds (4 min). A 10-second pause between batches is about 4% of the total. Building the Solr docs in the batch takes up the bulk of the time.
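
The arithmetic behind these estimates can be sketched as below. The function names are illustrative only, and the sketch assumes the per-document wait is a fixed sleep applied once per document in the batch.

```python
BATCH_SIZE = 500


def sleep_savings(batch_size, old_sleep, new_sleep):
    """Seconds saved per batch when the per-document sleep is shortened."""
    return batch_size * (old_sleep - new_sleep)


def pause_share(pause_seconds, batch_seconds):
    """Fraction of a batch's run time that the between-batch pause represents."""
    return pause_seconds / batch_seconds


def build_cost(batch_size, seconds_per_doc):
    """Seconds per batch spent retrieving and building documents."""
    return batch_size * seconds_per_doc


saved = sleep_savings(BATCH_SIZE, 0.2, 0.1)
share = pause_share(10, 240)
low, high = build_cost(BATCH_SIZE, 0.15), build_cost(BATCH_SIZE, 0.5)

print(f"per-document sleep change saves about {saved:.0f} s per batch")
print(f"10 s pause is about {share:.1%} of a 240 s batch")
print(f"retrieve + build costs between {low:.0f} s and {high:.0f} s per batch")
```

Under these assumptions the between-batch pause is a small fraction of the run, while the per-document retrieve and build cost alone could add up to anywhere from 75 to 250 seconds per batch.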

Results

About a 50 second reduction in batch time.

2024-01-09 14:35:43 -0800   Starting batch of 500 documents
    Got oldest ids from Solr in 0.547
    Built 500 Solr docs in 166.373
2024-01-09 14:38:32 -0800   Indexed 500 documents in 168.599 (druid:bm461jx4964 (2023-11-18T02:51:40.897Z) - druid:kk200jg8217 (2023-11-18T02:58:38.813Z))
2024-01-09 14:38:37 -0800   Starting batch of 500 documents
    Got oldest ids from Solr in 0.568
    Built 500 Solr docs in 169.322
2024-01-09 14:41:29 -0800   Indexed 500 documents in 171.635 (druid:bq288hd9882 (2023-11-18T02:58:38.813Z) - druid:kz460jq7046 (2023-11-18T03:05:41.246Z))
2024-01-09 14:41:34 -0800   Starting batch of 500 documents
    Got oldest ids from Solr in 0.557
    Built 500 Solr docs in 169.073
2024-01-09 14:44:25 -0800   Indexed 500 documents in 171.362 (druid:bm681rn1581 (2023-11-18T03:05:41.246Z) - druid:bq930db5475 (2023-11-18T03:12:44.681Z))
2024-01-09 14:44:30 -0800   Starting batch of 500 documents
    Got oldest ids from Solr in 0.523
    Built 500 Solr docs in 171.449
2024-01-09 14:47:24 -0800   Indexed 500 documents in 173.821 (druid:xw918yy2280 (2023-11-18T03:12:44.681Z) - druid:cp086pp3847 (2023-11-18T03:19:56.908Z))
2024-01-09 14:47:29 -0800   Starting batch of 500 documents
    Got oldest ids from Solr in 0.558
    Built 500 Solr docs in 173.307
2024-01-09 14:50:25 -0800   Indexed 500 documents in 175.646 (druid:cp088bg0757 (2023-11-18T03:19:56.913Z) - druid:hx719jn8800 (2023-11-18T03:27:03.263Z))
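
As a rough check on that figure, the "Indexed 500 documents in ..." times from the earlier batches (before the change) and from the five batches above can be averaged. This is only a sketch over the handful of batches quoted in this thread, not a benchmark.

```python
from statistics import mean

# "Indexed 500 documents in N" values copied from the logs in this thread.
before = [221.108, 216.472]
after = [168.599, 171.635, 171.362, 173.821, 175.646]

mean_before = mean(before)
mean_after = mean(after)
reduction = mean_before - mean_after

print(f"mean before: {mean_before:.1f} s")
print(f"mean after:  {mean_after:.1f} s")
print(f"reduction:   {reduction:.1f} s per batch")
```

The measured reduction for this small sample is a little under 50 seconds, in line with what halving the per-document wait would be expected to save.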

ndushay commented 10 months ago

Workflow retrieval and indexing times:

FAST. Hundredths of a second for retrieval and for building the Solr doc fields.

     retrieving all_workflows for druid:bm280jb5000 took 0.016 seconds
     building workflows for druid:bm280jb5000 took 0.048 seconds
     retrieving all_workflows for druid:bm210yr2664 took 0.017 seconds
     building workflows for druid:bm210yr2664 took 0.055 seconds
     retrieving all_workflows for druid:fh718rb8877 took 0.011 seconds
     building workflows for druid:fh718rb8877 took 0.047 seconds

ndushay commented 10 months ago

How Long Does Full Reindex of Argo Take?

One batch of 500 docs takes less than 3 min. There are 1440 minutes in a day.

    1440 / 3 = 480 batches per day
    480 * 500 = 240,000 docs per day
    5,235,000 docs in SDR / 240,000 docs per day = about 22 days
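
The same estimate, written as a small function so it can be rerun with other batch sizes or batch times. `reindex_days` is an illustrative name, and the sketch assumes batches run back to back around the clock.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def reindex_days(total_docs, batch_size, batch_seconds):
    """Days needed to reindex total_docs, assuming continuous batches."""
    batches_per_day = SECONDS_PER_DAY / batch_seconds
    docs_per_day = batches_per_day * batch_size
    return total_docs / docs_per_day


# Figures from the estimate above: 5,235,000 docs, 500 per batch, 3 minutes per batch.
days = reindex_days(5_235_000, 500, 180)
print(f"about {days:.1f} days for a full reindex")
```

Because the result scales linearly with the time per batch, any reduction in per-batch time shortens the full reindex by the same proportion.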

ndushay commented 10 months ago

Where is the most time spent for reindexing?

Creating Solr documents from cocina. Sometimes in retrieving cocina documents.

Remaining To Do

  1. Consider moving DIA code to DSA to speed up indexing.
    • [ ] JLitt is stewing about the best way to do this (make it a gem?)
  2. Finalize the latest solrconfig.xml changes
    • [x] PR Merged to master branch
    • [x] Apply to prod, qa, stage
    • [x] Change solrconfig.xml in Argo local repo solr sul-dlss/argo/pull/4305
    • [x] ditto DSA sul-dlss/dor-services-app/pull/4674
  3. Finalize the latest DIA changes
    • [x] PR merged to master
    • [x] Deployed to prod, stage, qa
  4. Increase the batch size?

ndushay commented 10 months ago

JLitt did a spike on rolling index functionality in DSA and it more than halves the time per batch:

2024-01-10 10:26:03 -0800       Indexed 500 documents in 84.539
2024-01-10 10:27:25 -0800       Indexed 500 documents in 81.958
2024-01-10 10:28:48 -0800       Indexed 500 documents in 83.142

vs

2024-01-10 10:44:21 -0800   Indexed 500 documents in 188.683 (druid:st236kb2146 (2023-11-20T00:10:40.770Z) - druid:rd123yh6977 (2023-11-20T00:17:47.668Z))
2024-01-10 10:47:35 -0800   Indexed 500 documents in 188.691 (druid:vr803zc9077 (2023-11-20T00:17:47.668Z) - druid:sv636vf9983 (2023-11-20T00:24:56.168Z))
2024-01-10 10:50:45 -0800   Indexed 500 documents in 184.791 (druid:vs279py5646 (2023-11-20T00:24:56.168Z) - druid:vw256xg3924 (2023-11-20T00:32:05.945Z))

Next steps will not be in this ticket.