pulibrary / figgy

Valkyrie-based digital repository backend.

Save and Ingest is taking Figgy down. #6395

Closed tpendragon closed 4 months ago

tpendragon commented 4 months ago

Problem

Each call to find the save and ingest path is taking 5-8 minutes. Today we had downtime because 24 of these calls came in during a 5-minute span, using up all the available Passenger threads and making Figgy unavailable.

We should see if we can get the directory traversal time way down. Benchmarks of various methods are in the Benchmarks section below.

Success Criteria

The time to call #save_and_ingest is cut in half, at least, and many save_and_ingest calls at once don't crash Figgy.

Implementation Details

None of the benchmarks below are fast enough, so this will be a three-step process:

  1. Check the resource allocation on the web boxes, see if we can increase the number of processes per box.

  2. Add a cache mechanism that builds a hash of base names to directories for the whole directory structure, like { "123" => ["/mnt/hydra_sources/studio_new/DPUL/123"]}, and stores it at a key with Rails.cache. The IngestFolderLocator should then check that cache, verify that the folder it gets back still exists, and return it if so. If the cache has no result, or the cached folder no longer exists, it should re-warm the cache and try again.

  3. Bust and regenerate the cache periodically, perhaps with a cron job.
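
The caching steps above could look roughly like the following. This is a minimal sketch under stated assumptions, not Figgy's implementation: CachedFolderLocator, MemoryCache, and the "ingest_folder_index" key are hypothetical names, and MemoryCache only stands in for Rails.cache (it exposes just read and write) so the sketch can run outside Rails.

```ruby
require "find"

# Hypothetical stand-in for Rails.cache. Only #read and #write are used,
# which are also method names on ActiveSupport::Cache::Store.
class MemoryCache
  def initialize
    @store = {}
  end

  def read(key)
    @store[key]
  end

  def write(key, value)
    @store[key] = value
  end
end

# Hypothetical locator; the real IngestFolderLocator may be shaped differently.
class CachedFolderLocator
  CACHE_KEY = "ingest_folder_index" # assumed key name

  def initialize(root:, cache:)
    @root = root
    @cache = cache
  end

  # Returns a cached directory for folder_name that still exists on disk.
  # On a miss or a stale entry, re-warms the cache once and tries again.
  def find(folder_name)
    result = lookup(folder_name)
    return result if result

    warm!
    lookup(folder_name)
  end

  # One full traversal: maps each directory's base name to its full paths,
  # e.g. { "123" => ["/some/root/DPUL/123"] }.
  def warm!
    index = {}
    Find.find(@root) do |path|
      next unless File.directory?(path)

      (index[File.basename(path)] ||= []) << path
    end
    @cache.write(CACHE_KEY, index)
    index
  end

  private

  # Reads the cached index and keeps only a path that is still a directory.
  def lookup(folder_name)
    index = @cache.read(CACHE_KEY) || {}
    (index[folder_name] || []).find { |path| File.directory?(path) }
  end
end
```

In the app, Rails.cache could presumably be passed as cache:, and the periodic regeneration in step 3 would amount to calling warm! from a scheduled job. A lookup that misses still pays for one full traversal, so this only helps when most requested folders are already in the index.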

Benchmarks

Dir.glob

2 minutes, 4 seconds

irb(main):002:1* Benchmark.measure do
irb(main):003:2*   Dir.glob("/mnt/hydra_sources/studio_new/dpul/**/*").find do |f|
irb(main):004:2*     f.to_s.end_with?("9968220413506421")
irb(main):005:1*   end
irb(main):006:0> end
=> #<Benchmark::Tms:0x00007ff1dbfcd3f0 @cstime=0.0, @cutime=0.0, @label="", @real=123.4377756959293, @stime=6.550639, @total=8.141238, @utime=1.5905989999999992>

Find.find without a directory test

2 minutes, 13 seconds

irb(main):008:1* Benchmark.measure do
irb(main):009:2*   Find.find("/mnt/hydra_sources/studio_new/dpul").find do |path|
irb(main):010:2*     path.end_with?("9968220413506421")
irb(main):011:1*   end
irb(main):012:0> end
=>
#<Benchmark::Tms:0x00007ff1e15a7898
 @cstime=0.0,
 @cutime=0.0,
 @label="",
 @real=133.66980222892016,
 @stime=7.694535999999999,
 @total=13.035001,
 @utime=5.340465>

Existing Implementation

2 minutes, 10 seconds.

irb(main):013:1* Benchmark.measure do
irb(main):014:2*   Find.find("/mnt/hydra_sources/studio_new/dpul").find do |path|
irb(main):015:2*     FileTest.directory?(path) && path.split("/").last == "9968220413506421"
irb(main):016:1*   end
irb(main):017:0> end
=>
#<Benchmark::Tms:0x00007ff1e09734a0
 @cstime=0.0,
 @cutime=0.0,
 @label="",
 @real=130.30044317501597,
 @stime=9.830485,
 @total=15.917918,
 @utime=6.087433000000001>
irb(main):018:0>

There might not be a way to make this faster.
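
To rerun the three approaches above against the same tree in one go, a small harness along these lines might help. It is only a sketch: STRATEGIES and compare_strategies are hypothetical names, and the lambdas restate the benchmarked snippets with the root and folder name passed in as arguments.

```ruby
require "benchmark"
require "find"

# Each lambda takes (root, name) and returns the first matching path or nil.
STRATEGIES = {
  "Dir.glob, full listing" => lambda { |root, name|
    Dir.glob(File.join(root, "**", "*")).find { |f| f.end_with?(name) }
  },
  "Find.find, no directory test" => lambda { |root, name|
    Find.find(root).find { |path| path.end_with?(name) }
  },
  "Find.find, directory test" => lambda { |root, name|
    Find.find(root).find do |path|
      FileTest.directory?(path) && path.split("/").last == name
    end
  }
}.freeze

# Runs every strategy once and records wall-clock seconds plus the result.
def compare_strategies(root, name)
  STRATEGIES.map do |label, strategy|
    result = nil
    seconds = Benchmark.realtime { result = strategy.call(root, name) }
    { label: label, seconds: seconds, result: result }
  end
end
```

Calling compare_strategies("/mnt/hydra_sources/studio_new/dpul", "9968220413506421") returns one row per strategy. Run order can matter, since the operating system may cache directory entries between runs.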

Sudden Priority Justification

This has taken us away from what we were working on three times in the last week.

hackartisan commented 4 months ago

I tried another approach, with no improvement. (Update: this does find the directory if it's there, though, and I think we should switch to it just because it's simpler than the find loop.)

irb(main):010:1* Benchmark.measure do
irb(main):011:1*   d = Dir.glob("/mnt/hydra_sources/studio_new/dpul/**/9968220413506421")
irb(main):012:0> end
=> #<Benchmark::Tms:0x00007f540b3d3400 @cstime=0.0, @cutime=0.0, @label="", @real=128.38275689515285, @stime=4.813298, @total=5.91087, @utime=1.0975720000000004>
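
If the locator did switch to this glob form, it might look something like the sketch below. locate_ingest_folder is a hypothetical name, and the sketch assumes folder names are plain identifiers with no glob metacharacters.

```ruby
# Hypothetical helper: returns the first directory named folder_name
# anywhere under root, or nil if there is none.
def locate_ingest_folder(root, folder_name)
  # "**" recurses through subdirectories; the directory check skips any
  # regular files that happen to share the name.
  Dir.glob(File.join(root, "**", folder_name)).find { |path| File.directory?(path) }
end
```

One difference from the Find.find loop: Dir.glob does not descend into dot-directories unless File::FNM_DOTMATCH is passed.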

hackartisan commented 4 months ago

During another Figgy downtime event, we went to a web prod box and looked at the current Passenger requests.

First, run sudo passenger-status to ensure Passenger is running and healthy. It should show 5 pids.

Then run sudo passenger-status --show=requests > passenger_status_[timestamp].txt

If you look in that file, the currently active requests have "initialized" : true for a couple of the values. The 5 that were active at the time were all save_and_ingest paths.

hackartisan commented 4 months ago

Note that the create form checks for save and ingest for any new resource, whether or not the ingester is actually intending to use it. A fix could involve changing the way that page pre-loads the path.

tpendragon commented 4 months ago

https://github.com/pulibrary/figgy/pull/6397 has a proof of concept for a change to the "Save and Ingest" feature which instead requires a button click to initiate the search, now that it's too expensive to do on input change.

tpendragon commented 4 months ago

We're going to close this ticket with #6397. We suspect that if it takes 2 minutes to find a folder, save and ingest probably isn't an effective feature, and in the future it might be better to remove it and just add a directory picker.