Debug fetch-files S3 flakiness

peetucket commented 1 week ago

The speech-to-text workflow step that sends files to S3 (fetch-files) often raises AWS exceptions, which go away on retry. Need to debug.

See https://app.honeybadger.io/projects/52894/faults/113690502

peetucket commented 4 days ago

Hmm... I tried a few times on the robot console (even sending all the files, not just two), using the same code the fetch-files robot uses (basically this method: https://github.com/sul-dlss/common-accessioning/blob/main/lib/robots/dor_repo/speech_to_text/fetch_files.rb#L13-L21), and I can't reproduce this error.

ssh common-accessioning-qa-b.stanford.edu
cd common-accessioning/current
ROBOT_ENVIRONMENT=production bin/console

druid='druid:xy788dm4642'

object_client = Dor::Services::Client.object(druid)
cocina_object = object_client.find
druid_object = DruidTools::Druid.new(druid, Settings.stacks.local_workspace_root)
speech_to_text = Dor::TextExtraction::SpeechToText.new(cocina_object:)
file_fetcher = Dor::TextExtraction::FileFetcher.new(druid:, logger: nil)
aws_provider ||= Dor::TextExtraction::AwsProvider.new(region: Settings.aws.region, access_key_id: Settings.aws.access_key_id, secret_access_key: Settings.aws.secret_access_key)

files = Dir.glob('/dor/assembly/xy/788/dm/4642/xy788dm4642/content/*').map{|f| f.delete_prefix('/dor/assembly/xy/788/dm/4642/xy788dm4642/content/')}
=> ["video_1.mp4", "video_1.mpeg", "video_1_thumb.jp2", "video_2.mp4", "video_2.mpeg", "video_2_thumb.jp2", "video_log.txt"]

files.each do |filename|
  raise "Unable to fetch #{filename} for #{druid}" unless file_fetcher.write_file_with_retries(
    filename:,
    location: aws_provider.bucket.object(speech_to_text.s3_location(filename)),
    max_tries: 3
  )
end

I, [2024-10-22T16:16:02.060545 #981603]  INFO -- : fetching video_1.mp4 for druid:xy788dm4642 and sending to sul-speech-to-text-staging
I, [2024-10-22T16:16:02.500834 #981603]  INFO -- : fetching video_1.mpeg for druid:xy788dm4642 and sending to sul-speech-to-text-staging
I, [2024-10-22T16:16:02.990891 #981603]  INFO -- : fetching video_1_thumb.jp2 for druid:xy788dm4642 and sending to sul-speech-to-text-staging
I, [2024-10-22T16:16:03.319087 #981603]  INFO -- : fetching video_2.mp4 for druid:xy788dm4642 and sending to sul-speech-to-text-staging
I, [2024-10-22T16:16:03.618308 #981603]  INFO -- : fetching video_2.mpeg for druid:xy788dm4642 and sending to sul-speech-to-text-staging
I, [2024-10-22T16:16:04.117765 #981603]  INFO -- : fetching video_2_thumb.jp2 for druid:xy788dm4642 and sending to sul-speech-to-text-staging
I, [2024-10-22T16:16:04.455584 #981603]  INFO -- : fetching video_log.txt for druid:xy788dm4642 and sending to sul-speech-to-text-staging
=> ["video_1.mp4", "video_1.mpeg", "video_1_thumb.jp2", "video_2.mp4", "video_2.mpeg", "video_2_thumb.jp2", "video_log.txt"]
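
Since the exceptions go away on retry, it may help to record exactly what each failed attempt raised before it is retried. Below is a minimal sketch of a retry wrapper that logs the exception class and message of every swallowed failure. The helper name `with_logged_retries` and its parameters are hypothetical and are not part of common-accessioning or of `write_file_with_retries`; the failure in the usage lines is simulated.

```ruby
require 'logger'

# Hypothetical helper (not from common-accessioning): retries the block up to
# max_tries times and logs the class and message of each exception it rescues.
def with_logged_retries(max_tries:, logger:, base_delay: 0)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue StandardError => e
    logger.warn("attempt #{tries}/#{max_tries} failed: #{e.class}: #{e.message}")
    raise if tries >= max_tries # out of attempts: re-raise the last error

    sleep(base_delay * tries) if base_delay.positive? # simple linear backoff
    retry
  end
end

# Usage with a simulated failure on the first two attempts.
logger = Logger.new($stdout)
with_logged_retries(max_tries: 3, logger: logger) do |attempt|
  raise IOError, 'simulated S3 failure' if attempt < 3

  :uploaded
end
```

If something like this wrapped the upload call, the log would show which exception class was raised on each failed attempt, rather than only the final outcome.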

peetucket commented 2 days ago

Similarly, running the robot on the console also works as expected:

ssh common-accessioning-qa-b.stanford.edu
cd common-accessioning/current

bin/run_robot --druid druid:xy788dm4642 --environment production SpeechToText::FetchFiles

2024-10-24T20:55:56.807Z pid=1050132 tid=ml2s INFO: druid:xy788dm4642 processing fetch-files (speechToTextWF)
2024-10-24T20:55:57.294Z pid=1050132 tid=ml2s INFO: fetching video_1.mp4 for druid:xy788dm4642 and sending to sul-speech-to-text-staging
2024-10-24T20:55:57.726Z pid=1050132 tid=ml2s INFO: fetching video_2.mp4 for druid:xy788dm4642 and sending to sul-speech-to-text-staging
2024-10-24T20:55:58.073Z pid=1050132 tid=ml2s INFO: Finished druid:xy788dm4642 in 1.2171s

Tried a few times in a row... no errors

peetucket commented 2 days ago

Try resetting the step to 'waiting' to force Sidekiq to run it:

ssh common-accessioning-qa-b.stanford.edu
cd common-accessioning/current
ROBOT_ENVIRONMENT=production bin/console

workflow_client = LyberCore::WorkflowClientFactory.build(logger: nil);nil
workflow_client.update_status(druid: 'druid:xy788dm4642', workflow: 'speechToTextWF', process: 'fetch-files', status: 'waiting')

peetucket commented 1 day ago

It seems to work every time when the robot is run from the console or the code is executed there, but it fails intermittently when run by Sidekiq. This may be a clue (possibly a coincidence, but I have never seen it fail except when run by Sidekiq).
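
One difference between the console and Sidekiq is that Sidekiq runs jobs on several threads within one process, while the console runs were single-threaded. Whether that is related to these failures is unconfirmed. A rough way to test the idea from the console would be to run the same work on several threads at once and collect whatever each thread raises. The sketch below does that; `run_concurrently` is a hypothetical helper, not part of common-accessioning, and the failures in the usage lines are simulated.

```ruby
# Hypothetical harness (not from common-accessioning): runs the same block on
# thread_count threads at once and returns [index, exception] pairs for every
# thread whose block raised.
def run_concurrently(thread_count:, &work)
  errors = Queue.new # thread-safe collection of failures
  threads = Array.new(thread_count) do |index|
    Thread.new do
      work.call(index)
    rescue StandardError => e
      errors << [index, e]
    end
  end
  threads.each(&:join)
  Array.new(errors.size) { errors.pop }
end

# Usage with simulated failures on odd-numbered threads.
failures = run_concurrently(thread_count: 5) do |index|
  raise IOError, "simulated failure in thread #{index}" if index.odd?
end
failures.each { |index, error| puts "thread #{index}: #{error.class}: #{error.message}" }
```

In the robot console, the block would instead contain the `write_file_with_retries` call from the earlier console session. If errors only show up with more than one thread, that would point toward shared state rather than S3 itself.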

sul-dlss / common-accessioning

Debug fetch-files S3 flakiness #1392