ualbertalib / pushmi_pullyu

Ruby application to manage flow of content from Fedora into Swift for preservation
MIT License
1 stars 3 forks source link

[PMPY] PushmiPullyu::AIP::Downloader::NoFileSets issue #178

Open weiweishi opened 5 years ago

weiweishi commented 5 years ago

Some files failed the PMPY process, even after the proper wait time, and the file is available in ERA. For example, this thesis object: https://era.library.ualberta.ca/items/ccdb076d-efa9-44cc-a268-99f5018b2404 (under embargo) failed to be preserved, with the following error message:

CRITICAL - (2 errors in check_logfiles.protocol-2018-11-30-12-09-36) - E, [2018-11-30T12:06:10.019792 #31900] ERROR -- : PushmiPullyu::AIP::Downloader::NoFileSets (PushmiPullyu::AIP::Downloader::NoFileSets) ...

Detailed report is at: https://rollbar.com/ualbertalib/pushmi_pullyu/items/42/occurrences/58912276856/

Investigation is needed to figure out why PMPY failed the preservation job.

murny commented 5 years ago

Here's what the stacktrace looks like for this kinda error (and from looking at the logs all the occurrences look like this):

D, [2018-12-05T09:45:37.098568 #2945] DEBUG -- : 14d2d54f-4cd4-492f-a7bb-245d578f9da7: Creating directories ...
D, [2018-12-05T09:45:37.106585 #2945] DEBUG -- : 14d2d54f-4cd4-492f-a7bb-245d578f9da7: Creating directories done
I, [2018-12-05T09:45:37.106829 #2945]  INFO -- : 14d2d54f-4cd4-492f-a7bb-245d578f9da7: Retreiving data from Fedora ...
I, [2018-12-05T09:45:37.110371 #2945]  INFO -- : 14d2d54f-4cd4-492f-a7bb-245d578f9da7: tmp/work/14d2d54f-4cd4-492f-a7bb-245d578f9da7/data/objects/metadata/object_metadata.n3 -- fetching from http://<REDACTED_FEDORA_URL>/prod/14/d2/d5/4f/14d2d54f-4cd4-492f-a7bb-245d578f9da7 ...
I, [2018-12-05T09:45:37.453008 #2945]  INFO -- : 14d2d54f-4cd4-492f-a7bb-245d578f9da7: tmp/work/14d2d54f-4cd4-492f-a7bb-245d578f9da7/data/objects/metadata/object_metadata.n3 -- saved
I, [2018-12-05T09:45:37.453486 #2945]  INFO -- : 14d2d54f-4cd4-492f-a7bb-245d578f9da7: tmp/work/14d2d54f-4cd4-492f-a7bb-245d578f9da7/data/objects/metadata/files_metadata/file_order.xml -- creating from http://<REDACTED_FEDORA_URL>/prod/14/d2/d5/4f/14d2d54f-4cd4-492f-a7bb-245d578f9da7/list_source ...
E, [2018-12-05T09:45:37.553903 #2945] ERROR -- : [Rollbar] Error calling the `transform` hook: undefined method `[]' for #<PushmiPullyu::AIP::Downloader::NoFileSets:0x00000005136d98>
I, [2018-12-05T09:45:37.554262 #2945]  INFO -- : [Rollbar] Scheduling item
I, [2018-12-05T09:45:37.554346 #2945]  INFO -- : [Rollbar] Sending item
I, [2018-12-05T09:45:37.791762 #2945]  INFO -- : [Rollbar] Success
I, [2018-12-05T09:45:37.792057 #2945]  INFO -- : [Rollbar] Details: https://rollbar.com/instance/uuid?uuid=450bb21e-bb36-407c-94fd-79b9af1b8bc5 (only available if report was successful)
E, [2018-12-05T09:45:37.792204 #2945] ERROR -- : PushmiPullyu::AIP::Downloader::NoFileSets (PushmiPullyu::AIP::Downloader::NoFileSets)
/usr/lib64/ruby/gems/2.4.0/gems/pushmi_pullyu-1.0.6/lib/pushmi_pullyu/aip/downloader.rb:222:in `member_file_set_uuids'
/usr/lib64/ruby/gems/2.4.0/gems/pushmi_pullyu-1.0.6/lib/pushmi_pullyu/aip/downloader.rb:96:in `create_and_log_file_order_list'
/usr/lib64/ruby/gems/2.4.0/gems/pushmi_pullyu-1.0.6/lib/pushmi_pullyu/aip/downloader.rb:39:in `run'
/usr/lib64/ruby/gems/2.4.0/gems/pushmi_pullyu-1.0.6/lib/pushmi_pullyu/aip.rb:13:in `create'
/usr/lib64/ruby/gems/2.4.0/gems/pushmi_pullyu-1.0.6/lib/pushmi_pullyu/cli.rb:191:in `block in run_preservation_cycle'
/usr/lib64/ruby/gems/2.4.0/gems/rollbar-2.18.0/lib/rollbar.rb:146:in `scoped'
/usr/lib64/ruby/gems/2.4.0/gems/pushmi_pullyu-1.0.6/lib/pushmi_pullyu/cli.rb:188:in `run_preservation_cycle'
/usr/lib64/ruby/gems/2.4.0/gems/pushmi_pullyu-1.0.6/lib/pushmi_pullyu/cli.rb:210:in `run_tick_loop'
/usr/lib64/ruby/gems/2.4.0/gems/pushmi_pullyu-1.0.6/lib/pushmi_pullyu/cli.rb:53:in `start_server'
/usr/lib64/ruby/gems/2.4.0/gems/pushmi_pullyu-1.0.6/lib/pushmi_pullyu/cli.rb:275:in `block in start_server_as_daemon'
/usr/lib64/ruby/gems/2.4.0/gems/daemons-1.2.6/lib/daemons/application.rb:270:in `block in start_proc'
/usr/lib64/ruby/gems/2.4.0/gems/daemons-1.2.6/lib/daemons/daemonize.rb:84:in `call_as_daemon'
/usr/lib64/ruby/gems/2.4.0/gems/daemons-1.2.6/lib/daemons/application.rb:274:in `start_proc'
/usr/lib64/ruby/gems/2.4.0/gems/daemons-1.2.6/lib/daemons/application.rb:300:in `start'
/usr/lib64/ruby/gems/2.4.0/gems/daemons-1.2.6/lib/daemons/controller.rb:56:in `run'
/usr/lib64/ruby/gems/2.4.0/gems/daemons-1.2.6/lib/daemons.rb:197:in `block in run_proc'
/usr/lib64/ruby/gems/2.4.0/gems/daemons-1.2.6/lib/daemons/cmdline.rb:92:in `catch_exceptions'
/usr/lib64/ruby/gems/2.4.0/gems/daemons-1.2.6/lib/daemons.rb:196:in `run_proc'
/usr/lib64/ruby/gems/2.4.0/gems/pushmi_pullyu-1.0.6/lib/pushmi_pullyu/cli.rb:273:in `start_server_as_daemon'
/usr/lib64/ruby/gems/2.4.0/gems/pushmi_pullyu-1.0.6/lib/pushmi_pullyu/cli.rb:33:in `run'
/usr/lib64/ruby/gems/2.4.0/gems/pushmi_pullyu-1.0.6/exe/pushmi_pullyu:7:in `<top (required)>'
/usr/bin/pushmi_pullyu:22:in `load'
/usr/bin/pushmi_pullyu:22:in `<main>'

So the code looks fine (and it's obviously working as stuff is being successfully ingested into Swift). The fedora download and database owner change works.

You can see the object_metadata.n3 -- saved getting created and saved. This means that we hit fedora, got a successful 200 request, past this data that we got from Fedora to the PushmiPullyu::AIP::OwnerEmailEditor class, which goes through it (by using RDF/Graph/Reader/etc) and changes the USER ID to the USER EMAIL via hitting Postgres (and this is successful, otherwise we would be seeing a NoOwnerPredicate error being raised)

So from this you can conclude that, fedora is giving the proper info with predicates/rdf format/etc that we all love (and postgres is giving the email address correctly).

So with that said, it fails on grabbing the member_file_set_uuids from this object_metadata.n3 file. Which either means it's having trouble parsing the information, or simply the file has no information about the file_sets (aka the predicate http://pcdm.org/models#hasMember is empty or doesn't exist).

I'm going to lean with the latter of the two which means the error is correct! It has no filesets (I guess we could add better logging/etc here to confirm this or seeing what the file looks like etc).

So why doesn't it have filesets? My guess is if you look at Sidekiq, the error count is huge and is just under the success rate. So heres the graph from past week which includes Dec 5 (which is where the above stacktrace happened): image

You can see we had 75 jobs processed but 43 failed on Dec 5th!

We don't have enough information from the dashboard to tell which jobs are succeeding and failing but from Today you can see from the retries, we got 2 FileAttachmentIngestionJob's failing: image

Which means if these FileAttachmentIngestionJob's are failing and are accounting for the large failed jobs, then yes absolutely you won't have any FileSets information in Fedora by the time PuhsmiPullyu comes to grab the file. Which would account for these stacktraces we are experiencing.

Why are these jobs failing? It seems to be a 401 error from Fedora, my guess is a bad password? It seems around Dec 2, these errors started to happen: image

I know we changed the password very recently as rollbar leaked them again (hence the PR #175)?

Perhaps one of the servers haven't received the new fedora password? As this could be why some jobs are passing and others are failing when talking to Fedora? Eventually the retries do succeed, which probably depends on the server which grabs the job from redis. As a server may have the correct or wrong credentials resulting in an error or a success

So first step would be to make sure all 3 severs which are running sidekiq for jupiter have been given the updated credentials for fedora. And focus our efforts on getting this Sidekiq error count down to zero

Hopefully the password checking solves this. And if this error in PMPY keeps happening after we got our sidekiq error count as close zero as possible? Then we can look into it again (after ruling out Sidekiq is working correctly).

weiweishi commented 5 years ago

And I have verified - out of the three application servers, one of the public facing app servers is not configured properly, with the updated username. The changes were likely done manually - instead of from an Ansible playbook as the changes look differently on each app server. I will follow up with @henryzhang87 to have this corrected properly with the ansible playbook.

weiweishi commented 5 years ago

Continue to experience this issue: CRITICAL - (2 errors in check_logfiles.protocol-2019-01-02-10-37-29) - E, [2019-01-02T10:35:01.042067 #4752] ERROR -- : PushmiPullyu::AIP::Downloader::NoFileSets (PushmiPullyu::AIP::Downloader::NoFileSets) ... Double checked the application servers and configuration is corrected on all three application servers. Will verify the configuration on PMPY server.

weiweishi commented 5 years ago

This error happened 14 times last night from 21:31pm to 21:46pm, on various items: https://rollbar.com/ualbertalib/pushmi_pullyu/items/42/?item_page=0&item_count=100&#instances

These objects are not new deposits - likely an edit. Will track down swift credentials to investigate what's been preserved if there's any.

pgwillia commented 5 years ago

Looking at a sample of these items they seem to be Reviews or Conference/Workshop Presentation.

murny commented 5 years ago

Seems like a high error rate last night too in sidekiq? Could be worth looking at those logs too

screenshot_20190104-093807

weiweishi commented 5 years ago

More investigation into this reveals that the Fedora records for these objects are missing two triples that indicating the relationship between object and fileset.

This is a normal object: image This is an object that has an NoFileSet issue: image

Note that the missing list_source and 'member` children. (should include these two)

http://mycombe.library.ualberta.ca:8080/fedora/rest/prod/b3/15/6d/89/b3156d89-e674-4f34-9390-def3dac541a1/members
http://mycombe.library.ualberta.ca:8080/fedora/rest/prod/b3/15/6d/89/b3156d89-e674-4f34-9390-def3dac541a1/list_source

Can find the relationship on the object by look up object.file_sets.first. So the connection exists through ActiveStorage (for recent deposit). But can't seem to find the same information on the Fedora object. Wondering if PMPY should use the information in ActiveStorage, which would be more reliable, to fetch the information needed, than directly fetching through Fedora.