pulibrary / bibdata

Local API for retrieving bibliographic and other useful data from Alma (Ruby 3.2.0, Rails 7.1.3.4)
BSD 2-Clause "Simplified" License

Daily partner updates are saved as an empty json #2538

christinach closed this issue 1 month ago

christinach commented 1 month ago

Expected behavior

Daily partner ReCAP updates arrive in bibdata between 6:00 and 6:30 am. Each update is a zipped file from SCSB containing changed records or deleted SCSB ids.

Actual behavior

The daily partner job generates an empty JSON file.

Impact of this bug

Daily partner updates that exist in the SCSB export directory are not imported into bibdata and are not indexed in the catalog, so SCSB records are not up to date. For SCSB updates, we currently rely on the full dump from SCSB that we load when we run the full reindex.

Acceptance criteria

maxkadel commented 1 month ago

The bucket should only hold the last 30 days of updates, but it has updates going back to 2021. @kevinreiss will follow up about this.

maxkadel commented 1 month ago

The bucket for partner updates currently has over 2,500 objects in it. The list_objects method for S3 returns at most 1,000 objects per call, so when a bucket holds more than that, the remaining objects are simply not returned. Right now we first list all the objects in the bucket, then whittle down that list based on date. Since the single call only returned old objects, the date filter left us with no relevant items.

The newer version of the list_objects API (list_objects_v2) has a next_continuation_token available that lets you request more objects if there are more than 1,000.
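As a rough illustration of the pagination described above (a sketch, not the code merged for this issue), the loop below keeps calling list_objects_v2 and passing along next_continuation_token until the response is no longer truncated, and only then filters by date. The helper names `list_all_objects` and `modified_since` are hypothetical; `client` is assumed to behave like an `Aws::S3::Client` from the aws-sdk-s3 gem.

```ruby
# Hypothetical helper (not from bibdata): collects every object in a bucket
# by following next_continuation_token across list_objects_v2 pages.
# `client` is assumed to respond like Aws::S3::Client (aws-sdk-s3 gem).
def list_all_objects(client, bucket:, prefix: nil)
  objects = []
  token = nil

  loop do
    params = { bucket: bucket }
    params[:prefix] = prefix if prefix
    # Only send the token once S3 has handed one back
    params[:continuation_token] = token if token

    response = client.list_objects_v2(params)
    objects.concat(response.contents)

    # is_truncated is true while more pages remain
    break unless response.is_truncated

    token = response.next_continuation_token
  end

  objects
end

# Hypothetical helper: apply the date filter only after all pages are collected,
# so recent objects beyond the first 1,000 are not missed.
def modified_since(objects, cutoff)
  objects.select { |object| object.last_modified >= cutoff }
end
```

With a real client this might be called as `modified_since(list_all_objects(Aws::S3::Client.new, bucket: "example-bucket"), Time.now - 24 * 60 * 60)`, where the bucket name is a placeholder. Cleaning old files out of the bucket would shrink the listing, but paginating keeps the job correct regardless of how many objects accumulate.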

kevinreiss commented 1 month ago

I've reached out to Barak Zahavy to check the status of the proposal by HTC (SCSB's vendor) to implement 30 day file retention on the S3 directories. If there are no plans to do so any longer or soon, we should clean this up as a one time thing.

On Fri, Oct 25, 2024 at 1:38 PM Max Kadel wrote:

The newer version of the list_objects API (https://docs.aws.amazon.com/sdk-for-ruby/v3/api/Aws/S3/Types/ListObjectsV2Output.html) has a next_continuation_token available that lets you request more objects if there are more than 1,000.

christinach commented 1 month ago

Resolved by #2541

I checked bibdata production today (10/26/2024), and the partner updates are in place. I also checked a few records, and they were indexed as expected.

Thanks @maxkadel for working on this!

@kevinreiss, if the SCSB directory needs to be cleaned up, please create a new ticket.