Closed: christinach closed this 1 month ago
The bucket should only hold the last 30 days of updates, but it has updates going back to 2021. @kevinreiss will follow up about this.
The bucket for partner updates currently has over 2500 objects in it. The list_objects method for S3 returns at most 1000 objects per call, so when a bucket holds more than that, the rest are not returned. Right now we first list all the objects in the bucket, then whittle down that list based on date. Since the single call only returned old objects, we were not getting any relevant items.
The newer version of the list_objects API (ListObjectsV2, https://docs.aws.amazon.com/sdk-for-ruby/v3/api/Aws/S3/Types/ListObjectsV2Output.html) has a next_continuation_token available that lets you request more objects if there are more than 1,000.
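A minimal sketch of the pagination idea described above, not necessarily the change that was merged. The helper name recent_objects and its parameters are hypothetical. `client` is assumed to behave like Aws::S3::Client, meaning it responds to list_objects_v2 and returns pages that expose contents and next_continuation_token.

```ruby
# Hypothetical helper: walks every page of a bucket listing by following
# next_continuation_token, then keeps only objects modified on or after `since`.
# `client` is assumed to behave like Aws::S3::Client (responds to list_objects_v2).
def recent_objects(client, bucket:, since:, prefix: nil)
  objects = []
  token = nil

  loop do
    params = { bucket: bucket }
    params[:prefix] = prefix if prefix
    # Only send a continuation token once the previous page has given us one.
    params[:continuation_token] = token if token

    page = client.list_objects_v2(params)
    objects.concat(page.contents)

    # A nil token means this was the last page.
    token = page.next_continuation_token
    break if token.nil?
  end

  # Filter by date only after the full listing has been collected.
  objects.select { |object| object.last_modified >= since }
end
```

With the real SDK this could be called as recent_objects(Aws::S3::Client.new, bucket: "example-bucket", since: Time.now - 30 * 24 * 60 * 60), where the bucket name is a placeholder. Listing first and filtering afterwards keeps the existing "list, then whittle down by date" flow and only removes the 1000-object ceiling.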
I've reached out to Barak Zahavy to check the status of the proposal by HTC (SCSB's vendor) to implement 30-day file retention on the S3 directories. If they no longer plan to do so, or not soon, we should clean this up as a one-time thing.
Resolved by #2541
I checked bibdata production today (10/26/2024) and the partner updates are in place. I also checked a few records and they were indexed as expected.
Thanks @maxkadel for working on this!
@kevinreiss if there needs to be a cleanup of the SCSB directory please create a new ticket.
Expected behavior
Daily partner recap updates come into bibdata between 6:00 and 6:30 am. Each update is a zipped file from SCSB containing changed records or deleted SCSB ids.
Actual behavior
The daily partner job generates an empty JSON file.
Impact of this bug
Daily partner updates that exist in the SCSB export directory are not imported into bibdata and are not indexed in the catalog, so SCSB records are not up to date. For SCSB updates, we rely on the full dump from SCSB when we run the full reindex.
Acceptance criteria
.rubocop_todo.yml
create a new ticket to address the rubocop todo.