pulibrary / bibdata

Local API for retrieving bibliographic and other useful data from Alma (Ruby 3.1.0, Rails 7.1.3.4)
BSD 2-Clause "Simplified" License

Investigate the figgy_ark_cache current behavior #2330

Closed sandbergja closed 3 months ago

sandbergja commented 4 months ago
kevinreiss commented 4 months ago

The production figgy ark cache is huge: 83 GB! See this output:

deploy@bibdata-alma1:/data/bibdata_files$ du -h
0   ./campus_access_files
83G ./figgy_ark_cache
19G ./partner_update_scratch
12G ./scsb_update_files
kevinreiss commented 4 months ago

The last five files modified in prod; nothing has changed since July 2021.

-rwxr-xr-x 1 deploy www-data  43 Jul 16  2021 figgy_princeton_edu__catalog_json
-rwxr-xr-x 1 deploy www-data 135 Jul 16  2021 ark__88435_br86b689r
-rwxr-xr-x 1 deploy www-data 135 Jul 16  2021 ark__88435_rv042x39q
-rwxr-xr-x 1 deploy www-data 135 Jul 16  2021 ark__88435_df65vc17f
-rwxr-xr-x 1 deploy www-data 135 Jul 16  2021 ark__88435_6395wb422
kevinreiss commented 4 months ago

83990 non-hidden files in the directory.

kevinreiss commented 4 months ago

Noting ALL files in the directory were created on 7/16/2021.
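
A check like the one above can be repeated from a Rails console or a plain Ruby script. The following is only a minimal sketch: `cache_summary` is a hypothetical helper, not something in bibdata, and it reports modification time (`File.mtime`) rather than creation time, on the assumption that the two match for files nothing has rewritten.

```ruby
# Hypothetical helper: summarize a cache directory by counting its
# non-hidden regular files and finding the most recent modification time.
def cache_summary(dir)
  # Dir.children omits "." and ".."; dotfiles are rejected to match "non-hidden".
  files = Dir.children(dir)
             .reject { |name| name.start_with?(".") }
             .map { |name| File.join(dir, name) }
             .select { |path| File.file?(path) }

  # newest is nil when the directory holds no matching files.
  { count: files.size, newest: files.map { |path| File.mtime(path) }.max }
end
```

Calling `cache_summary("/data/bibdata_files/figgy_ark_cache")` on a server would return the file count and the newest mtime in one hash, which is enough to tell whether the daily task has written anything recently.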

kevinreiss commented 4 months ago

This task does not deal with Alma at all. rake liberate:arks:clear_and_seed_cache is currently set to run at the same time every day on both bibdata-alma2 and bibdata-alma-worker1. I'm not sure it's impacting anything, but that seems unnecessary. Given the cache has not been updated since 2021, I don't think it's doing anything at all.

kevinreiss commented 4 months ago

Clearly the task at https://github.com/pulibrary/bibdata/blob/f83b373fab64d3dc5edc6b7efd5cb9d8d6be1407/lib/tasks/orangeindex.rake#L145 isn't clearing anything. The system is correctly configured to find the cache directory at /data/bibdata_files/figgy_ark_cache.

kevinreiss commented 4 months ago

Looks like the task requires a param to be passed to it to decide which directory it should operate on: https://github.com/pulibrary/bibdata/blob/f83b373fab64d3dc5edc6b7efd5cb9d8d6be1407/lib/tasks/orangeindex.rake#L150. Otherwise it will use /opt/bibdata/current/tmp/figgy_ark_cache as a default, which is a symlink to /opt/bibdata/shared/tmp/figgy_ark_cache. This directory exists but is empty on all four servers in prod.
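
To illustrate the fallback described above, here is a rough sketch of how a task might choose its directory. It is not the code at the linked line: `resolve_cache_dir` and the `FIGGY_ARK_CACHE_PATH` variable name are placeholders I made up, and only the default path comes from this thread.

```ruby
# Default path cited in this thread; used when no directory is supplied.
DEFAULT_CACHE_DIR = "/opt/bibdata/current/tmp/figgy_ark_cache"

# Hypothetical helper. FIGGY_ARK_CACHE_PATH is an assumed name, not
# necessarily the parameter the real rake task reads.
# env is injectable so the lookup can be exercised without touching ENV.
def resolve_cache_dir(env = ENV)
  configured = env["FIGGY_ARK_CACHE_PATH"].to_s.strip
  # A missing or blank value silently falls back to the default directory.
  configured.empty? ? DEFAULT_CACHE_DIR : configured
end
```

With this shape, a scheduled run that never passes the parameter always operates on the default directory, so the task would leave /data/bibdata_files/figgy_ark_cache untouched while still appearing to succeed.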

kevinreiss commented 4 months ago

We think this is specifically broken because of changes to the Blacklight JSON API in figgy (changed approximately two years ago). The process calls this API to build the cache. To make numismatics work with the changed API we had to fix things in this fashion: https://github.com/pulibrary/bibdata/pull/1965/files.

kevinreiss commented 4 months ago

See the results from https://github.com/pulibrary/figgy/issues/6324 and then re-assess this ticket.

maxkadel commented 3 months ago

DLS will work on this the week of May 8.

christinach commented 3 months ago

See further discussion in Slack.