scientist-softserv / adventist_knapsack


SDAPI: Script to find the identifiers of all records that made it into their hyku site #194

Open jillpe opened 4 months ago

jillpe commented 4 months ago

Summary

When they migrated their works to Hyku, roughly 20,000 records went missing (there are ~750,000 on the old site but only ~722,000 in Hyku). Katharine tried to export all of the records and their metadata from her Hyku site so she could use the identifiers to work out which records are missing, but the export is taking too long. Can we run a script to pull all of those identifiers instead? This is for the SDAPI tenant.
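
For reference, a minimal sketch of how the missing records could be found once the identifiers are exported. It assumes the old site's identifiers have also been dumped to a text file with one identifier per line; both file names below are placeholders:

require 'set'

# Hypothetical inputs: one identifier per line in each file.
old_site_ids = File.readlines('old_site_identifiers.txt', chomp: true).to_set
hyku_ids     = File.readlines('identifiers.txt', chomp: true).to_set

# Identifiers present on the old site but absent from Hyku.
missing = old_site_ids - hyku_ids
File.write('missing_identifiers.txt', missing.to_a.sort.join("\n"))
puts "#{missing.size} identifiers appear on the old site but not in Hyku"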

Acceptance Criteria

ShanaLMoore commented 4 months ago

Suggested options, per Jeremy.

ShanaLMoore commented 4 months ago

After Katharine confirms what is missing, check Fedora.
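
If the old system's export includes the Fedora ids of the works that look missing, one way to spot-check Fedora from the Rails console might be a loop like the sketch below. This assumes the works are ActiveFedora-backed, and missing_fedora_ids.txt is a hypothetical list of candidate ids:

# Sketch only: ActiveFedora::Base.exists? returns false when the object
# cannot be found in the repository.
File.readlines('missing_fedora_ids.txt', chomp: true).each do |id|
  status = ActiveFedora::Base.exists?(id) ? 'present in Fedora' : 'not found in Fedora'
  puts "#{id}: #{status}"
end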

kirkkwang commented 4 months ago

Script we used:

# Run from the Rails console. Point RSolr at the sdapi tenant's Solr core.
url = ENV['SOLR_URL'] + Account.find_by(name: 'sdapi').tenant
solr = RSolr.connect url: url

# Fetch one page of identifier_tesim values for all non-FileSet records.
def get_identifiers(solr, page = 1, rows_per_page = 100_000)
  start_index = (page - 1) * rows_per_page
  response = solr.get 'select', params: { q: '*:*', fq: '-has_model_ssim:FileSet', fl: 'identifier_tesim', rows: rows_per_page, start: start_index }
  identifiers = response['response']['docs'].map { |doc| doc['identifier_tesim'] }
  puts "#{identifiers.compact.count} identifiers found for page #{page}"
  identifiers
end

identifiers = []
page = 1
page_identifiers = get_identifiers(solr, page)

# Keep paging until Solr returns an empty page.
while page_identifiers.size > 0
  puts "Getting identifiers for page #{page}"
  identifiers.concat(page_identifiers)
  page += 1
  page_identifiers = get_identifiers(solr, page)
end

# Write the de-duplicated identifiers to a text file, one per line.
File.open('identifiers.txt', 'w') do |file|
  identifiers.flatten.uniq.compact.each { |id| file.puts id }
end
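
A caveat on the paging above: Solr's start/rows pagination gets slow at deep offsets, so the script works mainly because 100,000 rows per page keeps the number of pages small. If this ever needs to scale, a cursorMark loop is the usual alternative. A rough sketch, assuming the core's uniqueKey field is id:

cursor = '*'
ids = []
loop do
  response = solr.get 'select', params: { q: '*:*', fq: '-has_model_ssim:FileSet',
                                          fl: 'identifier_tesim', rows: 10_000,
                                          sort: 'id asc', cursorMark: cursor }
  ids.concat(response['response']['docs'].map { |doc| doc['identifier_tesim'] })
  next_cursor = response['nextCursorMark']
  break if next_cursor == cursor # done when the cursor stops advancing
  cursor = next_cursor
end
ids = ids.flatten.uniq.compact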
ShanaLMoore commented 4 months ago

https://assaydepot.slack.com/archives/C0313NJV9PE/p1709066085128159