scientist-softserv / adventist_knapsack


SDAPI: Script to find the identifiers of all records that made it into their hyku site #194

Open jillpe opened 4 months ago

jillpe commented 4 months ago

Summary

When they migrated their works to Hyku, roughly 20,000 records went missing (there are ~750,000 on the old site but only ~722,000 in Hyku). Katharine tried to export all of the records and their metadata from her Hyku site so she could use the identifiers to work out which records are missing, but the export is taking too long. Can we run a script to pull all of those identifiers instead? This is for the SDAPI tenant.
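
For reference, a minimal sketch of how the missing records could be found once the identifiers are exported. It assumes the old site's identifiers have also been dumped to a text file with one identifier per line; both file names below are placeholders:

require 'set'

# Hypothetical inputs: one identifier per line in each file.
old_site_ids = File.readlines('old_site_identifiers.txt', chomp: true).to_set
hyku_ids     = File.readlines('identifiers.txt', chomp: true).to_set

# Identifiers present on the old site but absent from Hyku.
missing = old_site_ids - hyku_ids
File.write('missing_identifiers.txt', missing.to_a.sort.join("\n"))
puts "#{missing.size} identifiers appear on the old site but not in Hyku"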

Acceptance Criteria

ShanaLMoore commented 4 months ago

Suggested options, per Jeremy.

ShanaLMoore commented 4 months ago

After Katharine confirms what is missing, check Fedora.
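
If the old system's export includes the Fedora ids of the works that look missing, one way to spot-check Fedora from the Rails console might be a loop like the sketch below. This assumes the works are ActiveFedora-backed, and missing_fedora_ids.txt is a hypothetical list of candidate ids:

# Sketch only: ActiveFedora::Base.exists? returns false when the object
# cannot be found in the repository.
File.readlines('missing_fedora_ids.txt', chomp: true).each do |id|
  status = ActiveFedora::Base.exists?(id) ? 'present in Fedora' : 'not found in Fedora'
  puts "#{id}: #{status}"
end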

kirkkwang commented 4 months ago

Script we used:

# Run from the Rails console. Point RSolr at the sdapi tenant's Solr core.
url = ENV['SOLR_URL'] + Account.find_by(name: 'sdapi').tenant
solr = RSolr.connect url: url

# Fetch one page of identifier_tesim values for all non-FileSet records.
def get_identifiers(solr, page = 1, rows_per_page = 100_000)
  start_index = (page - 1) * rows_per_page
  response = solr.get 'select', params: { q: '*:*', fq: '-has_model_ssim:FileSet', fl: 'identifier_tesim', rows: rows_per_page, start: start_index }
  identifiers = response['response']['docs'].map { |doc| doc['identifier_tesim'] }
  puts "#{identifiers.compact.count} identifiers found for page #{page}"
  identifiers
end

identifiers = []
page = 1
page_identifiers = get_identifiers(solr, page)

# Keep paging until Solr returns an empty page.
while page_identifiers.size > 0
  puts "Getting identifiers for page #{page}"
  identifiers.concat(page_identifiers)
  page += 1
  page_identifiers = get_identifiers(solr, page)
end

# Write the de-duplicated identifiers to a text file, one per line.
File.open('identifiers.txt', 'w') do |file|
  identifiers.flatten.uniq.compact.each { |id| file.puts id }
end
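
A caveat on the paging above: Solr's start/rows pagination gets slow at deep offsets, so the script works mainly because 100,000 rows per page keeps the number of pages small. If this ever needs to scale, a cursorMark loop is the usual alternative. A rough sketch, assuming the core's uniqueKey field is id:

cursor = '*'
ids = []
loop do
  response = solr.get 'select', params: { q: '*:*', fq: '-has_model_ssim:FileSet',
                                          fl: 'identifier_tesim', rows: 10_000,
                                          sort: 'id asc', cursorMark: cursor }
  ids.concat(response['response']['docs'].map { |doc| doc['identifier_tesim'] })
  next_cursor = response['nextCursorMark']
  break if next_cursor == cursor # done when the cursor stops advancing
  cursor = next_cursor
end
ids = ids.flatten.uniq.compact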
ShanaLMoore commented 4 months ago

https://assaydepot.slack.com/archives/C0313NJV9PE/p1709066085128159