Open jillpe opened 4 months ago
Suggested options, per Jeremy:
After Katharine confirms what is missing, check Fedora.
Script we used:
url = ENV['SOLR_URL'] + Account.find_by(name: 'sdapi').tenant
solr = RSolr.connect url: url
# Fetch one page of identifier_tesim values for every non-FileSet document.
def get_identifiers(page = 1, rows_per_page = 100_000, solr)
  start_index = (page - 1) * rows_per_page
  response = solr.get 'select', params: { q: '*:*', fq: '-has_model_ssim:FileSet', fl: 'identifier_tesim', rows: rows_per_page, start: start_index }
  identifiers = response['response']['docs'].map { |doc| doc['identifier_tesim'] }
  # Report the page that was actually fetched (the original printed page + 1).
  puts "#{identifiers.compact.count} identifiers found for page #{page}"
  identifiers
end
identifiers = []
page = 1
page_identifiers = get_identifiers(page, 100_000, solr)
while page_identifiers.size > 0
  puts "Getting identifiers for page #{page}"
  identifiers << page_identifiers
  page += 1
  page_identifiers = get_identifiers(page, 100_000, solr)
end
# create txt file and add identifiers to it
File.open('identifiers.txt', 'w') do |file|
  identifiers.flatten.uniq.compact.each { |id| file.puts id }
end
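Paging with a growing `start` offset tends to get slower as the offset gets large. One possible alternative is Solr's cursorMark paging. The following is a minimal sketch only, not what was run: the method name `all_identifiers` is hypothetical, and it assumes the collection's uniqueKey field is `id` (cursorMark requires a sort that includes the uniqueKey) and that `solr` behaves like the RSolr client above.

```ruby
# Sketch: collect identifier_tesim values using Solr cursorMark paging.
# `solr` is any object that responds to `get` like an RSolr client.
# Assumption: `id` is the uniqueKey of the collection.
def all_identifiers(solr, rows: 10_000)
  identifiers = []
  cursor = '*' # '*' asks Solr to start a new cursor

  loop do
    response = solr.get 'select', params: {
      q: '*:*',
      fq: '-has_model_ssim:FileSet',
      fl: 'identifier_tesim',
      rows: rows,
      sort: 'id asc',
      cursorMark: cursor
    }

    # identifier_tesim is multi-valued; documents without it contribute nothing.
    response['response']['docs'].each do |doc|
      identifiers.concat(Array(doc['identifier_tesim']))
    end

    # Solr returns the same cursor again once there are no more results.
    next_cursor = response['nextCursorMark']
    break if next_cursor.nil? || next_cursor == cursor

    cursor = next_cursor
  end

  identifiers.uniq
end
```

With the connection above this would be called as `all_identifiers(solr)`, and the result could be written to `identifiers.txt` the same way as before.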
Summary
When they migrated their works to Hyku, ~20,000 records ended up missing (there are 750,000 on the old site and only 722,000 in Hyku). Katharine tried to export all of the records and their metadata from her Hyku site to get their identifiers, so that she could find the missing records. The export is taking too long. Can we run a script to find all of those records? This is for the SDAPI tenant.
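Once the Hyku identifiers are in a text file, finding the missing records comes down to a set difference against the old site's identifiers. A minimal sketch, assuming the old site's identifiers can be exported one per line: the file names `old_site_identifiers.txt` and `missing_identifiers.txt` and both helper names are hypothetical, and `identifiers.txt` is the file written by the script in this issue.

```ruby
require 'set'

# Hypothetical helper: read one identifier per line, skipping blank lines.
def read_identifiers(path)
  File.readlines(path, chomp: true).map(&:strip).reject(&:empty?)
end

# Hypothetical helper: identifiers in the source list that are absent from Hyku.
def missing_identifiers(source_ids, hyku_ids)
  present = hyku_ids.to_set
  source_ids.uniq.reject { |id| present.include?(id) }
end

source_path = 'old_site_identifiers.txt' # assumed export from the old site
hyku_path = 'identifiers.txt'            # written by the Solr script

# Only run the comparison when both input files are available.
if File.exist?(source_path) && File.exist?(hyku_path)
  missing = missing_identifiers(read_identifiers(source_path), read_identifiers(hyku_path))
  File.write('missing_identifiers.txt', missing.join("\n"))
  puts "#{missing.size} identifiers missing from Hyku"
end
```

Using a `Set` for the Hyku side keeps each lookup cheap, which matters with several hundred thousand identifiers on each side.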
Acceptance Criteria