Open jefferya opened 7 months ago
New items found in the last two weeks:
Rough query to find items without an attached file (some may be on purpose):
results = ActiveRecord::Base.connection.execute("SELECT i.id FROM items AS i WHERE NOT EXISTS (SELECT 1 FROM active_storage_attachments AS a WHERE record_type = 'Item' and i.id = a.record_id) order by i.id")
Rough query to find theses without an attached file (some may be on purpose):
results = ActiveRecord::Base.connection.execute("SELECT i.id FROM theses AS i WHERE NOT EXISTS (SELECT 1 FROM active_storage_attachments AS a WHERE record_type = 'Thesis' and i.id = a.record_id) order by i.id")
Find items with the same title as a record without an attachment:
ActiveRecord::Base.connection.execute("SELECT concat('https://era.library.ualberta.ca/items/', i2.id), i2.id, i2.title, 'Item', depositor, member_of_paths, ingest_batch, created_at, updated_at FROM items as i2 WHERE i2.title in (SELECT i.title FROM items AS i WHERE NOT EXISTS (SELECT 1 FROM active_storage_attachments AS a WHERE record_type = 'Item' and i.id = a.record_id)) order by i2.title")
Find theses with the same title as a record without an attachment:
ActiveRecord::Base.connection.execute("SELECT concat('https://era.library.ualberta.ca/items/', i2.id), i2.id, i2.title, 'Thesis', depositor, member_of_paths, ingest_batch, created_at, updated_at FROM theses as i2 WHERE i2.title in (SELECT i.title FROM theses AS i WHERE NOT EXISTS (SELECT 1 FROM active_storage_attachments AS a WHERE record_type = 'Thesis' and i.id = a.record_id)) order by i2.title")
Find items with the same title as a record without an attachment, adding the name of the attached file (when present) to help:
results = ActiveRecord::Base.connection.execute("SELECT concat('https://era.library.ualberta.ca/items/', i2.id), i2.id, i2.title, as_b.filename, 'Item', i2.depositor, i2.ingest_batch, i2.created_at, i2.updated_at, i2.member_of_paths FROM items as i2 LEFT OUTER JOIN active_storage_attachments as as_a on i2.id=as_a.record_id and as_a.record_type='Item' LEFT OUTER JOIN active_storage_blobs as as_b ON as_b.id=as_a.blob_id WHERE i2.title in (SELECT i.title FROM items AS i WHERE NOT EXISTS (SELECT 1 FROM active_storage_attachments AS a WHERE record_type = 'Item' and i.id = a.record_id)) order by i2.title, as_b.filename")
CSV.open('/era_tmp/delete_me_items.csv', 'wb') do |csv|
  results.each do |row|
    csv << row.values
  end
end
Find theses with the same title as a record without an attachment, adding the name of the attached file (when present) to help:
results = ActiveRecord::Base.connection.execute("SELECT concat('https://era.library.ualberta.ca/items/', i2.id), i2.id, i2.title, as_b.filename, 'Thesis', i2.depositor, i2.ingest_batch, i2.created_at, i2.updated_at, i2.member_of_paths FROM theses as i2 LEFT OUTER JOIN active_storage_attachments as as_a on i2.id=as_a.record_id and as_a.record_type='Thesis' LEFT OUTER JOIN active_storage_blobs as as_b ON as_b.id=as_a.blob_id WHERE i2.title in (SELECT i.title FROM theses AS i WHERE NOT EXISTS (SELECT 1 FROM active_storage_attachments AS a WHERE record_type = 'Thesis' and i.id = a.record_id)) order by i2.title, as_b.filename")
CSV.open('/era_tmp/delete_me_theses.csv', 'wb') do |csv|
  results.each do |row|
    csv << row.values
  end
end
Notes:
Records like the following will likely need additional metadata to avoid Google labeling resources with similar metadata but different file attachments as "Duplicate, Google chose different canonical than user". Some examples:
Updated list sent to the ERA service team for review
Related to #3289
When the sitemap filter is applied to the Google Search Console report "Duplicate without user-selected canonical", three items appear where Google thinks the content is similar to another item in the sitemap. In the Google Search Console URL inspection, the "User-declared canonical" and "Google-selected canonical" appear very similar. E-mail sent to the erahelp team for advice (Jan 31; ref. https://github.com/ualbertalib/jupiter/issues/3289#issuecomment-1887667840).
The next week, the Google Search Console reported additional items. These items seem to be older (i.e., not added in the last week).
Question: is there a way to test for duplicates more efficiently than Google?
Attempt 1: use the checksum column of the Active Storage database table active_storage_blobs to verify that each attachment is unique, thereby finding any duplicate items. The number of blobs and attachments seems high relative to the number of items and theses. Could this be related to #3248?
Let's test: each active_storage_blob should appear only once for each unique attachment, right?
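One way to run that test is to group blobs by checksum and keep only the checksums that occur more than once. This is a rough sketch rather than the query originally used; the method name is made up for illustration, and the table and column names assume the default Active Storage schema.

```ruby
# Hypothetical helper: SQL listing every checksum shared by more than one blob.
# Assumes the default Active Storage table active_storage_blobs.
def duplicate_checksums_sql
  <<~SQL
    SELECT checksum, COUNT(*) AS blob_count
    FROM active_storage_blobs
    GROUP BY checksum
    HAVING COUNT(*) > 1
    ORDER BY blob_count DESC
  SQL
end

# Only executes where ActiveRecord is loaded (e.g., a Rails console).
if defined?(ActiveRecord::Base)
  results = ActiveRecord::Base.connection.execute(duplicate_checksums_sql)
  results.each { |row| puts row.values.join(', ') }
end
```

If every attachment were unique, this would return no rows.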
Why are there so many blobs with the same checksum? Let's filter the attachment count by record_type.
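A breakdown by record_type might look like the sketch below, which compares the number of attachments with the number of distinct checksums for each type. The method name is illustrative, and the join assumes the default Active Storage relationship between attachments and blobs (blob_id).

```ruby
# Hypothetical helper: attachment and distinct-checksum counts per record_type.
def attachment_counts_by_record_type_sql
  <<~SQL
    SELECT a.record_type,
           COUNT(*) AS attachment_count,
           COUNT(DISTINCT b.checksum) AS distinct_checksums
    FROM active_storage_attachments AS a
    JOIN active_storage_blobs AS b ON b.id = a.blob_id
    GROUP BY a.record_type
    ORDER BY a.record_type
  SQL
end

# Only executes where ActiveRecord is loaded (e.g., a Rails console).
if defined?(ActiveRecord::Base)
  results = ActiveRecord::Base.connection.execute(attachment_counts_by_record_type_sql)
  results.each { |row| puts row.values.join(', ') }
end
```

A record type whose attachment_count is larger than its distinct_checksums has at least one file attached more than once.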
Why the decrease in numbers? Maybe due to the filter on record_type? The answer seems to be "yes", judging from the below.
In the list of duplicated checksums, let's find all the record_ids that have attachments to a duplicated checksum (Item or Thesis record types with attachment name = "file"). This output will return draft items if they are attached to a duplicate checksum.
Let's filter out the DraftItem and DraftThesis records.
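The two steps above could be sketched as a single query builder with a switch. With exclude_drafts: false, a checksum counts as duplicated if any two blobs share it, so copies held by drafts still count. With exclude_drafts: true, only blobs attached to Item or Thesis records are counted. The method name is hypothetical, and the attachment name 'file' is taken from the text above; it should be checked against the models' attachment declarations.

```ruby
# Hypothetical helper: Item/Thesis attachments whose blob checksum is duplicated.
# ASSUMPTION: attachments use name = 'file', as stated in the text above.
def records_sharing_checksums_sql(exclude_drafts: false)
  duplicate_checksums =
    if exclude_drafts
      # Count only blobs attached to Item or Thesis records.
      <<~SQL
        SELECT b2.checksum
        FROM active_storage_attachments AS a2
        JOIN active_storage_blobs AS b2 ON b2.id = a2.blob_id
        WHERE a2.record_type IN ('Item', 'Thesis') AND a2.name = 'file'
        GROUP BY b2.checksum
        HAVING COUNT(*) > 1
      SQL
    else
      # Count every blob, including those attached to drafts.
      <<~SQL
        SELECT b2.checksum
        FROM active_storage_blobs AS b2
        GROUP BY b2.checksum
        HAVING COUNT(*) > 1
      SQL
    end

  <<~SQL
    SELECT b.checksum, a.record_type, a.record_id, b.filename
    FROM active_storage_attachments AS a
    JOIN active_storage_blobs AS b ON b.id = a.blob_id
    WHERE a.record_type IN ('Item', 'Thesis')
      AND a.name = 'file'
      AND b.checksum IN (
        #{duplicate_checksums.strip}
      )
    ORDER BY b.checksum, a.record_type, a.record_id
  SQL
end

# Only executes where ActiveRecord is loaded (e.g., a Rails console).
if defined?(ActiveRecord::Base)
  sql = records_sharing_checksums_sql(exclude_drafts: true)
  results = ActiveRecord::Base.connection.execute(sql)
  results.each { |row| puts row.values.join(', ') }
end
```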
Let's write this to a CSV file
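A small helper along these lines could write any of the result sets to CSV with a header row. The helper name and the output path are placeholders, and it expects an array of row hashes (for a query result in the Rails console, results.to_a should give that shape).

```ruby
require 'csv'

# Hypothetical helper: write an array of row hashes to a CSV file.
# The header row is taken from the keys of the first row.
def write_rows_to_csv(rows, path)
  CSV.open(path, 'wb') do |csv|
    csv << rows.first.keys unless rows.empty?
    rows.each { |row| csv << row.values }
  end
  path
end

# Example usage in a Rails console (path is a placeholder):
#   write_rows_to_csv(results.to_a, '/era_tmp/delete_me_duplicate_checksums.csv')
```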
Let's check if there are records (Item & Thesis) with multiple attachments with the same checksum (i.e., a file attached to a record multiple times):
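As a rough sketch of this check, attachments can be grouped by record and checksum, keeping only the groups with more than one attachment. The method name is illustrative, and the schema is assumed to be the Active Storage default.

```ruby
# Hypothetical helper: Item/Thesis records where the same checksum is attached
# more than once to the same record.
def repeated_attachments_per_record_sql
  <<~SQL
    SELECT a.record_type, a.record_id, b.checksum, COUNT(*) AS times_attached
    FROM active_storage_attachments AS a
    JOIN active_storage_blobs AS b ON b.id = a.blob_id
    WHERE a.record_type IN ('Item', 'Thesis')
    GROUP BY a.record_type, a.record_id, b.checksum
    HAVING COUNT(*) > 1
    ORDER BY a.record_type, a.record_id
  SQL
end

# Only executes where ActiveRecord is loaded (e.g., a Rails console).
if defined?(ActiveRecord::Base)
  results = ActiveRecord::Base.connection.execute(repeated_attachments_per_record_sql)
  results.each { |row| puts row.values.join(', ') }
end
```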
Are these intentional?
Let's output this nicely, in a format similar to the duplicate records finder:
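A report in that format might be built as sketched below, reusing the column layout of the title-based queries earlier in this issue and adding the checksum. The method name is hypothetical; the record type is looked up in a fixed table so that only Item and Thesis are accepted.

```ruby
# Hypothetical helper: report of records attached to a duplicated checksum,
# using the same columns as the title-based duplicate finder.
def duplicate_report_sql(record_type)
  # fetch raises KeyError for anything other than 'Item' or 'Thesis'.
  table = { 'Item' => 'items', 'Thesis' => 'theses' }.fetch(record_type)
  <<~SQL
    SELECT concat('https://era.library.ualberta.ca/items/', r.id) AS url,
           r.id, r.title, b.filename, b.checksum, '#{record_type}' AS record_type,
           r.depositor, r.ingest_batch, r.created_at, r.updated_at, r.member_of_paths
    FROM #{table} AS r
    JOIN active_storage_attachments AS a
      ON a.record_id = r.id AND a.record_type = '#{record_type}'
    JOIN active_storage_blobs AS b ON b.id = a.blob_id
    WHERE b.checksum IN (
      SELECT b2.checksum
      FROM active_storage_blobs AS b2
      GROUP BY b2.checksum
      HAVING COUNT(*) > 1
    )
    ORDER BY b.checksum, r.title
  SQL
end

# Only executes where ActiveRecord is loaded (e.g., a Rails console).
if defined?(ActiveRecord::Base)
  %w[Item Thesis].each do |type|
    results = ActiveRecord::Base.connection.execute(duplicate_report_sql(type))
    results.each { |row| puts row.values.join(', ') }
  end
end
```

Ordering by checksum first keeps records that share a file next to each other in the output.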
Google sheet shared: https://docs.google.com/spreadsheets/d/1khOWEk2XusG98vafWBgzACmbM-a5TR7K4Xzy1VcZl6M/edit#gid=1219983193