ualbertalib / HydraNorth

This repo is deprecated. Succeeded by https://github.com/ualbertalib/jupiter. This codebase was an institutional repository (IR) built on Samvera/Sufia.

Content updates from audit process #977

Closed pbinkley closed 7 years ago

pbinkley commented 8 years ago

Batch updates to ERA items, using Fedora REST API, to be followed by a full reindex job. (The rake job that adds ARKs will touch every item and trigger a reindexing, so it could be used as the reindex job.)

From Date Summit (Feb. 19):

Related (these have to run in the Rails environment and therefore will trigger item-level reindexing):

More to change

Complete data audit

Obsolete:

pbinkley commented 8 years ago

Note: code changes that depend on the reindexing job (i.e. that change the way items are indexed and searched, such that items that haven't been reindexed won't be found) are accumulating in the reindexathon branch, which should be merged to master and deployed only immediately before the big reindexing.

pbinkley commented 8 years ago

A demo ruby script to update a Fedora4 object in our demo environment is found here: https://gist.github.com/pbinkley/6a55cbb00da21b660880 . Given a list of noids, it deletes all creators and then adds a new creator. Each object is handled in its own transaction.
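For readers without access to the gist, the delete-then-add step it describes can be sketched as follows. This is not the gist's code: it is a minimal Python illustration that only builds the SPARQL Update body that would be sent for each object. The creator predicate (`dcterms:creator`) and the example noids are assumptions, and whether a given Fedora 4 release accepts the `OPTIONAL` clause in an update has not been verified here.

```python
# Assumed predicate; the actual HydraNorth creator predicate may differ.
DC_CREATOR = "http://purl.org/dc/terms/creator"


def escape_literal(value):
    """Escape a string for use inside a double-quoted SPARQL literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def replace_creator_update(new_creator, predicate=DC_CREATOR):
    """Build an update that removes every existing creator and adds one.

    "<>" is the relative IRI of the resource the update is sent to.
    OPTIONAL lets the INSERT run even when no creator exists yet.
    """
    literal = escape_literal(new_creator)
    return (
        f"DELETE {{ <> <{predicate}> ?old }}\n"
        f'INSERT {{ <> <{predicate}> "{literal}" }}\n'
        f"WHERE {{ OPTIONAL {{ <> <{predicate}> ?old }} }}"
    )


def plan_updates(noids, new_creator):
    """Pair each noid with its update body, one unit of work per object,
    so that each can be sent in its own transaction."""
    body = replace_creator_update(new_creator)
    return [(noid, body) for noid in noids]


if __name__ == "__main__":
    # Hypothetical noids, for illustration only.
    for noid, body in plan_updates(["aaaaaaaaa", "bbbbbbbbb"], "Doe, Jane"):
        print(noid)
        print(body)
```

Keeping the request-building separate from the HTTP calls makes it possible to review the exact update text before anything touches the repository.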

weiweishi commented 8 years ago
anayram commented 8 years ago

In case it helps, this document https://docs.google.com/spreadsheets/d/1soo3htIUbMVGzCsXoojJFXHD3Xaa-L6k_DHq12n4txI/edit#gid=0 contains the list of proquest files that were migrated into ERA.

weiweishi commented 8 years ago

Great! Thanks, Mariana. I will use this file as the source for proquest_id.

Weiwei Shi

Digital Initiative Applications Librarian University of Alberta Libraries 2-10L Cameron Library Edmonton, Alberta, Canada T6G 2J8 Phone:(780)492-7802 Fax: (780)248-1209 Email: weiwei.shi@ualberta.ca

weiweishi commented 8 years ago

More investigation is needed for objects that have been migrated but don't have a content datastream: https://docs.google.com/spreadsheets/d/1lC75myDRQXBCUQGuy9HYJf0lIkACfzu3ClJUhHenOcs/edit#gid=1245884274 These are the objects that haven't been characterized yet. @anayram @sfarnel

weiweishi commented 8 years ago

@anayram @sfarnel This is the solr query for objects that don't have a content datastream (as listed in the google sheet).

http://tottenham.library.ualberta.ca:8080/solr/hydranorth/select?q=has_model_ssim%3AGenericFile+AND+-mime_type_tesim%3A*+AND+-belongsToCommunity_ssim%3Awm117p010&sort=id+asc&rows=400&fl=id%2Ctitle_tesim%2Cdepositor_ssim%2Cfedora3uuid&wt=csv

Five of the objects in this Solr result set do have a content datastream but failed to be characterized, as listed in this sheet. I am currently working on getting the status of the objects from thesis deposit.
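To make the Solr query above easier to read and adapt, here is a small Python sketch that rebuilds the same URL from named parameters. The field names and values are copied from the URL in this comment; the function name is made up for illustration. Read literally, the query selects GenericFile records with no `mime_type_tesim` value and excludes the community `wm117p010`; why that community is excluded is not stated in this thread.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://tottenham.library.ualberta.ca:8080/solr/hydranorth/select"


def missing_content_query_url(base=SOLR_SELECT, rows=400):
    """Rebuild the query for GenericFile records lacking a MIME type."""
    params = {
        # Records of model GenericFile, with no mime_type_tesim value,
        # outside community wm117p010.
        "q": (
            "has_model_ssim:GenericFile"
            " AND -mime_type_tesim:*"
            " AND -belongsToCommunity_ssim:wm117p010"
        ),
        "sort": "id asc",
        "rows": rows,
        "fl": "id,title_tesim,depositor_ssim,fedora3uuid",
        "wt": "csv",
    }
    # safe="*" keeps the wildcard unescaped, matching the URL as posted.
    return base + "?" + urlencode(params, safe="*")


if __name__ == "__main__":
    print(missing_content_query_url())
```

Changing `rows` or `fl` here is less error-prone than editing the percent-encoded URL by hand.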

anayram commented 8 years ago

Thanks @weiweishi, will check this today.

pgwillia commented 7 years ago

FYI: these are the objects causing exceptions in the latest sitemap generation: hydranorth_exception_ids.txt (https://github.com/ualbertalib/HydraNorth/files/527755/hydranorth_exception_ids.txt)

grep sitemap production.log-2016100* | grep -E -o 'id:.{9}' > hydranorth_exception_ids.txt
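For anyone who wants the same list without duplicates, the pipeline above can be approximated in Python. This is a sketch, not what was actually run: it keeps the same assumptions as the grep command (the line mentions "sitemap" and the identifier is the 9 characters after `id:`), but it strips the `id:` prefix and keeps each identifier only once. The sample log lines in the usage section are invented.

```python
import re

# Same pattern as the grep command, with the 9-character id captured.
ID_PATTERN = re.compile(r"id:(.{9})")


def extract_exception_ids(lines):
    """Return unique ids from sitemap-related log lines, in first-seen order."""
    seen = {}
    for line in lines:
        if "sitemap" not in line:
            continue
        for noid in ID_PATTERN.findall(line):
            seen.setdefault(noid, None)
    return list(seen)


if __name__ == "__main__":
    # Hypothetical log lines; the real production.log format may differ.
    sample = [
        "ERROR sitemap: could not render id:bdz010q17 (example)",
        "INFO unrelated request id:zzzzzzzzz",
        "ERROR sitemap: could not render id:bdz010q17 (example, repeated)",
    ]
    for noid in extract_exception_ids(sample):
        print(noid)
```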
sfarnel commented 7 years ago

@anayram would you have some time to look at these? May be markup in abstracts or other oddities. Thanks!

Sharon Farnel Metadata Coordinator University of Alberta Libraries sharon.farnel@ualberta.ca 780-492-3685

anayram commented 7 years ago

@sfarnel will do. At a quick look at least two of the objects' content datastreams are not available. Will report later.

sfarnel commented 7 years ago

Thanks!

anayram commented 7 years ago

Checked 12 of the 30 items included in Tricia's report and in all cases the content file is not available in ERA (file is ok in thesisdeposit though). There is one case of a .zip file that is downloadable from ERA but seems to be corrupted (also ok in thesis deposit).

Notes can be found here: https://docs.google.com/spreadsheets/d/1lC75myDRQXBCUQGuy9HYJf0lIkACfzu3ClJUhHenOcs/edit#gid=555550382

I was not able to find a pattern that doesn't sound too crazy but here it goes:

  1. There are a few filenames containing spaces; this caused problems before with OAI pdf links. The link in thesisdeposit replaces spaces with '+' but I am not sure if this is resolved in ERA's landing pages.
  2. The rest of the unreachable files have filename patterns that contain 3 or more underscores (I know this sounds very unlikely) - could there be some bug at the moment of ingesting the file?
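The two tentative patterns above can be checked mechanically against a list of filenames. The Python sketch below only flags names that match them; it does not confirm that either pattern causes the missing files. The function name and the example filenames are made up for illustration.

```python
from urllib.parse import quote_plus


def suspicious_filename_reasons(filename):
    """List which of the two observed patterns a filename matches."""
    reasons = []
    if " " in filename:
        reasons.append("contains spaces")
    if filename.count("_") >= 3:
        reasons.append("three or more underscores")
    return reasons


if __name__ == "__main__":
    # Hypothetical filenames, not taken from the audit spreadsheet.
    for name in ["My Thesis Final.pdf", "smith_jane_2014_phd.pdf", "thesis.pdf"]:
        print(name, suspicious_filename_reasons(name))
    # Spaces rendered as '+', as described for the thesisdeposit links.
    print(quote_plus("My Thesis Final.pdf"))
```

Running this over the filenames of both reachable and unreachable items would show whether the patterns really separate the two groups.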

Note: I was not able to locate item bdz010q17

weiweishi commented 7 years ago

Spot-checked a couple of objects, and the error messages are similar, like the one below. It might relate to the description/object fields:

ERROR [line: 71] With input '"People are constantly exposed to numerical information in their physical and social environments (e.': Invalid token "\"People" (found "\"People"), production = :RDFLiteral
ERROR [line: 71] Unexpected (found "("), production = "."
ERROR [line: 71] With input 'generally superior) averaging strategy. Finally, the seeding literature paints a more positive pictur': Invalid token "generally" (found "generally"), production = :collection
ERROR [line: 71] With input 'veraging strategy. Finally, the seeding literature paints a more positive picture, demonstrating that': Invalid token "veraging" (found "veraging"), production = :predicateObjectList
ERROR [line: 71] Unexpected (found "a"), production = ")"
ERROR [line: 71] Expected one of "a", :IRIREF, :PNAME_LN, :PNAME_NS, production = :predicateObjectList
ERROR [line: 71] With input 'd on controlled processes, (b) that people use different response modes (e.g., rejection, adoption, e': Invalid token "d" (found "d"), production = :base
ERROR [line: 71] With input 'b) that people use different response modes (e.g., rejection, adoption, etc.) 
when they interact with': Invalid token "b)" (found "b)"), production = :collection
ERROR [line: 71] Expected one of "a", :IRIREF, :PNAME_LN, :PNAME_NS, production = :predicateObjectList
ERROR [line: 71] With input 'c) that this latter assertion necessitates an analysis on a response mode level to understand the rel': Invalid token "c)" (found "c)"), production = :collection
ERROR [line: 71] With input 'ssertion necessitates an analysis on a response mode level to understand the relevant phenomena, and ': Invalid token "ssertion" (found "ssertion"), production = :predicateObjectList
ERROR [line: 71] With input 'd) that seeds, advice, and anchors can be conceptualized as numerical information varying along a sou': Invalid token "d)" (found "d)"), production = :collection
ERROR [line: 71] With input 'dvice, and anchors can be conceptualized as numerical information varying along a source credibility ': Invalid token "dvice," (found "dvice,"), production = :predicateObjectList
ERROR [line: 71] Unexpected (found "a"), production = ")"

I will try to re-ingest one of these objects and see if the issue still exists.

leahvanderjagt commented 7 years ago

Hi there, just popping in to say that if there are any records you would like us to review (Anna or one of our students depending on complexity of question) please let me know. Cheers, Leah

Leah Vanderjagt, Digital Repository Services Coordinator, University of Alberta, t. 780.492.3851, leahv@ualberta.ca

weiweishi commented 7 years ago

The issue (with the characters/spaces in abstract that caused invalid RDF triples) is fixed by deleting the old records and remigrating the objects. I will work through the list.
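The fix applied here was remigration, not escaping. As a possible aid for working through the list, the Python sketch below flags characters in an abstract that cannot appear raw inside a double-quoted Turtle literal (double quote, backslash, line feed, carriage return). It is an assumption that these are the characters behind the parser errors quoted earlier in this thread; the thread only says "characters/spaces in abstract". The function name is made up, and the sample text is the fragment visible in the error log.

```python
# Characters that must be escaped inside a double-quoted Turtle literal.
HAZARDS = {
    '"': "double quote",
    "\\": "backslash",
    "\n": "line feed",
    "\r": "carriage return",
}


def literal_hazards(text):
    """Return (offset, description) for each character needing an escape."""
    return [(i, HAZARDS[ch]) for i, ch in enumerate(text) if ch in HAZARDS]


if __name__ == "__main__":
    # Fragment taken from the error log in this thread, including its line break.
    fragment = "(e.g., rejection, adoption, etc.) \nwhen they interact with"
    for offset, what in literal_hazards(fragment):
        print(offset, what)
```

An abstract with an empty result should be safe to embed as a plain quoted literal; a non-empty result marks a record worth checking before or after remigration.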

sfarnel commented 7 years ago

Thanks Weiwei!

weiweishi commented 7 years ago
pbinkley commented 7 years ago

A picture of beer goes here.