tulibraries / dplah

Hydra-powered DPLA aggregator prototype; Currently used for legacy data, to eventually be replaced

Configure aggregator to work with Islandora #134

Open dorevabelfiore-temple opened 8 years ago

dorevabelfiore-temple commented 8 years ago

We will need to configure the aggregator to work with Islandora instances. I realize that not all Islandora instances are equal, so this may take longer. Currently we are thinking of potential work with 2 local partners, TBD. Please see me for specific details if you need.

EliseTemple commented 8 years ago

Doreva & Leanne will assess the XML first and then schedule a meeting with Chad once we have some test data.

dorevabelfiore-temple commented 8 years ago

Doreva is working with some institutions: UPITT Historic Pittsburgh, Presbyterian Historical Society, maybe Lafayette College. APS not working. Drexel DUCOM not ready yet. Drexel main campus may not have in-scope content.

dorevabelfiore-temple commented 8 years ago

We have OAI data from UPITT and PHS and @lfinnigan and I are reviewing it.

dorevabelfiore-temple commented 8 years ago

Our team reviewed the data and created a draft mapping. See the second tab, "Field Mapping", here: https://docs.google.com/spreadsheets/d/1pDDoayXkO71UrvmG9czlR01qDau678IyYicmWqAZJ5c/edit#gid=205346128

Let us know if there are any questions! Thanks!

dorevabelfiore-temple commented 7 years ago

Chad adjusted the thumbnail path to the master path for Islandora instances.

bibliotechy commented 7 years ago

Started work on this issue in the Islandora harvests branch. Harvests work as expected, but thumbnails are not being created.

To reproduce, create a new seed with these details and run a harvest.

Name: Historic Pittsburgh
Endpoint url: http://historicpittsburgh.org/oai2
Metadata prefix: oai_dc
Set: pitt_collection.33
Collection Name: Aerial Photographs of Pittsburgh
Contributing Institution: Pitt
Intermediate Provider:
Common Repository Type (if applicable): Islandora
Thumbnail Pattern (if applicable):
Thumbnail Token 1 (if applicable):
Thumbnail Token 2 (if applicable):
Provider ID Prefix: PITT
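For anyone reproducing this outside the aggregator UI, the seed above corresponds to a standard OAI-PMH ListRecords request. Below is a minimal Ruby sketch of how that URL could be assembled. The helper name `list_records_url` is hypothetical and is not taken from the dplah code; only the endpoint, metadata prefix, and set come from the seed details.

```ruby
require "uri"

# Hypothetical helper (not part of dplah): builds an OAI-PMH ListRecords URL
# from the endpoint, metadata prefix, and optional set of a seed.
def list_records_url(endpoint_url, metadata_prefix:, set: nil)
  params = { "verb" => "ListRecords", "metadataPrefix" => metadata_prefix }
  # The set parameter is optional in OAI-PMH, so only add it when present.
  params["set"] = set if set && !set.empty?
  "#{endpoint_url}?#{URI.encode_www_form(params)}"
end

url = list_records_url("http://historicpittsburgh.org/oai2",
                       metadata_prefix: "oai_dc",
                       set: "pitt_collection.33")
puts url
```

Fetching that URL in a browser or with curl should make it possible to inspect the raw feed independently of the harvest code.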

bibliotechy commented 7 years ago

@lfinnigan @dorevabelfiore-temple @skng5

Some records in the original OAI feed have fields with spaces in the middle of them, which is causing the harvest to crash. For example, in pitt_collection.33, record pitt:886.18159.AP has these two identifiers:

<dc:identifier>pitt:886.18159.AP</dc:identifier>
<dc:identifier>pitt: 886.18159.AP</dc:identifier>

I've encountered this issue in pitt_collection.15 as well, so I'm assuming it is a recurrent issue throughout the collection.

Do you want me to try to compensate for this, or have them fix their data?
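If compensating on the aggregator side were chosen, one possible approach is to compare identifiers after stripping whitespace. The following is only a sketch under that assumption: `normalize_identifier` and `whitespace_variants` are hypothetical names, not functions from the harvest code, and whether removing the whitespace is actually the right fix for this data is still the open question above.

```ruby
# Hypothetical helper: removes all whitespace from an identifier.
# The POSIX bracket [[:space:]] is Unicode-aware in Ruby regexps.
def normalize_identifier(identifier)
  identifier.gsub(/[[:space:]]/, "")
end

# Groups raw identifiers by their normalized form and keeps only the groups
# that contain more than one distinct raw spelling.
def whitespace_variants(identifiers)
  identifiers.group_by { |id| normalize_identifier(id) }
             .select { |_normalized, raws| raws.uniq.size > 1 }
end

ids = ["pitt:886.18159.AP", "pitt: 886.18159.AP"]
variants = whitespace_variants(ids)
variants.each do |normalized, raws|
  puts "#{normalized} appears as: #{raws.inspect}"
end
```

A report like this could also be sent to the data provider so they can correct the records at the source.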

dorevabelfiore-temple commented 7 years ago

This is a test case for why #130 is important. I would tell them to fix their data, since this looks obviously wrong. Chad, we tested 20 collections in our first pass. These are the ones indicated with names on the first tab of the sheet. Can you focus on those collections first in your test? If you need more, I can assess a few more on Monday.

dorevabelfiore-temple commented 7 years ago

Confirming that this will be our primary harvesting work for Spring (LSTA Q2 grant = January - March 2017).

dorevabelfiore-temple commented 7 years ago

Sent email to Historic Pittsburgh to investigate.

dorevabelfiore-temple commented 7 years ago

As of 1/13 UPITT is working on this. More anon.

dorevabelfiore-temple commented 7 years ago

Chad says these are some type of unicode characters that are causing the issue.

dorevabelfiore-temple commented 7 years ago

Code pushed to DEV & can be tested.

dorevabelfiore-temple commented 7 years ago

5 collections tested in DEV. 3 worked fine. 2 stopped due to UTF-8 encoding problems. Here are the resque errors:

Worker: li1031-155.members.linode.com:12719 on HARVEST (about an hour ago)
Class: Harvest
Arguments:

id: 324
name: PHLC Historic Pittsburgh City Directories
description:
endpoint_url: http://historicpittsburgh.org/oai2
metadata_prefix: oai_dc
set: pitt_collection.49
contributing_institution: Pittsburgh History and Landmarks Foundation
collection_name: Historic Pittsburgh City Directories
created_at: '2017-03-09T15:39:12.958Z'
updated_at: '2017-03-09T15:39:29.526Z'
set_spec:
in_production: 'No'
new_contributing_institution: Pittsburgh History and Landmarks Foundation
email: ''
provider_id_prefix: PHLC
new_provider_id_prefix: PHLC
new_endpoint_url: ''
common_repository_type: Islandora
thumbnail_pattern: ''
thumbnail_token_1: ''
thumbnail_token_2: ''
thumbnail_explanation:
common_transformation:
intermediate_provider: Historic Pittsburgh
new_intermediate_provider: ''
new_email: ''
rights_statement: ''
identifier_pattern: ''
identifier_token: ''
types_mapping:
type_image: ''
type_text: ''
type_moving_image: ''
type_sound: ''
type_physical_object: ''
contributing_institution_dc_field: ''
last_harvested: ''
Exception: Encoding::CompatibilityError
Error: incompatible character encodings: ASCII-8BIT and UTF-8

id: 322
name: Frick Collection Frick Business Records
description:
endpoint_url: http://historicpittsburgh.org/oai2
metadata_prefix: oai_dc
set: pitt_collection.156
contributing_institution: Frick Collection
collection_name: Henry Clay Frick Business Records
created_at: '2017-03-09T15:05:03.237Z'
updated_at: '2017-03-09T15:05:03.237Z'
set_spec:
in_production: 'No'
new_contributing_institution: Frick Collection
email: ''
provider_id_prefix: FRICK
new_provider_id_prefix: FRICK
new_endpoint_url: ''
common_repository_type: Islandora
thumbnail_pattern: ''
thumbnail_token_1: ''
thumbnail_token_2: ''
thumbnail_explanation:
common_transformation:
intermediate_provider: Historic Pittsburgh
new_intermediate_provider: ''
new_email: ''
rights_statement: ''
identifier_pattern: ''
identifier_token: ''
types_mapping:
type_image: ''
type_text: ''
type_moving_image: ''
type_sound: ''
type_physical_object: ''
contributing_institution_dc_field: ''
last_harvested: ''
Exception: Encoding::CompatibilityError
Error: incompatible character encodings: ASCII-8BIT and UTF-8
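For context on the exception itself: in Ruby, `Encoding::CompatibilityError` is typically raised when a binary (ASCII-8BIT) string containing non-ASCII bytes is combined with a UTF-8 string that also contains non-ASCII characters. The sketch below reproduces that general situation. It is an illustration only; the errors above do not show where in the harvest the strings were combined, and the sample strings are made up.

```ruby
# Bytes as they might arrive from a network response: tagged as binary
# (ASCII-8BIT) and containing non-ASCII content. Sample data is invented.
raw_bytes = "Caf\xC3\xA9".b
utf8_text = "Résumé" # a UTF-8 string with non-ASCII characters

# Returns :ok if the two strings can be concatenated, :incompatible otherwise.
def try_concat(left, right)
  left + right
  :ok
rescue Encoding::CompatibilityError
  :incompatible
end

result_raw = try_concat(raw_bytes, utf8_text)

# Relabeling the bytes as UTF-8 (no transcoding happens) makes the strings
# compatible, provided the bytes really are valid UTF-8.
relabeled = raw_bytes.dup.force_encoding(Encoding::UTF_8)
result_relabeled = try_concat(relabeled, utf8_text)

puts "binary + utf8:    #{result_raw}"
puts "relabeled + utf8: #{result_relabeled}"
```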

bibliotechy commented 7 years ago

We pushed a change that should trap and quarantine these incompatible character encoding files. So please try ingesting these collections again.
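The change itself is not shown in this thread. As a rough illustration of what "trap and quarantine" could look like in Ruby, the sketch below separates record bodies that are valid UTF-8 from those that are not. The method name `partition_by_encoding` and the sample records are hypothetical and are not taken from the dplah code.

```ruby
# Hypothetical sketch: split raw record bodies into usable UTF-8 strings and
# a quarantine list of bodies whose bytes are not valid UTF-8.
def partition_by_encoding(raw_records)
  good = []
  quarantined = []
  raw_records.each do |raw|
    # Relabel a copy as UTF-8 without changing its bytes, then validate it.
    candidate = raw.dup.force_encoding(Encoding::UTF_8)
    if candidate.valid_encoding?
      good << candidate
    else
      quarantined << raw
    end
  end
  [good, quarantined]
end

# Invented sample data: the first body is valid UTF-8, the second contains
# a lone Latin-1 byte (0xE9) that is not valid UTF-8.
records = [
  "<dc:title>Caf\xC3\xA9</dc:title>".b,
  "<dc:title>Caf\xE9</dc:title>".b
]
good, quarantined = partition_by_encoding(records)
puts "usable: #{good.size}, quarantined: #{quarantined.size}"
```

Keeping the quarantined bytes untouched means they can be reported back to the data provider exactly as they were received.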

dorevabelfiore-temple commented 7 years ago

Tested 2 and they looked good. Running another 2 tests today.


-- Doreva Belfiore

Digital Projects Librarian Co-Project Manager Digital Library Initiatives PA Digital Temple University Libraries www.padigital.org 215-204-4942 (P) info@padigital.org 215-204-3681 (F)

dorevabelfiore-temple commented 7 years ago

I am not seeing any more encoding errors in the Resque. Thanks!

What we are seeing are unrelated issues, which are now reported as #166 (bad trackback URL for Islandora on "View Object") and #172 (bad thumbnails). The thumbnail had been working, but now that I have reingested, it is not.

Unfortunately, we are finding some identifiers that are duplicated among collections as well. :-(

Gabe & Rachel have more info.

Thanks!

--Doreva
