ucldc / cinco

Monorepo for all things Online Archive of California, version 5
BSD 3-Clause "New" or "Revised" License

Stage additional representative finding aids and institution data in ArcLight demo #8

Closed christinklez closed 3 weeks ago

christinklez commented 1 month ago

Proposed list of additional representative sample EADs (and some additional institutions): https://docs.google.com/spreadsheets/d/1W77H91RVSuBZ6rv6nVTHC-NiMbX12hfoBxBN8lPnvbM/edit?usp=sharing

everreau commented 1 month ago

African American Museum & Library at Oakland is complete:

https://oac.cdlib.org/findaid/ark:/13030/c83f4vmv
https://oac.cdlib.org/findaid/ark:/13030/c83f4vcq
TitleNotFound: <unittitle/> or <unitdate/> must be present for all documents and components

everreau commented 1 month ago

Attaching the full error output for one of the errors from Bancroft. It looks like the errors are shown in context in the full EAD output.

m70_89_cubanc_error.txt

aturner commented 1 month ago

@everreau it looks like the UCB Bancroft finding aid error for m70_89_cubanc.xml (m70_89_cubanc_error.txt) is specifically related to the <c0#> components lacking the IDs needed by ArcLight. But ArcLight should be minting / assigning the IDs as needed per https://github.com/projectblacklight/arclight/wiki/Indexing-EAD-in-ArcLight . So hopefully they are non-blocking errors -- and this finding aid should index / display fine 🤞

everreau commented 1 month ago

@aturner if you scroll all the way down to the bottom of the error log you'll see a Ruby error that is related to the title-not-found error. Just above that Ruby error you'll see the full XML output of the finding aid, with error messages inserted at the location of each problem. I included the entire output of the indexer just in case there is additional useful information in it.

everreau commented 1 month ago

There were issues indexing the Bancroft finding aids. There may be more than one problem, but I have not narrowed it down completely yet.

I disabled the error for any component that had a missing id. This will probably need to be fixed later but it was preventing me from making progress.

I found one fairly widespread problem: the eadid is not consistent, so the indexer grabs a long string containing invalid characters, which prevents ArcLight from displaying the finding aid. Example:

Good:

p1959_003_cubanc.xml
<ead:eadid countrycode="US" identifier="ark:/13030/tf7d5nb8zx" mainagencycode="CU-BANC" publicid="-//University of California, Berkeley::Bancroft Library//TEXT (US::CU-BANC::BANC PIC 1959.003--PIC::1934 International Longshoremen&#39;s Association and general strikes of San Francisco)//EN">p1959_003_cubanc.xml</ead:eadid>

Bad:

mcb891_cubanc.xml
<eadid identifier="ark:/13030/tf3q2nb06d" mainagencycode="CU-BANC" publicid="-//University of California, Berkeley::Bancroft Library//TEXT (US::CU-BANC::BANC MSS C-B 891::Inventory of the Abraham Darlington papers)//EN" countrycode="us">PUBLIC "-//University of California, Berkeley::Bancroft Library//TEXT (US::CU-BANC::BANC MSS C-B 891::Inventory of the Abraham Darlington papers)//EN" "mcb891_cubanc.xml"</eadid>

The eadid element should contain only the filename portion. I fixed the example above by hand and it's displaying now:

http://ec2-34-210-75-34.us-west-2.compute.amazonaws.com/catalog/mcb891_cubanc-xml

I have found enough instances of this that it is probably worth writing a script to find and fix them.
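A minimal sketch of what such a script could look like, assuming each file should carry its own filename as the eadid text. The names `fix_eadid` and `local_name` are hypothetical and not part of the repo. Tags are matched by local name so that both the `ead:eadid` and plain `eadid` forms shown above are covered.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def fix_eadid(xml_text, filename):
    """Set the text of every <eadid> to `filename`.

    Hypothetical helper: returns (changed, new_xml_text), where `changed`
    is True if any <eadid> did not already hold exactly the filename.
    """
    root = ET.fromstring(xml_text)
    changed = False
    for el in root.iter():
        if local_name(el.tag) == "eadid":
            if (el.text or "").strip() != filename:
                el.text = filename
                changed = True
    return changed, ET.tostring(root, encoding="unicode")
```

Note that serializing through ElementTree drops the DOCTYPE and may rewrite namespace prefixes, so a real fix script might instead use this only to detect bad files and then apply a targeted text replacement.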

everreau commented 1 month ago

I found one other finding aid that is not displaying correctly but doesn't match the above pattern (and throws a different error):

http://ec2-34-210-75-34.us-west-2.compute.amazonaws.com/catalog/earthfire-xml_c1007370

This one is a little odd because it looks like this is actually a subcomponent of the full EAD but it's displaying it in a list of EAD results.

everreau commented 1 month ago

OK, I have a slight theory on this one now. In the container list, this finding aid expresses lists of items like this:

<c03 level="collection" id="c5000001">
    <did>
        <unitid type="callno">BANC PIC 1933.007--ALB</unitid>
        <unittitle>New San Francisco: Three Years after the Great Conflagration: Photographs, 1909</unittitle>
        <repository>The Bancroft Library</repository>
    </did>
    <c04 level="item" id="c5000465">

Most other finding aids don't use level="collection" in the container list; they usually use series or subseries. I think this is probably being misinterpreted by the indexer.

everreau commented 1 month ago

I changed this level from collection to file as that seems correct based on these docs: https://wikis.mit.edu/confluence/display/ARCHIVESPROCESSING/Level+of+Description

That seems to have fixed it.
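To check whether other finding aids have the same pattern, something like the following could list every component whose level is "collection". This is only a sketch: `collection_level_components` is a hypothetical name, and it assumes components are tagged `c` or `c01` through `c12`.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: components are <c> or numbered <c01>..<c12>, possibly namespaced.
COMPONENT_TAG = re.compile(r"^c(\d\d)?$")


def collection_level_components(xml_text):
    """Return (tag, id) for each component declared with level="collection"."""
    root = ET.fromstring(xml_text)
    hits = []
    for el in root.iter():
        name = el.tag.rsplit("}", 1)[-1]  # drop any namespace prefix
        if COMPONENT_TAG.match(name) and el.get("level") == "collection":
            hits.append((name, el.get("id")))
    return hits
```

The top-level `archdesc` element normally has level="collection" as well, but it is not a component tag, so it is not reported.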

everreau commented 1 month ago

There are two finding aids that Python cannot parse:

BANC_MSS_C-B_1018_ead.xml mcb79_fortransformation_cubanc.xml
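For reference, a parse check along these lines can be sketched with the standard library. The function name `unparseable_files` and the flat directory layout are assumptions, and the parser actually used for the check above may differ.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def unparseable_files(directory):
    """Return {filename: error message} for *.xml files that fail to parse.

    Only well-formedness is checked (ET.ParseError); this is not EAD
    schema or DTD validation.
    """
    failures = {}
    for path in sorted(Path(directory).glob("*.xml")):
        try:
            ET.parse(path)
        except ET.ParseError as exc:
            failures[path.name] = str(exc)
    return failures
```

The error message includes the line and column where parsing stopped, which is usually enough to locate the problem in the file.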

There are also a variety of eadid inconsistencies: some files have nothing in this field, some have typos, etc. I will fix all of them to use the filename as the eadid.

everreau commented 1 month ago

Attaching the list of Bancroft EADs that did not have the filename as the eadid:

bancroft_ead_bad_id.txt

everreau commented 1 month ago

calanhm has only one error, regarding a date range. I'm attaching the full error message: calanhm.txt

everreau commented 1 month ago

Error in AAMLO:

Loading eads/aamlo/00002.xml into index...
E, [2024-10-24T15:35:36.475647 #552662] ERROR -- : Unexpected error on record <source_id:al_b8d1d23030b89f751df552d6a51eef0ec4825383 output_id:00002_al_b8d1d23030b89f751df552d6a51eef0ec4825383>
    while executing (to_field "normalized_title_ssm" at /apps/arclight/.rbenv/versions/3.1.4/lib/ruby/gems/3.1.0/gems/arclight-1.1.2/lib/arclight/traject/ead2_component_config.rb:123)

    Record: <c04 level="item" id="al_b8d1d23030b89f751df552d6a51eef0ec4825383">
<did>
<container type="box-folder" label="Box ">3 : 14 </container>
<unittitle/>
<unitdate/>
</did>
</c04>
    Exception: Arclight::Exceptions::TitleNotFound: <unittitle/> or <unitdate/> must be present for all documents and components
    /apps/arclight/.rbenv/versions/3.1.4/lib/ruby/gems/3.1.0/gems/arclight-1.1.2/lib/arclight/normalized_title.rb:28:in `normalize'

everreau commented 1 month ago

Attaching the full error report from AAMLO: aamlo.txt. There are 34 errors. All seem to be title errors, but some manifest a little differently, and about 9 of them don't come with much context.
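Components like the `c04` in the record above can be found before indexing. The sketch below flags any component whose `did` has neither a non-empty `unittitle` nor a non-empty `unitdate`. It approximates the TitleNotFound condition and is not ArcLight's own logic, and the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: components are <c> or numbered <c01>..<c12>, possibly namespaced.
COMPONENT_TAG = re.compile(r"^c(\d\d)?$")


def _local(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def _has_text(did, name):
    """True if any <name> element under this <did> contains non-blank text."""
    return any(
        "".join(el.itertext()).strip()
        for el in did.iter()
        if _local(el.tag) == name
    )


def untitled_components(xml_text):
    """Return the ids of components with no usable unittitle or unitdate."""
    root = ET.fromstring(xml_text)
    missing = []
    for comp in root.iter():
        if not COMPONENT_TAG.match(_local(comp.tag)):
            continue
        # Only the component's own <did>, not those of nested components.
        did = next((ch for ch in comp if _local(ch.tag) == "did"), None)
        if did is None or not (
            _has_text(did, "unittitle") or _has_text(did, "unitdate")
        ):
            missing.append(comp.get("id"))
    return missing
```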

everreau commented 1 month ago

Attaching the full error report from EDA. There are 24 errors, a mix of "Range inverted" and title errors.

eda.txt
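A rough pre-check for the date errors, assuming (this is not confirmed by the error reports) that "Range inverted" comes from `unitdate` elements whose `normal` attribute has the form `start/end` with an end year earlier than the start year. The name `inverted_date_ranges` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: normal="YYYY/YYYY" or normal="YYYY-MM-DD/YYYY-MM-DD".
YEAR_RANGE = re.compile(r"^(\d{4})[^/]*/(\d{4})")


def inverted_date_ranges(xml_text):
    """Return the `normal` values whose end year precedes the start year."""
    root = ET.fromstring(xml_text)
    bad = []
    for el in root.iter():
        if el.tag.rsplit("}", 1)[-1] != "unitdate":
            continue
        match = YEAR_RANGE.match(el.get("normal", ""))
        if match and int(match.group(2)) < int(match.group(1)):
            bad.append(el.get("normal"))
    return bad
```

Only years are compared, so a range inverted within a single year would not be caught by this sketch.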

everreau commented 1 month ago

un-parseable glbthistory:

williamstruzenberg.xml glbths_2001-04.xml glbths_1998-08.xml fernandoaguayogarciaead.xml glbths_1998-07.xml glbths_1996-02.xml glbths_1998-48.xml glbths_1997-24.xml glbths_1999-32.xml gidlow.xml

everreau commented 1 month ago

janm parsing errors:

janm_gormankelley.xml KoganYoshizumi.xml janm_JACL-DC.xml janm_tetsuotoyama.xml janm_williammmarutani.xml janm_JACL-PSW.xml minetafa.xml janm_kondofamily.xml janm_paulwatanabe.xml janm_jirokozai.xml janm_georgefujii.xml EmeryFast.xml janm_charlespalmerleetest.xml janm_muraifamily.xml Frost.xml janm_ruthleppman.xml janm_williamhohri.xml janm_JAMA.xml GeorgeHoshida.xml janm_herbertnicholson.xml janm_CLPEF.xml Minerichupload.xml janm_johnbonomi.xml janm_takeshiban.xml

everreau commented 1 month ago

stanford parsing errors:

m1522.xml m1176.xml m0108.xml m2680.xml

everreau commented 3 weeks ago

@aturner I think a lot of the UCI UA finding aids should validate with EAS. Here's one: http://ec2-34-210-75-34.us-west-2.compute.amazonaws.com/catalog/as004-xml

everreau commented 3 weeks ago

no additional errors identified in hoover.