Closed christinklez closed 3 weeks ago
African American Museum & Library at Oakland is complete:
https://oac.cdlib.org/findaid/ark:/13030/c83f4vmv
https://oac.cdlib.org/findaid/ark:/13030/c83f4vcq
TitleNotFound: <unittitle/> or <unitdate/> must be present for all documents and components
Attaching the full error output of one of the errors from Bancroft. It looks like the errors are shown in context in the full EAD output. m70_89_cubanc_error.txt
@everreau it looks like the UCB Bancroft finding aid error for m70_89_cubanc.xml (m70_89_cubanc_error.txt) is specifically related to the <c0#> components lacking the IDs needed by ArcLight. But ArcLight should be minting / assigning the IDs as needed, per https://github.com/projectblacklight/arclight/wiki/Indexing-EAD-in-ArcLight . So hopefully they are non-blocking errors, and this finding aid should index / display fine 🤞
@aturner if you scroll all the way down to the bottom of the error log you'll see a Ruby error; this is related to the title-not-found error. Just above that Ruby error you'll see the full XML output of the finding aid, with error messages inserted at the location of each problem. I included the entire output of the indexer in case there is additional useful information in it.
There were issues indexing the Bancroft finding aids. There may be more than one problem, but I have not narrowed it down completely yet.
I disabled the error for any component that had a missing id. This will probably need to be fixed later but it was preventing me from making progress.
I found one fairly widespread problem: the EAD ID is not consistent, so the indexer grabs a long string containing invalid characters, which prevents ArcLight from displaying the finding aid. Example:
Good:
p1959_003_cubanc.xml
<ead:eadid countrycode="US" identifier="ark:/13030/tf7d5nb8zx" mainagencycode="CU-BANC" publicid="-//University of California, Berkeley::Bancroft Library//TEXT (US::CU-BANC::BANC PIC 1959.003--PIC::1934 International Longshoremen's Association and general strikes of San Francisco)//EN">p1959_003_cubanc.xml</ead:eadid>
Bad:
mcb891_cubanc.xml
<eadid identifier="ark:/13030/tf3q2nb06d" mainagencycode="CU-BANC" publicid="-//University of California, Berkeley::Bancroft Library//TEXT (US::CU-BANC::BANC MSS C-B 891::Inventory of the Abraham Darlington papers)//EN" countrycode="us">PUBLIC "-//University of California, Berkeley::Bancroft Library//TEXT (US::CU-BANC::BANC MSS C-B 891::Inventory of the Abraham Darlington papers)//EN" "mcb891_cubanc.xml"</eadid>
The eadid element should contain only the filename. I fixed the example above by hand and it's displaying now:
http://ec2-34-210-75-34.us-west-2.compute.amazonaws.com/catalog/mcb891_cubanc-xml
I have found enough instances of this that it is probably worth writing a script to find and fix them.
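A minimal sketch of what such a script could look like, assuming the fix is simply to replace the eadid body with the filename. The function name `fix_eadid` and the regex approach are illustrative assumptions, not the script actually used here. It edits the text in place rather than re-serializing the XML, so the DOCTYPE and namespace prefixes stay untouched, and it leaves self-closing or missing eadid elements alone.

```python
import re

# Matches <eadid ...>body</eadid>, with an optional namespace prefix such as "ead:".
EADID_RE = re.compile(
    r"(<(?P<tag>(?:\w+:)?eadid)\b[^>]*>)(?P<body>.*?)(</(?P=tag)\s*>)",
    re.DOTALL,
)


def fix_eadid(xml_text, filename):
    """Return (new_text, changed) with the eadid body set to filename.

    Hypothetical helper: only the first eadid element is considered, and
    files without a matching open/close eadid pair are returned unchanged.
    """
    match = EADID_RE.search(xml_text)
    if match is None:
        return xml_text, False
    if match.group("body").strip() == filename:
        return xml_text, False  # already correct
    start, end = match.span("body")
    return xml_text[:start] + filename + xml_text[end:], True
```

Running this over a directory would mean reading each file, calling `fix_eadid(text, path.name)`, and writing the result back only when `changed` is true.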
I found one other finding aid that is not displaying correctly but doesn't match the above pattern (and throws a different error)
http://ec2-34-210-75-34.us-west-2.compute.amazonaws.com/catalog/earthfire-xml_c1007370
This one is a little odd because it looks like this is actually a subcomponent of the full EAD but it's displaying it in a list of EAD results.
OK, I have a tentative theory on this one now. In the container list, this finding aid expresses lists of items like this:
<c03 level="collection" id="c5000001">
<did>
<unitid type="callno">BANC PIC 1933.007--ALB</unitid>
<unittitle>New San Francisco: Three Years after the Great Conflagration: Photographs, 1909</unittitle>
<repository>The Bancroft Library</repository>
</did>
<c04 level="item" id="c5000465">
Most other finding aids don't use level="collection" in the container list; they usually use series or subseries. I think this is probably misinterpreted by the indexer.
I changed this level from collection to file as that seems correct based on these docs: https://wikis.mit.edu/confluence/display/ARCHIVESPROCESSING/Level+of+Description
That seems to have fixed it.
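To check whether other finding aids have the same pattern, something like the following could list components that declare level="collection". This is a sketch using Python's standard `xml.etree.ElementTree`; the helper names are hypothetical, and it assumes components are `<c>` or numbered `<c01>` to `<c12>` elements, with or without a namespace.

```python
import re
import xml.etree.ElementTree as ET

# Component elements: <c> or numbered <c01>, <c02>, ...
COMPONENT_TAG = re.compile(r"^c(\d{2})?$")


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def collection_level_components(xml_text):
    """Return the ids of components whose level attribute is 'collection'."""
    root = ET.fromstring(xml_text)
    return [
        element.get("id")
        for element in root.iter()
        if COMPONENT_TAG.match(local_name(element.tag))
        and element.get("level") == "collection"
    ]
```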
There are two finding aids that are not parsable by python:
BANC_MSS_C-B_1018_ead.xml mcb79_fortransformation_cubanc.xml
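For reference, a well-formedness check along these lines can be sketched as below. It assumes the standard-library `xml.etree.ElementTree` parser, which may not be the parser that produced the failures above; `parse_error` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def parse_error(xml_bytes):
    """Return the parser's error message, or None if the XML is well-formed."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        return str(exc)
    return None
```

Calling `parse_error(path.read_bytes())` for each file in a directory and keeping the non-None results gives a list of unparseable files together with the line and column of the first problem.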
There are also a variety of EADID inconsistencies: some have nothing in this field, some have typos, etc. I will fix all of them to use the filename as the eadid.
attaching list of bancroft eads that did not have the filename as the EADID:
calanhm only has one error regarding a date range. I'm attaching the full error message. calanhm.txt
Error in AAMLO:
Loading eads/aamlo/00002.xml into index...
E, [2024-10-24T15:35:36.475647 #552662] ERROR -- : Unexpected error on record <source_id:al_b8d1d23030b89f751df552d6a51eef0ec4825383 output_id:00002_al_b8d1d23030b89f751df552d6a51eef0ec4825383>
while executing (to_field "normalized_title_ssm" at /apps/arclight/.rbenv/versions/3.1.4/lib/ruby/gems/3.1.0/gems/arclight-1.1.2/lib/arclight/traject/ead2_component_config.rb:123)
Record: <c04 level="item" id="al_b8d1d23030b89f751df552d6a51eef0ec4825383">
<did>
<container type="box-folder" label="Box ">3 : 14 </container>
<unittitle/>
<unitdate/>
</did>
</c04>
Exception: Arclight::Exceptions::TitleNotFound: <unittitle/> or <unitdate/> must be present for all documents and components
/apps/arclight/.rbenv/versions/3.1.4/lib/ruby/gems/3.1.0/gems/arclight-1.1.2/lib/arclight/normalized_title.rb:28:in `normalize'
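Components like the one in this record can be found before indexing. The sketch below flags any component whose `<did>` has neither a non-empty `<unittitle>` nor a non-empty `<unitdate>`. It only approximates the condition in the exception message and is not ArcLight's own logic; the helper names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Component elements: <c> or numbered <c01>, <c02>, ...
COMPONENT_TAG = re.compile(r"^c(\d{2})?$")


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def components_missing_title(xml_text):
    """Return ids of components with no text in unittitle or unitdate."""
    root = ET.fromstring(xml_text)
    missing = []
    for element in root.iter():
        if not COMPONENT_TAG.match(local_name(element.tag)):
            continue
        did = next(
            (child for child in element if local_name(child.tag) == "did"),
            None,
        )
        if did is None:
            continue  # components without a <did> are not checked here
        has_label = any(
            "".join(child.itertext()).strip()
            for child in did
            if local_name(child.tag) in ("unittitle", "unitdate")
        )
        if not has_label:
            missing.append(element.get("id"))
    return missing
```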
aamlo.txt Attaching the full error report from AAMLO. There are 34 errors. All seem to be title errors, but some manifest a little differently, and about 9 of them don't have much context with them.
Attaching the full error report from EDA. There are 24 errors, a mix of "Range inverted" and title errors.
un-parseable glbthistory:
williamstruzenberg.xml glbths_2001-04.xml glbths_1998-08.xml fernandoaguayogarciaead.xml glbths_1998-07.xml glbths_1996-02.xml glbths_1998-48.xml glbths_1997-24.xml glbths_1999-32.xml gidlow.xml
janm parsing errors:
janm_gormankelley.xml KoganYoshizumi.xml janm_JACL-DC.xml janm_tetsuotoyama.xml janm_williammmarutani.xml janm_JACL-PSW.xml minetafa.xml janm_kondofamily.xml janm_paulwatanabe.xml janm_jirokozai.xml janm_georgefujii.xml EmeryFast.xml janm_charlespalmerleetest.xml janm_muraifamily.xml Frost.xml janm_ruthleppman.xml janm_williamhohri.xml janm_JAMA.xml GeorgeHoshida.xml janm_herbertnicholson.xml janm_CLPEF.xml Minerichupload.xml janm_johnbonomi.xml janm_takeshiban.xml
stanford parsing error:
m1522.xml m1176.xml m0108.xml m2680.xml
@aturner I think a lot of the UCI UA finding aids should validate with EAS. Here's one: http://ec2-34-210-75-34.us-west-2.compute.amazonaws.com/catalog/as004-xml
no additional errors identified in hoover.
Proposed list of additional representative sample EADs (and some additional institutions): https://docs.google.com/spreadsheets/d/1W77H91RVSuBZ6rv6nVTHC-NiMbX12hfoBxBN8lPnvbM/edit?usp=sharing