gwu-libraries / aspace-barcodes

Populates ArchivesSpace with barcodes from Alma.
Review sample of matches for accuracy #2

Open kilahimm opened 1 year ago

dolsysmith commented 1 year ago

See results of preliminary matching on this spreadsheet.

  1. The tab 1-to-1-matches-with-call-number shows top containers that matched a single Alma item/barcode after testing against both the item enumeration and holdings-level call number.
  2. The tab duplicates shows top containers that matched on more than one item/barcode. No priority was given, so if a top container matched one item on the enumeration and the call number, and a second item on just the enumeration, both matches are shown in this sheet.
  3. Sheet4 shows the collections and the number of top containers with duplicate matches.
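The split between the first two tabs could be sketched roughly as follows. This is only an illustration, not the project's actual code: the record shape `(top_container_id, barcode, matched_on)` and the function name `split_matches` are assumptions made for the example.

```python
from collections import defaultdict


def split_matches(matches):
    """Group candidate matches by top container and separate 1-to-1 from duplicates.

    `matches` is an iterable of (top_container_id, barcode, matched_on) tuples,
    where matched_on is a set drawn from {"enumeration", "call_number"}.
    This record shape is hypothetical.
    """
    by_container = defaultdict(list)
    for container_id, barcode, matched_on in matches:
        by_container[container_id].append((barcode, frozenset(matched_on)))

    one_to_one, duplicates = {}, {}
    for container_id, hits in by_container.items():
        distinct_barcodes = {barcode for barcode, _ in hits}
        if len(distinct_barcodes) == 1:
            one_to_one[container_id] = hits
        else:
            # No priority is applied: every candidate is kept for review.
            duplicates[container_id] = hits
    return one_to_one, duplicates
```

As in the duplicates tab, a container that matched one item on both fields and a second item on the enumeration alone keeps both candidates.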

Questions

  1. Focusing on the collections with many duplicate matches, are there consistent rules that can reduce the amount of duplication? For instance, if a match is made on the enumeration and the series call number, should that be considered the definitive match?
  2. Do the items in 1-to-1-matches-with-call-number seem correctly matched? It may be useful to evaluate a random sample here. 10% would be approximately 800 rows; if divided among 8 people, each person would need to check 100 rows.
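Both questions could be prototyped in a few lines. The sketch below is hedged: `prefer_full_matches` implements the rule floated in question 1 (treat a match on both enumeration and call number as definitive only when exactly one candidate has it), and `assign_sample` draws the 10% sample and deals it out to reviewers. The function names, the `(barcode, matched_on)` shape, and the fixed seed are assumptions for illustration.

```python
import random


def prefer_full_matches(hits):
    """Return the one barcode matched on both fields, or None if ambiguous.

    `hits` is a list of (barcode, matched_on) pairs (hypothetical shape).
    """
    full = [b for b, on in hits if {"enumeration", "call_number"} <= set(on)]
    return full[0] if len(full) == 1 else None


def assign_sample(rows, fraction=0.10, reviewers=8, seed=0):
    """Draw a random sample of `rows` and split it into one batch per reviewer."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    sample = rng.sample(rows, round(len(rows) * fraction))
    # Deal the sampled rows out round-robin so batch sizes differ by at most one.
    return [sample[i::reviewers] for i in range(reviewers)]
```

With roughly 8,000 rows and the defaults, each of the 8 batches holds about 100 rows.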

DaltonAlves commented 1 year ago

My thoughts on your questions:

  1. My gut reaction is that we might want to skip these for now in the spirit of moving forward. These duplicates often occur on resource IDs with known issues. Some of them may require cleanup in Aspace (see below). Others might require some additional logic work to get matches that we feel comfortable with.
  2. So far the 1-to-1-matches-with-call-number rows all seem good. The few issues that I found won't really affect our project. For example, in RG002, Office of the President records, I found that there are multiple TLC containers that represent the same physical box. We've (correctly) matched 1 barcode to multiple TLC records. Pushing the barcode to all of the TLC records would be fine since they all represent the same physical box with the same barcode. However, we should consult with Jen. This may be something that we want to clean up instead of just ignoring.

Diving deeper into the duplicates, here are my findings so far:

Bad Aspace Data - Incorrect linkage between AOs and TLCs

Examples of this are most evident in the duplicates for the David A Clarke records. See this TLC record in the Aspace PUI. Notice how folder numbers are repeated: AOs in series 2 were incorrectly linked to TLCs from series 1. This messes with the TLC component field that we are using to match against Alma enumerations and call numbers. There's probably some logic we could come up with to overcome this issue, but we should probably address the core issue instead.

Unprocessed boxes

The Greater Washington Board of Trade records are a good example of this. The series Unprocessed 1997 Accretion is represented by this holdings record in Alma, 22638551460004107. The unprocessed boxes represented in Alma do not have corresponding TLCs in Aspace. Many (all?) of the duplicates for this resource ID are caused by boxes from the described portions of the collection matching with boxes from that unprocessed accretion, because the box numbers are not unique. A simple fix would be to exclude the items on the 1997 unprocessed (UP) holdings.
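That exclusion might look something like the sketch below, applied to the Alma items before matching. It assumes each item is a dict with a `holdings_id` key; that field name and the function name are hypothetical, and only the holdings ID itself comes from the comment above.

```python
# Holdings whose items should be skipped before matching.
# The ID is the 1997 unprocessed accretion holdings mentioned above.
EXCLUDED_HOLDINGS = {"22638551460004107"}


def drop_excluded_items(items, excluded=EXCLUDED_HOLDINGS):
    """Return only the Alma items that are not on an excluded holdings record."""
    return [item for item in items if item["holdings_id"] not in excluded]
```

Keeping the excluded IDs in a set makes it easy to add other unprocessed holdings later if the same pattern shows up in other collections.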

Implied Series #

Sometimes the enum or call number in Alma is just 'Box X' with no series information. In some cases, we can assume that these belong to series 1. For example, this duplication is caused by that scenario:

32882013264968 Box 1   22638529140004107 B 1 RG0085 Series 1 27668 636 Faculty Women's Club records  
32882017682108 Series 3 Box 1   22638529140004107 B 1 RG0085 Series 1 27668 636 Faculty Women's Club records  
32882018610033 Box 1 Series 2   22638529140004107 B 1 RG0085 Series 1 27668 636 Faculty Women's Club records

This seems very risky though. We may want to manually confirm when this scenario happens.
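One cautious way to handle this is to parse the series and box numbers out of the Alma enumeration and flag any item whose series is only implied, rather than silently assuming series 1. The sketch below is illustrative only: the regular expressions and the `parse_enumeration` name are assumptions, based on the three enumeration strings shown above.

```python
import re

# Patterns based on the sample strings "Box 1", "Series 3 Box 1", "Box 1 Series 2".
SERIES_RE = re.compile(r"\bseries\s+(\d+)", re.IGNORECASE)
BOX_RE = re.compile(r"\bbox\s+(\d+)", re.IGNORECASE)


def parse_enumeration(text):
    """Extract series and box numbers; mark the series as implied if absent."""
    series = SERIES_RE.search(text)
    box = BOX_RE.search(text)
    return {
        "series": series.group(1) if series else None,
        "box": box.group(1) if box else None,
        # True means a reviewer should confirm the series by hand.
        "series_implied": series is None,
    }
```

Items with `series_implied` set could be routed to a manual review list instead of being matched automatically.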