Sample - work in progress

alifabeta commented 10 years ago

For everyone watching this project, I've gotten an import into ArchivesSpace to work... sort of.

http://sandbox.archivesspace.org:8081/repositories/2/resources/28 (note that the ArchivesSpace sandbox gets reset periodically, so this link might be short-lived)

Compare to: http://hdl.handle.net/1903.1/2994

It imported once the <extent> tags were added inside the <physdesc> tags. But there are still lots of kinks to be ironed out. Short list of problems:

[ ] Multiple abstracts that are all identical - because abstracts are put in multiple times for ArchivesUM subject guides. Need to remove extras, put subject categorizations elsewhere.
[ ] Headings repeated - for example "Biography" is the heading of the section as well as the first line of the text for that section. Need to remove first line of each section in the collection description.
[ ] Duplication and Copyright Information - can the contact url be an actual link? Look into this.
[ ] Arrangement - series are listed in a poorly-formatted text and then in a nice list - only need to be listed once.
[ ] Related material - refers to subject guides, which obviously aren't here. So take that out.
[x] Components - on the series level, each series is listed twice. The second listing is the one we want (containing the folders and items for each series), so we need to eliminate the first listing.
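
A rough sketch of how the first two items (duplicate abstracts, repeated headings) might be handled in code. This is only an illustration, not the script or converter code itself: the class and method names are hypothetical, it assumes the EAD uses un-prefixed element names such as <abstract>, and it treats "first line" as plain text ending in a newline, which may not match how the section text is actually stored.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical cleanup helpers; none of these names come from ead-db-convert.
final class EadCleanupSketch {

    // Parses an XML string into a DOM document, wrapping checked exceptions.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Removes every <abstract> whose whitespace-normalized text repeats an
    // earlier one. Returns the number of elements removed.
    static int removeDuplicateAbstracts(Document doc) {
        NodeList abstracts = doc.getElementsByTagName("abstract");
        Set<String> seen = new HashSet<>();
        List<Node> toRemove = new ArrayList<>();
        // Collect first: the NodeList is live, so removing while iterating would skip nodes.
        for (int i = 0; i < abstracts.getLength(); i++) {
            Node node = abstracts.item(i);
            String text = node.getTextContent().trim().replaceAll("\\s+", " ");
            if (!seen.add(text)) {
                toRemove.add(node);
            }
        }
        for (Node node : toRemove) {
            node.getParentNode().removeChild(node);
        }
        return toRemove.size();
    }

    // Drops the first line of a section's text when it merely repeats the
    // section heading (assumption: the text is plain and newline-separated).
    static String dropRepeatedHeading(String heading, String text) {
        String body = text.replaceFirst("^\\s+", "");
        int eol = body.indexOf('\n');
        String firstLine = (eol < 0 ? body : body.substring(0, eol)).trim();
        if (!firstLine.equalsIgnoreCase(heading.trim())) {
            return text; // first line is real content, leave it alone
        }
        return eol < 0 ? "" : body.substring(eol + 1);
    }

    public static void main(String[] args) {
        Document doc = parse("<did><abstract>Same</abstract><abstract>Same</abstract></did>");
        System.out.println("Removed: " + removeDuplicateAbstracts(doc));
        System.out.println(dropRepeatedHeading("Biography", "Biography\nBorn in 1928."));
    }
}
```

Real finding aids with a DOCTYPE declaration may need extra parser configuration, and the subject categorizations carried by the extra abstracts would still need a new home before the duplicates are dropped.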

alifabeta commented 10 years ago

Okay, you can look at this: http://sandbox.archivesspace.org:8081/repositories/2/resources/32. Notice that it's much cleaner than the Bencriscutto FA linked above. I cleaned up the code manually to make sure that we're aiming for the correct results. I also created a pseudocode document and the beginnings of the Python script. So now all we need to do is code.

ghost commented 10 years ago

@alifabeta, I see two potential areas of coding:

a new Python program to clean up the data in place in MS Access prior to extraction to EAD
modifying the Java extract code, https://github.com/umd-lib/ead-db-convert, to produce different EAD output

How much of the work above do you think is for the Python program and how much for the EAD converter? It looks like you performed manual EAD modifications for your ArchivesSpace sandbox work. Is that correct?

jennielevineknies commented 10 years ago

@wallberg-umd - I am intrigued by the idea of modifying the Java extract code. I hadn't really thought of that approach before. However, since I am relatively familiar with it, I could do a little testing, playing around with requirements that Amanda works up. Ultimately, that might be a faster way to go. I'd need help making sure that the Java code is set up correctly on my machine, etc. for testing. For example, @alifabeta - I can't remember exactly what the extent issue is, but I think it's along the lines that the current EAD is coded as:

<physdesc label="Size of the Collection" encodinganalog="300$a">140 items</physdesc>

and in ArchivesSpace it needs to be something like:

<physdesc><extent>140 items</extent></physdesc>

(I would ask whether we need to retain these labels and MARC encodinganalog attributes for ArchivesSpace. If not, we could easily remove them from the converter.)

Anyway, a change like that involves modifying this file:

https://github.com/umd-lib/ead-db-convert/blob/develop/src/org/mith/ead/data/DataConvertor.java

And changing

didString = didString + "<physdesc label=\"Size of the Collection\" encodinganalog=\"300$a\">" + rsArch.getString("physdesc") + "</physdesc>";

to

didString = didString + "<physdesc><extent>" + rsArch.getString("physdesc") + "</extent></physdesc>";

Voila! Any EAD you convert using the converter will be ArchivesSpace compliant, at least as far as the extent is concerned.
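
If it helps for testing, the same change could be pulled into a small helper so it can be tried without a database connection. This is just a sketch: PhysdescSketch, escapeXml and physdesc are made-up names, not anything in DataConvertor.java. The XML escaping and the skipping of empty values are my own additions and assumptions; the existing line concatenates the raw database value.

```java
// Hypothetical helper, not part of ead-db-convert: builds the <physdesc>
// element in the <physdesc><extent>...</extent></physdesc> form shown above.
final class PhysdescSketch {

    // Escapes the characters that would break the XML if they appear in the value.
    // The ampersand is replaced first so the other entities are not double-escaped.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Wraps the extent text in <physdesc><extent>. Returns an empty string for
    // null or blank input (assumption: an empty <extent> is not wanted).
    static String physdesc(String extent) {
        if (extent == null || extent.trim().isEmpty()) {
            return "";
        }
        return "<physdesc><extent>" + escapeXml(extent.trim()) + "</extent></physdesc>";
    }

    public static void main(String[] args) {
        System.out.println(physdesc("140 items"));
    }
}
```

In the converter, the changed line would then read something like didString = didString + PhysdescSketch.physdesc(rsArch.getString("physdesc")); assuming the helper lived somewhere the converter can see it.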

So, those are the kinds of changes that are easy to make in the converter, and could probably save us a lot of time... I would need to sit with Ben in order to make the first official changes and overcome my fear of screwing everything up :)

ghost commented 10 years ago

I created a new feature/ArchivesSpace branch for ead-db-convert; see https://github.com/umd-lib/ead-db-convert/tree/feature/ArchivesSpace
