mjordan / bagit_indexer

Proof-of-concept tool for extracting data from Bags and indexing it in Elasticsearch
The Unlicense

Index bag-info.txt fields that are moved to a METS file when a Bag is processed by Archivematica #14

Open ubercoreydavis opened 5 years ago

ubercoreydavis commented 5 years ago

When Archivematica processes a zipped bag that contains a bag-info.txt file, the information in that file is transferred to a METS file, where it can be found in a predictable location: the analog/digital source metadata sub-section of the administrative metadata section. See the attachment for an example (it includes the original bag-info.txt file and the METS file generated by Archivematica).

The tool that extracts data from bags and indexes it will need to know where the data we want to index is located (e.g., in a METS file, in the analog/digital source metadata sub-section of the administrative metadata section, or, if that data doesn't exist, in the bag-info.txt file).

Using something like the proof of concept BagIt Indexer, example logic would be: If there is a file named "METS.xml" at the root of the Bag's /data directory, look for data at /data/METS.xml// and index it so each element is in a searchable field; if "METS.xml" doesn't exist, index the fields in /bag-info.txt. (We'll need a third fallback option here, in case there is a METS.xml file but it doesn't contain /.) Pretty standard stuff. The gotcha here is that if we add a third deposit type (say a non-bag deposit), the indexer script would need to know where to get the relevant data for that deposit type. This sort of extensibility can be handled with a plugin architecture, in which a separate plugin for each deposit type detects the presence of the desired data, extracts it, and passes it off to the indexing engine.
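To make the fallback logic concrete, here is a minimal Python sketch of the plugin idea. It is not how the BagIt Indexer is written; the class names `MetsPlugin` and `BagInfoPlugin` and the function `extract_fields` are hypothetical. The location of the data inside METS.xml is deliberately left as a caller-supplied ElementTree path rather than guessed, and falling through to bag-info.txt when METS.xml lacks the data is only one possible choice for the third fallback.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def parse_bag_info(text):
    """Parse 'Label: value' lines; indented lines continue the previous value.

    Simplification: if a label repeats, the last value wins.
    """
    fields = {}
    label = None
    for line in text.splitlines():
        if line[:1] in (" ", "\t") and label is not None:
            fields[label] += " " + line.strip()
        elif ":" in line:
            label, value = line.split(":", 1)
            label = label.strip()
            fields[label] = value.strip()
    return fields


class MetsPlugin:
    """Hypothetical plugin: reads fields from data/METS.xml inside the bag."""

    def __init__(self, xpath):
        # Assumption: the caller supplies the path to the metadata elements.
        self.xpath = xpath

    def extract(self, bag_dir):
        mets = Path(bag_dir) / "data" / "METS.xml"
        if not mets.is_file():
            return None
        try:
            root = ET.parse(str(mets)).getroot()
        except ET.ParseError:
            return None
        fields = {}
        for el in root.findall(self.xpath):
            if el.text and el.text.strip():
                # Drop any '{namespace}' prefix from the tag name.
                fields[el.tag.split("}")[-1]] = el.text.strip()
        return fields or None


class BagInfoPlugin:
    """Hypothetical plugin: reads fields from bag-info.txt at the bag root."""

    def extract(self, bag_dir):
        path = Path(bag_dir) / "bag-info.txt"
        if not path.is_file():
            return None
        return parse_bag_info(path.read_text(encoding="utf-8")) or None


def extract_fields(bag_dir, plugins):
    """Return the fields from the first plugin that finds any data."""
    for plugin in plugins:
        fields = plugin.extract(bag_dir)
        if fields:
            return fields
    return {}
```

With this shape, supporting a new deposit type would mean adding one more class with an `extract` method and placing it in the list passed to `extract_fields`; the indexing engine itself would not need to change.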

Of course, using the same names for the METS elements and bag-info.txt fields will result in better queries. However, plugins can also map source fields to a common field name for indexing purposes.
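As a rough illustration of that mapping step, a plugin could translate its source field names before handing data to the indexer. The sketch below is Python; the function name `normalize_fields` and the common field names in the example map are made up for illustration and are not taken from the BagIt Indexer or from Archivematica.

```python
def normalize_fields(fields, field_map):
    """Rename source fields to common index field names.

    Fields without an entry in field_map keep their original name.
    """
    return {field_map.get(name, name): value for name, value in fields.items()}


# Hypothetical mapping for fields read from bag-info.txt.
BAG_INFO_MAP = {
    "Source-Organization": "source_organization",
    "Contact-Name": "contact_name",
}

indexed = normalize_fields(
    {"Source-Organization": "Example Org", "Bagging-Date": "2019-01-01"},
    BAG_INFO_MAP,
)
```

Each plugin would carry its own map, so the index sees one field name regardless of which source the value came from.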