mjordan / bagit_indexer

Proof-of-concept tool for extracting data from Bags and indexing it in Elasticsearch
The Unlicense

Use sha1 hashes as bag identifiers #10

Closed mjordan closed 6 years ago

mjordan commented 6 years ago

Currently the indexer uses the bag's filename as its ID. For example, a bag at /mnt/storage/bag_567.zip gets the ID 'bag_567'. This ID is unique only among bags in the same directory, and if we rename a file, its ID no longer corresponds to its name.
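
To make the limitation concrete, here is a minimal Python sketch of a filename-derived ID. The helper name `filename_id` and the second path are hypothetical and only illustrate the behaviour described above; the indexer's actual code may differ.

```python
from pathlib import PurePosixPath


def filename_id(bag_path: str) -> str:
    """Return the bag's filename without its extension (hypothetical helper)."""
    return PurePosixPath(bag_path).stem


# Two different bags stored in different directories end up with the same ID.
first = filename_id("/mnt/storage/bag_567.zip")
second = filename_id("/mnt/other/bag_567.zip")  # hypothetical second location
print(first, second, first == second)
```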

To work around these limitations, we should use the bag's SHA-1 checksum as its ID, since the probability of an accidental collision is extremely low. If that probability is low enough for Git to use SHA-1 hashes as unique IDs, it is low enough for an index of bags. Perhaps we could even let users enter only the first few characters of a hash in find queries.
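
A rough sketch of the idea in Python, assuming the ID is the SHA-1 digest of the serialized bag file itself. The names `sha1_id` and `find_by_prefix` are hypothetical and are not taken from the indexer's code.

```python
import hashlib


def sha1_id(bag_path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 digest of the file at bag_path, read in chunks."""
    digest = hashlib.sha1()
    with open(bag_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_by_prefix(ids, prefix):
    """Return every ID that starts with prefix.

    More than one match means the prefix is ambiguous and the user
    should supply more characters, as with abbreviated Git hashes.
    """
    prefix = prefix.lower()
    return [bag_id for bag_id in ids if bag_id.startswith(prefix)]
```

Because the digest depends only on the file's bytes, renaming or moving the bag leaves its ID unchanged. Reserializing the bag (for example, re-zipping it) would likely produce different bytes and therefore a different ID.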

The main disadvantage is that bag filenames are much more human-readable than hashes.

mjordan commented 6 years ago

Fixed with a30ce7e.