jkunze / bagitspec

More complex data structures for bag-info.txt #18

Open nkrabben opened 7 years ago

nkrabben commented 7 years ago

I'm running tools that update our bags between receipt and ingest. I would like to record PREMIS events for each of these, and bag-info.txt seems like the most appropriate location. However, bag-info.txt can only contain key: value lines.

My PREMIS implementation would use a YAML format like this:

Bagging-Date: 2017-06-22
Payload-Oxum: 12345.67
Bag-History:
  - Event-Date-Time: 20170622155934EDT
    Event-Detail-Information: "md_1.json, md_2.json, md_3.json updated"
    Event-Outcome: Pass
    Event-Outcome-Detail-Note: "Bag no longer valid"
    Event-Type: "Payload Metadata Update"
  - Event-Date-Time: 20170622160001EDT
    Event-Detail-Information: "Hashes updated as follows filename, previous hash, new hash md_1.json, 09678de75874f324793a8cafd2db4ea3, 8066e52b17095446e41f57fdd88fe405 ...\n"
    Event-Outcome: Pass
    Event-Type: "Bag Hash Update"
  - Event-Date-Time: 2017062216045EDT
    Event-Detail-Information: "Previous Oxum - 12345.67 New Oxum - 12344.67\n"
    Event-Outcome: Pass
    Event-Type: "Bag Oxum Update"

The simpler route would be to create a standalone bag-premis.txt for this information, but I wanted to see if there was interest in incorporating a more complex structured data format into bag-info.txt.
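To make the constraint concrete, here is a minimal sketch of how a flat bag-info.txt is typically read: each entry is a label: value line, and indented lines continue the previous value. The function name parse_bag_info and the sample text are hypothetical, and the result is a list of pairs rather than a dict on the assumption that labels may repeat. Nothing in this shape can hold a nested list of events without an extra encoding layer.

```python
def parse_bag_info(text):
    """Parse bag-info.txt style text into a list of (label, value) pairs."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0] in " \t" and entries:
            # Continuation line: fold it into the previous value.
            label, value = entries[-1]
            entries[-1] = (label, value + " " + line.strip())
        else:
            # Split on the first colon only, so values may contain colons.
            label, _, value = line.partition(":")
            entries.append((label.strip(), value.strip()))
    return entries


# Hypothetical sample, not taken from a real bag.
sample = (
    "Bagging-Date: 2017-06-22\n"
    "Payload-Oxum: 12345.67\n"
    "Bag-History: first event,\n"
    "  second event\n"
)
entries = parse_bag_info(sample)
```

Every value comes back as a plain string, so any structure inside Bag-History would have to be parsed again by the caller.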

mjordan commented 7 years ago

@nkrabben coincidentally I'm currently playing with something I call BagItLD. Just some half-baked ideas at this point, but they do suggest one way of extending bag-info.txt tags beyond simple key:value pairs.

mjordan commented 7 years ago

If you converted your YAML to JSON, would it be illegal to store the resulting JSON as the value of a bag-info.txt tag like PREMIS-document?

nkrabben commented 7 years ago

There is a recommendation against long lines (https://github.com/jkunze/bagitspec/blob/master/bagit.xml#L462), so it's not illegal, but I think it would also hurt readability.

My other hesitation is that I'd rather have native parsing of the data, so that instead of needing to do something like this (using bagit-python):

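import json

# Load any existing events from the tag value, or start with an empty list.
if 'premis' in bag.info:
    premis_events = json.loads(bag.info['premis'])
else:
    premis_events = []
premis_events.append(new_event)
premis_string = json.dumps(premis_events)
# Break the line between events; json.dumps separates list items with ", ".
bag.info['premis'] = premis_string.replace("}, {", "},\r\n  {")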

I could work with it as

if 'premis' not in bag.info:
    bag.info['premis'] = []
bag.info['premis'].append(new_event)

and not have to worry about human readability as much.

mjordan commented 7 years ago

True, JSON as values in tags is against the line-length recommendation and is very ugly.
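For reference, a long value can at least be folded onto indented continuation lines so that no single line is overly long. The sketch below uses only the standard library; fold_tag is a hypothetical helper, and the default width of 79 is an assumption about the spec's recommendation that should be checked against the linked text. Whether a given BagIt library preserves such folding when it rewrites bag-info.txt is not something this sketch addresses.

```python
import json
import textwrap


def fold_tag(label, value, width=79):
    """Return a 'label: value' entry folded onto indented continuation lines."""
    return textwrap.fill(
        f"{label}: {value}",
        width=width,
        subsequent_indent="  ",   # continuation lines start with whitespace
        break_long_words=False,   # never split inside a token
        break_on_hyphens=False,   # keep labels such as Event-Type intact
    )


# Hypothetical events, serialized as one long JSON string.
events = [{"Event-Type": "Bag Hash Update", "Event-Outcome": "Pass"}] * 3
folded = fold_tag("PREMIS-document", json.dumps(events))

# Unfolding: strip each line and rejoin with single spaces.
unfolded = " ".join(line.strip() for line in folded.splitlines())
```

The folded entry stays within the width, but it is still JSON wrapped at arbitrary points, which supports the point that it is not pleasant to read.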

acdha commented 7 years ago

What do you think about simply shipping a separate .json file which would contain the data which cannot be represented as key-value pairs, with perhaps core data and a pointer in the standard bag-info.txt file? Backwards compatibility makes anything else complicated but the spec explicitly allows other arbitrary tag files and it'd be a lot easier to iterate this way.
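A minimal sketch of that approach, using only the standard library: events live in a separate JSON tag file in the bag's base directory, and bag-info.txt gets a single plain line pointing at it. The file name bag-premis.json, the label PREMIS-Events-File, and both function names are assumptions made for illustration; none of them come from the spec or from bagit-python.

```python
import json
import tempfile
from pathlib import Path


def append_event(bag_dir, new_event, filename="bag-premis.json"):
    """Append an event to a JSON tag file in the bag's base directory."""
    events_path = Path(bag_dir) / filename
    if events_path.exists():
        events = json.loads(events_path.read_text(encoding="utf-8"))
    else:
        events = []
    events.append(new_event)
    events_path.write_text(json.dumps(events, indent=2) + "\n", encoding="utf-8")
    return events


def add_pointer(bag_dir, label="PREMIS-Events-File", filename="bag-premis.json"):
    """Record the tag file's name in bag-info.txt once, as a label: value line."""
    info_path = Path(bag_dir) / "bag-info.txt"
    pointer = f"{label}: {filename}"
    existing = info_path.read_text(encoding="utf-8") if info_path.exists() else ""
    if pointer not in existing.splitlines():
        if existing and not existing.endswith("\n"):
            existing += "\n"
        info_path.write_text(existing + pointer + "\n", encoding="utf-8")


# Demonstration in a throwaway directory standing in for a bag.
with tempfile.TemporaryDirectory() as bag_dir:
    Path(bag_dir, "bag-info.txt").write_text("Bagging-Date: 2017-06-22\n", encoding="utf-8")
    append_event(bag_dir, {"Event-Type": "Payload Metadata Update", "Event-Outcome": "Pass"})
    events = append_event(bag_dir, {"Event-Type": "Bag Hash Update", "Event-Outcome": "Pass"})
    add_pointer(bag_dir)
    add_pointer(bag_dir)  # second call leaves the file unchanged
    info_text = Path(bag_dir, "bag-info.txt").read_text(encoding="utf-8")
```

Any tag manifest in the bag would need to be regenerated after either file changes, which this sketch leaves to the bagging tool.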

nkrabben commented 7 years ago

Yeah, that's likely the easiest way to go at this moment.

Is there any value in making a more formal recommendation on data formats in the 2.2.4 Other Tag Files section? https://github.com/jkunze/bagitspec/blob/master/bagit.xml#L609

The DPN tag file follows the bag-info format, which is interpretable as YAML. The manifest and fetch files all use a whitespace-delimited, CSV-like format.

Would it be useful to recommend or reiterate these formats in 2.2.4 when creating custom tag file formats?
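As an illustration of the second of those formats, here is a minimal sketch of reading manifest-style lines, where each line is a checksum, whitespace, and a file path. The function name parse_manifest and the sample lines are hypothetical. Splitting only on the first run of whitespace is a deliberate choice so that paths containing spaces survive.

```python
def parse_manifest(text):
    """Map each file path to its checksum from manifest-style text."""
    checksums = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first run of whitespace only; the rest is the path.
        checksum, path = line.split(None, 1)
        checksums[path] = checksum
    return checksums


# Hypothetical manifest content for illustration.
sample = (
    "8066e52b17095446e41f57fdd88fe405  data/md_1.json\n"
    "d41d8cd98f00b204e9800998ecf8427e  data/my file.txt\n"
)
manifest = parse_manifest(sample)
```

A custom tag file that reuses this layout could be read by the same few lines, which is part of the appeal of recommending it.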

acdha commented 7 years ago

I think it's tedious to update an RFC with various formats but perhaps we should reiterate a recommendation about portability and link to a wiki page or other place where people could track various commonly used files?

nkrabben commented 7 years ago

I'll do some more research to see if others are using the bag-info and manifest formats in their bags, or if there are other common data formats in use.

stain commented 6 years ago

You may want to look at our ResearchObject bagit profile, where we use JSON-LD manifest for a richer OAI-ORE and PROV-based manifest under metadata/manifest.json. You can also have arbitrary annotations linked in using the W3C Web Annotation Model.

See our paper I’ll Take That to Go (doi:10.1109/BigData.2016.7840618) for details, bdbag for tooling, ark:/57799/b91w9r for a minimal example bag, which has this manifest.json.

(The Turtle version was converted using the arcp URI scheme to mint absolute URIs within a BagIt bag.)

acdha commented 6 years ago

@stain That seems like something which would make a good reference in #19 — since the spec allows arbitrary top-level tag files it would be nice to recognize community efforts like that as the suggested approach for more complex metadata.

stain commented 6 years ago

I was undecided about metadata/manifest.json or just manifest.json at top level - but left it one level in to avoid confusion with manifest-sha1.txt etc, at the cost of having to use ../data/ references.

Using absolute URIs like arcp://uuid,f242276b-89fa-4f8b-96ef-265b8ec81230/data/file.txt (combined with a @base) would avoid the ../ references. It might be important then to have a clearly defined URI in the bag-info.txt for generating the corresponding arcp base URI for a bag.
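A minimal sketch of that mapping, assuming the bag is identified by a UUID: a reference that is relative to a tag file is first resolved to a path inside the bag and then appended to the arcp base URI. The function names arcp_base and arcp_uri are hypothetical. The path is resolved with posixpath rather than urllib.parse.urljoin because, as far as I know, urljoin only resolves relative references for schemes it already recognizes.

```python
import posixpath
import uuid
from urllib.parse import quote


def arcp_base(bag_uuid):
    """Return the arcp base URI for a bag identified by a UUID."""
    # uuid.UUID validates the input and normalizes it to lowercase.
    return f"arcp://uuid,{uuid.UUID(str(bag_uuid))}/"


def arcp_uri(bag_uuid, tag_file, reference):
    """Resolve a reference relative to a tag file into an absolute arcp URI."""
    resolved = posixpath.normpath(
        posixpath.join(posixpath.dirname(tag_file), reference)
    )
    if resolved == ".." or resolved.startswith("../"):
        raise ValueError(f"{reference!r} points outside the bag")
    return arcp_base(bag_uuid) + quote(resolved)


uri = arcp_uri(
    "f242276b-89fa-4f8b-96ef-265b8ec81230",
    "metadata/manifest.json",
    "../data/file.txt",
)
```

With the UUID recorded somewhere agreed upon in bag-info.txt, any consumer could regenerate the same base URI for the bag.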

BTW, I think the metadata/ convention in BagIt is gaining traction (although you could argue any tag file is metadata?). DataCrate has moved to metadata/datacite.xml etc. in its proposed 0.2 draft, and metadata/ has also been used by several other archiving initiatives that RDA has identified and listed, most of which are based on BagIt.