Open nkrabben opened 7 years ago
@nkrabben coincidentally I'm currently playing with something I call BagItLD. Just some half-baked ideas at this point, but they do suggest one way of extending bag-info.txt tags beyond simple key:value pairs.
If you converted your YAML to JSON, would it be illegal to store the resulting JSON as the value of a bag-info.txt tag like PREMIS-document?
There is a recommendation against long lines (https://github.com/jkunze/bagitspec/blob/master/bagit.xml#L462), so it's not illegal, but I think it would also hurt readability.
My other hesitation is that I'd rather have native parsing of the data, so instead of needing to do something like this (using bagit-python):

    import json

    if 'premis' in bag.info:
        premis_events = json.loads(bag.info['premis'])
    else:
        premis_events = []
    premis_events.append(new_event)
    premis_string = json.dumps(premis_events)
    # json.dumps separates list items with ", " by default
    bag.info['premis'] = premis_string.replace("}, {", "},\r\n {")

I could work with it as:

    if 'premis' not in bag.info:
        bag.info['premis'] = []
    bag.info['premis'].append(new_event)

and not have to worry about human readability as much.
True, JSON as values in tags is against the line-length recommendation and is very ugly.
What do you think about simply shipping a separate .json file which would contain the data which cannot be represented as key-value pairs, with perhaps core data and a pointer in the standard bag-info.txt file? Backwards compatibility makes anything else complicated but the spec explicitly allows other arbitrary tag files and it'd be a lot easier to iterate this way.
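A minimal sketch of that approach, using only the Python standard library. The file name `premis-events.json` and the tag `PREMIS-Events-File` are hypothetical names chosen for illustration and are not defined by the spec. Any tag manifest in the bag would need to be regenerated after these files change.

```python
import json
from pathlib import Path

EVENTS_FILE = "premis-events.json"   # hypothetical tag file name
POINTER_TAG = "PREMIS-Events-File"   # hypothetical bag-info.txt tag


def append_event(bag_dir, event):
    """Append one event to the JSON tag file and return the full event list."""
    bag_dir = Path(bag_dir)
    events_path = bag_dir / EVENTS_FILE

    # Load existing events, or start a new list on first use.
    if events_path.exists():
        events = json.loads(events_path.read_text(encoding="utf-8"))
    else:
        events = []
    events.append(event)
    events_path.write_text(json.dumps(events, indent=2) + "\n", encoding="utf-8")

    # Add the pointer to bag-info.txt once, leaving existing tags untouched.
    info_path = bag_dir / "bag-info.txt"
    info = info_path.read_text(encoding="utf-8") if info_path.exists() else ""
    if not any(line.startswith(POINTER_TAG + ":") for line in info.splitlines()):
        if info and not info.endswith("\n"):
            info += "\n"
        info += f"{POINTER_TAG}: {EVENTS_FILE}\n"
        info_path.write_text(info, encoding="utf-8")
    return events
```

The structured data stays natively parseable as JSON, while bag-info.txt keeps only a short key: value pointer that older tools can ignore.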
Yeah, that's likely the easiest way to go at this moment.
Is there any value in making a more formal recommendation on data formats in the 2.2.4 Other Tag Files section? https://github.com/jkunze/bagitspec/blob/master/bagit.xml#L609
The DPN tag file follows the bag-info format, which can be parsed as YAML. The manifest and fetch files all use a whitespace-delimited, CSV-like format.
Would it be useful to recommend or reiterate these formats in 2.2.4 when creating custom tag file formats?
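To illustrate how little is needed to read the two existing formats, here is a rough sketch of a parser for each. It is a simplification under stated assumptions: it treats indented lines as continuations of the previous value and does not handle encoding declarations or percent-encoded paths. The function names are made up for this example.

```python
def parse_tag_lines(text):
    """Parse bag-info style 'Label: value' lines into (label, value) pairs."""
    pairs = []  # a list, not a dict, because labels may repeat
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and pairs:
            # Indented line: continuation of the previous value.
            label, value = pairs[-1]
            pairs[-1] = (label, value + " " + line.strip())
        else:
            label, _, value = line.partition(":")
            pairs.append((label.strip(), value.strip()))
    return pairs


def parse_manifest_lines(text):
    """Parse manifest style 'checksum path' lines into (checksum, path) pairs."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            # Split on the first run of whitespace only; paths may contain spaces.
            checksum, path = line.split(None, 1)
            entries.append((checksum, path.strip()))
    return entries
```

A custom tag file that reuses either layout could be read with the same few lines, which is the portability argument for recommending them.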
I think it's tedious to update an RFC with various formats but perhaps we should reiterate a recommendation about portability and link to a wiki page or other place where people could track various commonly used files?
I'll do some more research to see if others are using the bag-info and manifest formats in their bags, or if there are other common data formats in use.
You may want to look at our ResearchObject BagIt profile, where we use a JSON-LD file, metadata/manifest.json, as a richer OAI-ORE and PROV-based manifest. You can also have arbitrary annotations linked in using the W3C Web Annotation Model.
See our paper I’ll Take That to Go (doi:10.1109/BigData.2016.7840618) for details, bdbag for tooling, ark:/57799/b91w9r for a minimal example bag, which has this manifest.json.
(The Turtle was converted using the arcp URI scheme to mint absolute URIs within a BagIt bag.)
@stain That seems like something which would make a good reference in #19 — since the spec allows arbitrary top-level tag files it would be nice to recognize community efforts like that as the suggested approach for more complex metadata.
I was undecided about metadata/manifest.json or just manifest.json at top level, but left it one level in to avoid confusion with manifest-sha1.txt etc., at the cost of having to use ../data/ references.
Using absolute URIs like arcp://uuid,f242276b-89fa-4f8b-96ef-265b8ec81230/data/file.txt (combined with a @base) would avoid the ../ references. It might be important then to have a clearly defined URI in the bag-info.txt for generating the corresponding arcp base URI for a bag.
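As a rough sketch of that mapping, the following turns a reference written relative to a tag file into an absolute arcp URI of the form shown above. It assumes the bag's UUID is already known (how it would be recorded in bag-info.txt is left open here), and the function names are hypothetical.

```python
import posixpath
import uuid


def arcp_base(bag_uuid):
    """Return the arcp base URI for a bag identified by a UUID."""
    # uuid.UUID validates the value and normalises it to lowercase.
    return f"arcp://uuid,{uuid.UUID(str(bag_uuid))}/"


def to_absolute(bag_uuid, tag_file, reference):
    """Resolve a reference relative to a tag file into an absolute arcp URI."""
    folder = posixpath.dirname(tag_file)
    path = posixpath.normpath(posixpath.join(folder, reference))
    if path.startswith(".."):
        raise ValueError(f"{reference!r} points outside the bag")
    return arcp_base(bag_uuid) + path
```

For example, `to_absolute("f242276b-89fa-4f8b-96ef-265b8ec81230", "metadata/manifest.json", "../data/file.txt")` resolves the `../data/` reference against the `metadata/` folder and appends the result to the bag's base URI.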
BTW, I think the metadata/ convention in BagIt is getting traction (although you could argue any tag file is metadata?); DataCrate has also moved to metadata/datacite.xml etc. in its proposed 0.2 draft, and metadata/ has also been used by several other archiving initiatives that RDA has identified and listed, most of which are based on BagIt.
I'm running tools that update our bags between receipt and ingest. I would like to record PREMIS events for each of these, and bag-info.txt seems like the most appropriate location. However, bag-info.txt can only contain key: value lines.
My PREMIS implementation would use a YAML format like this:
The simpler route would be to create a standalone bag-premis.txt for this information, but I wanted to see if there was interest in incorporating a more complex structured data format into bag-info.txt.