jkunze / bagitspec

Proposed changes for 1.0 (updated source repo) #19

Open acdha opened 6 years ago

acdha commented 6 years ago

This is a replacement for #17, reflecting the move from the old loc-rdc organization to the primary LibraryOfCongress organization. The main change from #17 is restoring the fetch.txt section, following discussion with @jkunze, @dbrunton, and @johnscancella.

stain commented 6 years ago

Don't forget to update when merging:

    <date day="6" month="December" year="2016"/>

stain commented 6 years ago

The wording in the bagit.txt section is unclear:

The "bagit.txt" tag file MUST consist of exactly two lines in this order:

BagIt-Version: M.N
Tag-File-Character-Encoding: UTF-8

M.N identifies the BagIt major (M) and minor (N) version numbers, and UTF-8 identifies the character set encoding used by the tag files. The bag declaration MUST be encoded in UTF-8, and MUST NOT contain a byte-order mark (BOM) [RFC3629].

This can be read as saying that the bag MUST always have UTF-8 as its Tag-File-Character-Encoding, which would make the considerations in section 2.3 (Text Tag File Format) moot.

But on second reading, do I understand correctly that you still want to allow any encoding (without saying where that encoding name is defined), and that the bagit.txt file itself is the only one that must be UTF-8? (Why not ASCII?)

I would not mind voting for fixed UTF-8 for BagIt 1.0. This has become the norm for most formats, such as XML and JSON. If we allow other character encodings, we must say which registry we refer to; otherwise arbitrary encoding strings like "code page 865" would be allowed.
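
To make the two readings concrete, here is a rough Python sketch of a check for the bag declaration as quoted above. The function name `parse_bag_declaration` and the `require_utf8` flag are hypothetical and not taken from the spec or from any BagIt library. The sketch assumes a single space after each colon and that the version numbers are plain decimal digits, neither of which the quoted text spells out.

```python
import re

UTF8_BOM = b"\xef\xbb\xbf"
# Assumption: exactly one space after the colon, digits only for M and N.
VERSION_RE = re.compile(r"BagIt-Version: (\d+)\.(\d+)")
ENCODING_RE = re.compile(r"Tag-File-Character-Encoding: (\S+)")


def parse_bag_declaration(raw: bytes, require_utf8: bool = False):
    """Return (major, minor, encoding_name) or raise ValueError.

    `raw` is the byte content of bagit.txt. With require_utf8=True the
    stricter reading is applied: every bag must declare UTF-8.
    """
    # The declaration itself MUST be UTF-8 and MUST NOT start with a BOM.
    if raw.startswith(UTF8_BOM):
        raise ValueError("bagit.txt must not start with a byte-order mark")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("bagit.txt is not valid UTF-8") from exc

    # Exactly two lines, in this order.
    lines = text.splitlines()
    if len(lines) != 2:
        raise ValueError("bagit.txt must consist of exactly two lines")
    version = VERSION_RE.fullmatch(lines[0])
    encoding = ENCODING_RE.fullmatch(lines[1])
    if version is None or encoding is None:
        raise ValueError("malformed bag declaration")

    name = encoding.group(1)
    if require_utf8 and name.upper() != "UTF-8":
        raise ValueError("Tag-File-Character-Encoding must be UTF-8")
    return int(version.group(1)), int(version.group(2)), name
```

With `require_utf8=False` a declaration naming `ISO-8859-1` is accepted as long as bagit.txt itself is UTF-8 without a BOM; with `require_utf8=True` it is rejected. Neither mode checks the name against a registry, which is exactly the gap raised above.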

Edit: See https://github.com/LibraryOfCongress/bagit-spec/pull/14