jkunze / bagitspec


Unicode Normalization? #13

Open edsu opened 8 years ago

edsu commented 8 years ago

The BagIt specification lets you declare that tag manifests are encoded as UTF-8, but it doesn't appear to assume a particular Unicode normalization form.

I have a problem where files are bagged on an OS X filesystem (which uses NFD) and then copied to Linux (which uses NFC). During validation, the NFC-normalized path from the filesystem is compared against the NFD-normalized path from the manifest, and validation fails.
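To make the mismatch concrete, here's a minimal Python sketch using the standard library's `unicodedata` module. The filename is a made-up example, not one from an actual bag; the point is only that the same visible name has two different code point sequences, and therefore two different UTF-8 byte sequences.

```python
import unicodedata

# Hypothetical filename containing U+00E9 (precomposed "e with acute").
name = "caf\u00e9.txt"

nfc = unicodedata.normalize("NFC", name)  # keeps the single precomposed code point
nfd = unicodedata.normalize("NFD", name)  # splits it into "e" + U+0301 (combining acute)

# The two forms render identically but are different strings,
# so a plain string (or byte) comparison treats them as different paths.
forms_equal = nfc == nfd

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

print(forms_equal, len(nfc), len(nfd), len(nfc_bytes), len(nfd_bytes))
```

A validator that compares the manifest entry to the on-disk name without normalizing would see these as two unrelated files.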

Should a particular normalization form (NFC?) be assumed for Unicode filenames in manifests?

edsu commented 8 years ago

Perhaps the spec doesn't need to specify as long as applications normalize both the path from the manifest and the path from the filesystem when comparing?
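Roughly what I have in mind, as a sketch only: normalize both sides to the same form before comparing. The function names (`normalize_path`, `missing_from_filesystem`) and the sample paths are hypothetical and not taken from any existing BagIt implementation, and NFC is an arbitrary choice here; any single form would do as long as both sides use it.

```python
import unicodedata


def normalize_path(path, form="NFC"):
    """Return the path in the given Unicode normalization form."""
    return unicodedata.normalize(form, path)


def missing_from_filesystem(manifest_paths, filesystem_paths, form="NFC"):
    """List manifest entries with no matching file once both sides are normalized."""
    on_disk = {normalize_path(p, form) for p in filesystem_paths}
    return [p for p in manifest_paths if normalize_path(p, form) not in on_disk]


# Hypothetical data: the manifest was written with NFD names,
# while the filesystem reports the same file with an NFC name.
manifest_paths = ["data/cafe\u0301.txt"]
filesystem_paths = ["data/caf\u00e9.txt"]

# Naive comparison: the manifest entry looks like it is missing.
naive_missing = [p for p in manifest_paths if p not in set(filesystem_paths)]

# Normalized comparison: nothing is missing.
normalized_missing = missing_from_filesystem(manifest_paths, filesystem_paths)

print(naive_missing, normalized_missing)
```

One caveat with this approach: if a bag really does contain two files whose names differ only by normalization form, they would collapse into one entry after normalizing, so an implementation would probably want to detect and report that case separately.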

acdha commented 8 years ago

I think trying to mandate a normalization form would be hard, but perhaps there should be a prominent guide for implementors? We could follow up with enhancement requests for the known open source projects.

edsu commented 8 years ago

:+1: mandating seems hard (fruitless), but a note telling implementors to pick one form when comparing filesystem paths against manifest filenames seems like a good idea?

acdha commented 7 years ago

@edsu how does the proposed recommendation in https://github.com/loc-rdc/bagitspec/pull/1/ and especially https://github.com/loc-rdc/bagitspec/pull/1/commits/f898aff4ee89c441ee6931f708d942551ad549a4 sound?