srl295 / srl-unicode-proposals


[UCD-GPG] Data set integrity: signing list of all files, including subdirectories #12

Open behnam opened 6 years ago

behnam commented 6 years ago

Unicode contains various data sets (UCD, IDNA, Emoji, CLDR), some of which have files in subdirectories.

For example, as of Unicode 10.0.0, the UCD contains the auxiliary/ and extracted/ subdirectories.

The SHASUM solution mentioned in the proposal can also serve as the list of files published in a data set, especially since a file that is not signed should not be considered published in the first place.

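If a SHASUM file only lists the files in its own directory, it does not indicate whether the data set also has subdirectories. That leaves a broken chain of trust: nothing signed says that auxiliary/ or extracted/ exist, so they could be dropped or altered without detection.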

The first suggestion is to have exactly one SHASUM file, in the root directory of each data set, listing the paths of all files considered part of the data set, including those in subdirectories. This ensures the integrity of the data set as a whole.
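As a rough illustration of the single root-level file, here is a minimal Python sketch. The file name `SHASUM`, the choice of SHA-256, and the `<digest>  <relative path>` line format (the one `shasum`/`sha256sum` print in text mode) are assumptions for the example; the proposal may specify something different.

```python
import hashlib
from pathlib import Path

MANIFEST_NAME = "SHASUM"  # assumed name of the root-level manifest


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> str:
    """List every file under root, including subdirectories, with its digest.

    Paths are relative to root and use forward slashes, so one manifest in
    the root directory covers e.g. auxiliary/ and extracted/ as well.
    """
    manifest_path = root / MANIFEST_NAME
    rel_paths = sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p != manifest_path  # the manifest cannot list itself
    )
    return "".join(f"{sha256_of(root / rel)}  {rel}\n" for rel in rel_paths)
```

Writing the result to `root / MANIFEST_NAME` and signing that one file would then cover every path in the data set.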

The second suggestion is to make a SHASUM file mandatory for every data set, so that it is the single source of truth for which files belong to the data set. Without a SHASUM file (that is, with only per-file signatures), there is no way to tell whether the data set is complete.
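To show what "single source of truth" could mean for a consumer, here is a hedged Python sketch of a completeness check. It assumes the manifest is named `SHASUM`, uses SHA-256, and has `<digest>  <relative path>` lines; none of that is fixed by this issue. Verifying the signature on the manifest itself is out of scope here.

```python
import hashlib
from pathlib import Path

MANIFEST_NAME = "SHASUM"  # assumed name of the root-level manifest


def parse_manifest(text: str) -> dict:
    """Map each relative path to its expected hex digest."""
    entries = {}
    for line in text.splitlines():
        if line.strip():
            digest, _, rel = line.partition("  ")
            entries[rel] = digest
    return entries


def audit_data_set(root: Path) -> dict:
    """Compare the files on disk against the manifest in the root directory.

    Raises FileNotFoundError when there is no manifest, since completeness
    cannot be judged without one.
    """
    manifest_path = root / MANIFEST_NAME
    if not manifest_path.is_file():
        raise FileNotFoundError(f"no {MANIFEST_NAME} in {root}")
    listed = parse_manifest(manifest_path.read_text(encoding="utf-8"))
    on_disk = {
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p != manifest_path
    }
    present = set(listed) & on_disk
    return {
        # listed in the manifest but absent: the data set is incomplete
        "missing": sorted(set(listed) - on_disk),
        # on disk but not listed: not part of the published data set
        "unlisted": sorted(on_disk - set(listed)),
        # listed and present, but the contents do not match the digest
        "corrupted": sorted(
            rel for rel in present
            if hashlib.sha256((root / rel).read_bytes()).hexdigest() != listed[rel]
        ),
    }
```

A data set passes when all three lists are empty. A detached signature stored next to the manifest would show up as "unlisted" in this sketch and would need to be excluded explicitly.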