okfn-brasil / serenata-toolbox

📦 pip module containing code shared across Serenata de Amor's projects | ** This repository does not receive frequent updates **
MIT License
154 stars 69 forks

Add checksum file of datasets to the bucket #198

Open willianpaixao opened 6 years ago

willianpaixao commented 6 years ago

A simple plain-text file containing the MD5 checksums of the generated datasets would improve both the credibility of the data and, later, the speed of re-processing, since unchanged data could be detected and skipped.
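To illustrate the re-processing point, here is a minimal sketch of how a stored checksum could be used to detect unchanged data. The function name `is_unchanged` and the idea of a checksum file whose first token is the hex digest are assumptions made for illustration, not something the toolbox provides.

```python
import hashlib
from pathlib import Path


def is_unchanged(dataset_path, checksum_path):
    """Return True when the dataset's MD5 matches the recorded one.

    Hypothetical helper: assumes the checksum file starts with the
    hex digest (anything after the first whitespace is ignored).
    """
    dataset_path = Path(dataset_path)
    checksum_path = Path(checksum_path)
    # Without both files there is nothing to compare, so treat as changed.
    if not dataset_path.is_file() or not checksum_path.is_file():
        return False
    recorded = checksum_path.read_text().split()
    if not recorded:
        return False
    # read_bytes() loads the whole file; fine for a sketch, not for huge files.
    current = hashlib.md5(dataset_path.read_bytes()).hexdigest()
    return current == recorded[0].lower()
```

A processing step could call this first and skip the dataset when it returns True. Note that MD5 is good at catching accidental changes but is not collision-resistant; if tamper resistance matters, `hashlib.sha256` is a drop-in alternative.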

cuducos commented 6 years ago

I've never implemented something like that… So the idea is that, in addition to each file in our storage, we add another file with the same name plus a .checksum extension, containing the MD5SUM, am I right?

willianpaixao commented 6 years ago

> I've never implemented something like that

Either a simple bash command or Python's hashlib would do.
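For the hashlib route, a minimal sketch could look like the following. The function name `md5_of_file` and the chunk size are illustrative choices; on the bash side, the equivalent would be running `md5sum` on the file.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file (hypothetical helper)."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large datasets need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat regardless of dataset size, which is the main reason to prefer it over hashing `read_bytes()` in one go.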

> so the idea is that, in addition to each file in our storage, we add another file with the same name plus a .checksum extension, containing the MD5SUM, am I right?

Yes, you can either add one checksum file per dataset or a single "checksum summary" file that contains one hash per line (example).
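As a rough sketch of the "checksum summary" variant, the snippet below writes one line per file in a folder. The function name `write_checksum_summary` and the default file name `checksums.md5` are assumptions for illustration; the `<hash>  <filename>` layout mirrors what GNU `md5sum` prints, so the result should be checkable with `md5sum -c`.

```python
import hashlib
from pathlib import Path


def write_checksum_summary(folder, summary_name="checksums.md5"):
    """Write '<md5>  <filename>' for every file in `folder`.

    Hypothetical helper; the summary file name is an assumption.
    """
    folder = Path(folder)
    lines = []
    for path in sorted(folder.iterdir()):
        # Skip sub-folders and the summary file itself.
        if not path.is_file() or path.name == summary_name:
            continue
        # read_bytes() loads the whole file; acceptable for a sketch.
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.name}\n")
    summary = folder / summary_name
    summary.write_text("".join(lines))
    return summary
```

A single summary file keeps the bucket listing tidy, while per-dataset files make it possible to fetch one checksum without downloading the rest; which trade-off fits better depends on how the datasets are consumed.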

willianpaixao commented 6 years ago

@cuducos moving on to the implementation: as mentioned above, there are two possible approaches:

  1. using Python and therefore generating the checksum right after creating the dataset.
  2. using a bash command to generate the checksums for all files (datasets) in a folder and upload them to the bucket.

Option 1 would affect the output folder (usually data/) for all developers, while option 2 would be executed by the admins when uploading datasets to the bucket. This is more of a business decision than a technical one. What is your opinion?
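To make the first option concrete, here is a minimal sketch of writing a checksum right after a dataset is created. The function name `save_with_checksum` is hypothetical, and the `.checksum` suffix follows the naming floated earlier in this thread; how the toolbox actually writes its datasets is not assumed here.

```python
import hashlib
from pathlib import Path


def save_with_checksum(path, content):
    """Write `content` (bytes) to `path` plus a sibling .checksum file.

    Hypothetical helper: stands in for whatever code creates a dataset.
    """
    path = Path(path)
    path.write_bytes(content)
    # Same name plus a .checksum extension, e.g. data.csv -> data.csv.checksum
    checksum_path = path.with_name(path.name + ".checksum")
    checksum_path.write_text(hashlib.md5(content).hexdigest() + "\n")
    return checksum_path
```

With this approach every developer's output folder would gain one small extra file per dataset, which is the side effect mentioned above.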

cuducos commented 6 years ago

> using Python and therefore generating the checksum right after creating the dataset.

I like this idea. The code and the know-how are shared with the community, and nothing stays hidden in the core developers' world ;) And a few small checksum files should not be an issue for developers, IMHO.