okfn-brasil / serenata-toolbox

📦 pip module containing code shared across Serenata de Amor's projects | ** This repository does not receive frequent updates **
MIT License
154 stars 69 forks

Clean/convert all the data before exporting #101

Open turicas opened 7 years ago

turicas commented 7 years ago

I think it's serenata-toolbox's responsibility to convert and clean data, such as replacing , with . in float values, converting dates from the format %d/%m/%y to %Y-%m-%d and so on, so jarbas, rosie and other tools don't need to bother with this kind of task.

If that's true, then we need to move code like to_number and to_date from jarbas to here, remove it from the other repositories (and maybe convert some files that were already exported and are hosted on S3).
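To make the proposal concrete, here is a minimal sketch of what such converters could look like once they live in the toolbox. It is not the actual jarbas code: the signatures and the empty-string handling are assumptions, and it assumes float values use , only as the decimal separator (no thousands separators).

```python
from datetime import datetime
from typing import Optional


def to_number(value: str) -> Optional[float]:
    """Convert a string such as '12,5' to a float (hypothetical helper)."""
    cleaned = value.strip()
    if not cleaned:
        # Assumption: empty cells are treated as missing values
        return None
    # Assumption: ',' is only ever the decimal separator
    return float(cleaned.replace(",", "."))


def to_date(value: str) -> Optional[str]:
    """Convert a %d/%m/%y date string to %Y-%m-%d (hypothetical helper)."""
    cleaned = value.strip()
    if not cleaned:
        return None
    return datetime.strptime(cleaned, "%d/%m/%y").strftime("%Y-%m-%d")
```

With helpers like these applied before exporting, downstream tools could read the values directly, e.g. to_number("12,5") gives 12.5 and to_date("31/12/16") gives "2016-12-31".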

This issue is kind of related to #87.

@cuducos could you please help me validate the issue requirements and add more details, if possible? I can work on this.

cuducos commented 7 years ago

I think it's serenata-toolbox's responsibility to convert and clean data, such as replacing , with . in float values, date in the format %d/%m/%y to %Y-%m-%d and so on

I like this idea — and I would add that having unformatted CNPJ in YYYY-MM-DD-reimbursements and formatted CNPJ in YYYY-MM-DD-companies.xz is quite annoying. Issues like that could be addressed with default parsers for the same data in multiple sources.
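As an illustration of a default parser shared by multiple sources, a CNPJ normalizer could be sketched like this. The function name normalize_cnpj and the choice of digits-only as the canonical form are assumptions, not something already in the toolbox.

```python
import re


def normalize_cnpj(value: str) -> str:
    """Return a CNPJ as 14 digits, with or without punctuation in the input.

    Hypothetical helper: the digits-only canonical form is an assumption.
    """
    digits = re.sub(r"\D", "", value)  # drop '.', '/', '-' and whitespace
    if len(digits) != 14:
        raise ValueError(f"Expected 14 digits in CNPJ, got {len(digits)}")
    return digits
```

Both normalize_cnpj("12.345.678/0001-90") and normalize_cnpj("12345678000190") would then produce the same value, so the reimbursements and companies datasets could be joined without extra cleaning.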

and maybe convert some files that were already exported and are hosted on S3

We can just create new versions of these datasets. That's no problem at all, because previous analyses would still link to the previous versions of the datasets, and we encourage everyone to use the newer versions…

This issue is kind of related to #87

Related (for sure), but the focus there is mostly semantic (it bugs me that a translate method translates data labels and also changes the file extension, and stuff like that). Too much logic is packed into a few methods, so it's difficult to test these methods, it's difficult to name them properly, etc.