Open turicas opened 7 years ago
I think it's serenata-toolbox's responsibility to convert and clean data, such as replacing `,` with `.` in float values, converting dates from the format `%d/%m/%y` to `%Y-%m-%d`, and so on
I like this idea, and I would add that having unformatted CNPJ in `YYYY-MM-DD-reimbursements` and formatted CNPJ in `YYYY-MM-DD-companies.xz` is quite annoying. Issues like that could be addressed with default parsers for the same data in multiple sources.
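To make the CNPJ point concrete, here is a minimal sketch of what such a default parser could look like. The function names `strip_cnpj` and `format_cnpj` are hypothetical (they are not taken from serenata-toolbox), and the sketch assumes a purely numeric, 14-digit CNPJ:

```python
import re


def strip_cnpj(value):
    """Return only the digits of a CNPJ, dropping dots, slashes and dashes."""
    return re.sub(r"\D", "", value)


def format_cnpj(value):
    """Return a CNPJ in the XX.XXX.XXX/XXXX-XX layout.

    Assumes a 14-digit numeric CNPJ; anything else raises ValueError.
    """
    digits = strip_cnpj(value)
    if len(digits) != 14:
        raise ValueError(f"expected 14 digits, got {len(digits)}")
    return f"{digits[:2]}.{digits[2:5]}.{digits[5:8]}/{digits[8:12]}-{digits[12:]}"
```

If every dataset ran its CNPJ column through the same function on export, the reimbursements and companies files could be joined without ad hoc cleaning on the consumer side.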
> and maybe convert some files that were already exported and are hosted on S3

We just create a new version of these datasets. That is no problem at all, because previous analyses would link to the previous versions of the dataset, and we encourage everyone to use the newer versions…
> This issue is kind of related to #87

Related (for sure), but the focus there is mostly semantic (it bugs me that a `translate` method translates data labels and changes the file extension, and stuff like that). Too much logic is packed into a few methods, so it's difficult to test these methods, it's difficult to name them properly, etc.
I think it's serenata-toolbox's responsibility to convert and clean data, such as replacing `,` with `.` in `float` values, converting dates from the format `%d/%m/%y` to `%Y-%m-%d`, and so on, so jarbas, rosie and other tools don't need to bother with this kind of task. If that's true, then we need to move code like `to_number` and `to_date` from jarbas to here, remove it from the other repositories (and maybe convert some files that were already exported and are hosted on S3).
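For reference, the two conversions described above could be sketched roughly as follows. This is not the actual jarbas code: only the names `to_number` and `to_date` come from this issue, while the parameters and behaviour are assumptions made for illustration. In particular, thousands separators and empty values are not handled.

```python
from datetime import datetime


def to_number(value):
    """Convert a string such as '12,5' to a float by swapping ',' for '.'."""
    return float(value.strip().replace(",", "."))


def to_date(value, source_format="%d/%m/%y", target_format="%Y-%m-%d"):
    """Re-serialize a date string from source_format to target_format."""
    return datetime.strptime(value.strip(), source_format).strftime(target_format)
```

With these in the toolbox, something like `to_date("31/12/16")` would give an ISO-style date string, and downstream tools could read the exported files without repeating the cleaning step.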
This issue is kind of related to #87.
@cuducos could you please help me validate the issue requirements and add more details, if possible? I can work on this.