okfn-brasil / serenata-toolbox

📦 pip module containing code shared across Serenata de Amor's projects | ** This repository does not receive frequent updates **
MIT License
154 stars 69 forks

API for a high level version of the datasets #159

Open cuducos opened 6 years ago

cuducos commented 6 years ago

What is the problem?

Dealing with the CSV files generated by the toolbox is not trivial: before calling pd.read_csv we need to define a lot of dtypes, and in Jarbas we spent a bunch of lines of code deserializing data (converting strings to date objects, integers, and floats).

How can this be addressed?

@turicas and I talked today and he suggested that the toolbox could offer an API not only to generate a CSV version of our datasets, but also a high level iterator for them. Something like:

from serenata_toolbox.federal_senate import Reader

for row in Reader('path_to.csv'):
    print(row)

Each row would then be an object whose values have proper types (int, Decimal, date, etc.).
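To make the idea more concrete, here is a minimal sketch of what such a reader could look like, using only the standard library. This is not the toolbox's API: the `Reader` class body, the `CONVERTERS` mapping, and the column names (`year`, `net_value`, `issue_date`, `supplier`) are all hypothetical and only illustrate the proposal. It also assumes dates are serialized as ISO 8601 strings and that empty cells mean "no value".

```python
import csv
import io
from datetime import date
from decimal import Decimal

# Hypothetical column-to-converter mapping; real datasets would define their own.
CONVERTERS = {
    "year": int,
    "net_value": Decimal,
    "issue_date": date.fromisoformat,  # assumes ISO 8601 dates (YYYY-MM-DD)
}


class Reader:
    """Iterate over a CSV source, yielding dicts with converted values."""

    def __init__(self, source, converters=CONVERTERS):
        self.source = source  # a file path or an open text file object
        self.converters = converters

    def _convert(self, row):
        converted = {}
        for key, value in row.items():
            if value == "":
                # Assumption: an empty cell means the value is missing.
                converted[key] = None
            else:
                # Columns without a converter stay as strings.
                converted[key] = self.converters.get(key, str)(value)
        return converted

    def __iter__(self):
        if hasattr(self.source, "read"):
            yield from map(self._convert, csv.DictReader(self.source))
        else:
            with open(self.source, newline="", encoding="utf-8") as handle:
                yield from map(self._convert, csv.DictReader(handle))


# Small in-memory demo so the sketch runs without a file on disk.
sample = io.StringIO(
    "year,net_value,issue_date,supplier\n"
    "2017,12.50,2017-03-01,ACME\n"
    "2016,,,\n"
)
for row in Reader(sample):
    print(row)
```

Keeping the converters in a plain dict means each dataset module (for example `federal_senate`) could ship its own mapping while sharing the same iterator logic.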

Who could help with this issue? @turicas ; )

turicas commented 6 years ago

I'm implementing this on: https://github.com/turicas/serenata-toolbox/tree/feature/dataset-reader

turicas commented 6 years ago

All the datasets in Brasil.IO will use the datapackage specification (for more info, see this milestone), and I think it could be the default way to access data in Serenata too. There are libraries that handle it automatically, so we would not need to write converters, just the datapackage spec. What do you think?
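As a rough illustration of why a descriptor removes the need for hand-written converters, the sketch below derives the casts from a Table Schema style `fields` list. It deliberately does not call any of the existing datapackage libraries; the descriptor contents, the resource name, the column names, and the helper functions `field_casts` and `read_resource` are hypothetical. Only the general layout (`resources` → `schema` → `fields` with `name` and `type`) follows the specification, and mapping `number` to `Decimal` is a choice made for this example.

```python
import csv
import io
import json
from datetime import date
from decimal import Decimal

# Hypothetical descriptor, shaped like a datapackage.json file.
DESCRIPTOR = json.loads("""
{
  "name": "example-package",
  "resources": [
    {
      "name": "reimbursements",
      "path": "reimbursements.csv",
      "schema": {
        "fields": [
          {"name": "year", "type": "integer"},
          {"name": "net_value", "type": "number"},
          {"name": "issue_date", "type": "date"},
          {"name": "supplier", "type": "string"}
        ]
      }
    }
  ]
}
""")

# Map a few Table Schema type names to Python callables (illustrative subset).
CASTS = {
    "integer": int,
    "number": Decimal,
    "date": date.fromisoformat,  # assumes ISO 8601 dates
    "string": str,
}


def field_casts(descriptor, resource_name):
    """Build a {column: cast} mapping from the descriptor's schema."""
    for resource in descriptor["resources"]:
        if resource["name"] == resource_name:
            fields = resource["schema"]["fields"]
            # Fields without a declared type are treated as strings here.
            return {f["name"]: CASTS[f.get("type", "string")] for f in fields}
    raise KeyError(resource_name)


def read_resource(handle, casts):
    """Yield typed rows from an open CSV file object."""
    for row in csv.DictReader(handle):
        yield {
            key: (casts.get(key, str)(value) if value != "" else None)
            for key, value in row.items()
        }


casts = field_casts(DESCRIPTOR, "reimbursements")
data = io.StringIO("year,net_value,issue_date,supplier\n2017,12.50,2017-03-01,ACME\n")
for row in read_resource(data, casts):
    print(row)
```

With this approach the type information lives in the descriptor file rather than in Python code, so adding a dataset would mean writing a new spec instead of a new converter.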