Where do the datasets download?

okfn-brasil / serenata-toolbox

📦 pip module containing code shared across Serenata de Amor's projects | ** This repository does not receive frequent updates **

MIT License


Where do the datasets download? #216

Closed michaelyan-coupa closed 4 years ago

michaelyan-coupa commented 5 years ago

Where do the datasets download? I followed the README and wrote a Python script to perform the downloads, but I cannot find the files in the folder. Thanks!

cuducos commented 5 years ago

From the README.md:

# will download these specific datasets and store into /tmp/serenata-data folder
$ serenata-toolbox /tmp/serenata-data --module federal_senate chamber_of_deputies

That is to say, the first argument is where data is stored. Have you used the first argument to direct the downloads to a specific folder? If you haven't, the default is data/.
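To check where the files actually landed, a minimal sketch like the one below may help. It is not part of serenata-toolbox; `list_downloaded` is a hypothetical helper that only assumes the target is either the folder you passed as the first argument or the default `data/`, which is resolved relative to the directory the script was run from.

```python
from pathlib import Path


def list_downloaded(target="data/"):
    """Return the resolved target directory and the sorted file names in it.

    Hypothetical helper, not part of serenata-toolbox.
    """
    # A relative path such as "data/" resolves against the current
    # working directory, which is a common reason files seem "missing".
    directory = Path(target).resolve()
    if not directory.is_dir():
        return directory, []
    names = sorted(p.name for p in directory.iterdir() if p.is_file())
    return directory, names


if __name__ == "__main__":
    directory, names = list_downloaded("data/")
    print(f"Looking in {directory}")
    for name in names:
        print(name)
```

If the list comes back empty, the printed absolute path shows which folder was actually inspected, which makes it easier to spot a working-directory mismatch.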

michaelyan-coupa commented 5 years ago

Where can I access photos of the receipts? And where are the corresponding JSON files for the OCR extraction? I am referring to this post https://github.com/okfn-brasil/serenata-de-amor/issues/188

cuducos commented 5 years ago

> Where can I access photos of the receipts?

As I explained elsewhere:

> you can download [them] from the source concatenating the URL as we do in Jarbas.

The code linked is as follows:

        args = (self.applicant_id, self.year, self.document_id)
        return (
            'http://www.camara.gov.br/'
            'cota-parlamentar/documentos/publ/{}/{}/{}.pdf'
        ).format(*args)

Does that make sense?
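As a standalone illustration of the snippet above, here is a minimal sketch that builds the same URL outside of Jarbas. The function name `receipt_url` and the sample values are hypothetical; only the URL pattern comes from the quoted code. Whether a PDF is actually served at a given URL is not checked here.

```python
BASE_URL = "http://www.camara.gov.br/cota-parlamentar/documentos/publ"


def receipt_url(applicant_id, year, document_id):
    """Build the receipt PDF URL from the three reimbursement fields.

    Mirrors the pattern quoted from Jarbas; the function itself is
    a hypothetical helper.
    """
    return f"{BASE_URL}/{applicant_id}/{year}/{document_id}.pdf"


if __name__ == "__main__":
    # Sample values are made up for illustration only.
    print(receipt_url(1234, 2019, 567890))
```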

cuducos commented 5 years ago

I see you've asked (but maybe deleted) about the .xz files, @michaelyan-coupa.

Yes, they are the data you're looking for 🎉
They are CSV files compressed with LZMA 🖥
You can open them normally (without decompressing) with pandas (e.g. pd.read_csv("reimbursements-2019.xz")) 🐼
Alternatively you can use xz to decompress them and rename the resulting file to CSV with, for example, xz --decompress reimbursements-2019.xz && mv reimbursements-2019 reimbursements-2019.csv ⌨️
xz is available on most UNIX platforms (try apt-get install xz-utils on Debian-based systems or brew install xz on macOS with Homebrew, for example)
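If installing xz is not an option, the same decompression can be sketched with Python's standard library alone. This is only an illustration: `decompress_xz` is a hypothetical helper, and unlike `xz --decompress` it leaves the original `.xz` file in place.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(source, destination=None):
    """Decompress an .xz file to a .csv file and return the new path.

    Hypothetical helper; by default "name.xz" becomes "name.csv"
    in the same folder. The original file is kept.
    """
    source = Path(source)
    if destination is None:
        destination = source.with_suffix(".csv")
    else:
        destination = Path(destination)
    # Stream the data so large datasets are not loaded into memory at once.
    with lzma.open(source, "rb") as compressed, open(destination, "wb") as output:
        shutil.copyfileobj(compressed, output)
    return destination


if __name__ == "__main__":
    # File name taken from the example in the thread.
    print(decompress_xz("reimbursements-2019.xz"))
```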