okfn-brasil / serenata-toolbox

📦 pip module containing code shared across Serenata de Amor's projects | **This repository does not receive frequent updates**
MIT License

Script to Download and include Supervised Learning #126

Open silviodc opened 7 years ago

silviodc commented 7 years ago

After contributions from many people, we built a gold standard to serve as a reference for whether a reimbursement is a generalization or not. Example of a generalization: 5635048.pdf

Not a generalization: 5506259.pdf. Our reference sample consists of 1,691 suspicious and 1,691 non-suspicious reimbursements (link). It was manually curated, as explained in this video made by Felipe Cabral (apoia.se).

The goal of this dataset is to deal with this part of CEAP:

The document proving the payment may not contain erasures, additions, amendments, or interlineations; it must bear a date and must describe the services or materials item by item, without generalizations or abbreviations, and may be:

Thus, this issue aims at the following:

  1. Transfer the files I have on Google Drive to Amazon S3
  2. Create a script to download the above files in the toolbox. It could be a new category of datasets, e.g., Supervised Learning
  3. Create a script to download pre-built machine learning models to Rosie

First objective: find below the files I have: PNG images, CSV reference.

Regarding the CSV files, we have to include the direct link to the Chamber of Deputies; right now they only have the link to Jarbas. To do that easily, take the document id from the CSV file and build the full link using a method like this:

def document_url(record):
    """Build the direct Chamber of Deputies URL for a reimbursement PDF."""
    return 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/%s/%s/%s.pdf' % (
        record['applicant_id'], record['year'], record['document_id'])
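
For illustration, a minimal sketch of applying it to the curated CSV, assuming the reference file carries the applicant_id, year, and document_id columns (the file names here are hypothetical; the real reference lives in the links above):

import pandas as pd

# Hypothetical file name for the curated reference CSV
reference = pd.read_csv('generalization_reference.csv', dtype={'document_id': str})
reference['document_url'] = reference.apply(document_url, axis=1)
reference.to_csv('generalization_reference_with_links.csv', index=False)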

The dataset I used was this one:

import pandas as pd

data = pd.read_csv('../data/2016-11-19-last-year.xz',
                   parse_dates=[16],
                   dtype={'document_id': str,
                          'congressperson_id': str,
                          'congressperson_document': str,
                          'term_id': str,
                          'cnpj_cpf': str,
                          'reimbursement_number': str})

# Keep only meal reimbursements whose ids appear in the curated reference
data = data[data['subquota_description'] == 'Congressperson meal']
data = data[data['document_id'].isin(doc_ids)]  # doc_ids are retrieved from the csv file

The first objective will allow more people to access these curated files in order to replicate our experiments and create new ones!

Second objective: the goal is to be able to call a method like:

from serenata_toolbox.chamber_of_deputies.dataset import Dataset
chamber_of_deputies = Dataset(self.path)
chamber_of_deputies.fetch_supervised_learning()

This makes it easy to integrate the mentioned files into other parts of the project, e.g., Classifier using these files, Analysis using these data.
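
A rough sketch of what such a method could do, assuming the files live in a public S3 bucket (the URL, file names, and implementation below are illustrative, not the toolbox's actual internals):

import os
from urllib.request import urlretrieve

class Dataset:

    # Hypothetical bucket URL; the real location would come from the toolbox config
    URL = 'https://serenata-de-amor-data.s3.amazonaws.com/'

    def __init__(self, path):
        self.path = path

    def fetch_supervised_learning(self):
        """Download the curated reference CSV and the PNG images archive."""
        for name in ('generalization_reference.csv', 'generalization_images.tar.gz'):
            target = os.path.join(self.path, name)
            if not os.path.exists(target):
                urlretrieve(self.URL + name, target)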

Third objective, as you can see in the mentioned link: [Classifier using these data](https://github.com/datasciencebr/rosie/pull/66)

Uploading big files to git is not a good practice. Therefore, to facilitate the contribution of new models to Rosie, we have to create a method to specify which model we would like to retrieve. Example right now:

if classifier.__name__ == 'MealGeneralizationClassifier':
    model = classifier()
    model.fit('rosie/chamber_of_deputies/classifiers/keras/model/weights.hdf5')

Proposed:

if classifier.__name__ == 'MealGeneralizationClassifier':
    model = classifier()
    model.fit(self._model('generalization'))

This will allow us to include more models in the future and to retrain the existing ones to be more robust. For this task, find my model here: Meal Generalization.
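
A possible sketch of that `_model` helper as a method of Rosie's core class (the storage URL and the local paths are assumptions):

import os
from urllib.request import urlretrieve

# Assumed storage location; the real bucket/prefix would come from Rosie's settings
MODELS_URL = 'https://serenata-de-amor-data.s3.amazonaws.com/models/'

def _model(self, name):
    """Return a local path for a pre-trained model, downloading it when missing."""
    filename = '{}.hdf5'.format(name)
    target = os.path.join('rosie', 'chamber_of_deputies', 'models', filename)
    if not os.path.exists(target):
        os.makedirs(os.path.dirname(target), exist_ok=True)
        urlretrieve(MODELS_URL + filename, target)
    return target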

PS: To upload files, we have this method in the toolbox: remote.py

cuducos commented 7 years ago

Thus, this issue aims at the following:

  1. Transfer the files I have on Google Drive to Amazon S3
  2. Create a script to download the above files in the toolbox. It could be a new category of datasets, e.g., Supervised Learning
  3. Create a script to download pre-built machine learning models to Rosie

Hooray. Let's avoid reinventing the wheel.

Actually these 3 points are already covered. We have methods for that in serenata_toolbox.Datasets. The only detail is that uploading is not open to everybody for security reasons. What people usually do is share a link in the PRs to a (temporary) storage such as WeTransfer, Dropbox, Google Drive, etc., and then someone with the API keys handles the upload and puts the file there.

Also, I don't think that for now we need a special directory, method, or class for supervised learning files.

Finally, downloading the PDFs is already automated.

That said, the focus of this issue is simply to foster the development of a script/method able to generate this file (i.e., port it from the other repository/notebook to the toolbox, in case anyone willing to reproduce our steps wants to start from scratch). Basically, going from .pdf to .png.
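
For reference, one possible sketch of that conversion step using the pdf2image library (an assumption on my part; the original notebook may rely on a different tool, and pdf2image needs poppler installed):

from pdf2image import convert_from_path

def pdf_to_png(pdf_path, png_path):
    """Render the first page of a reimbursement PDF as a PNG image."""
    pages = convert_from_path(pdf_path, dpi=200)
    pages[0].save(png_path, 'PNG')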

silviodc commented 7 years ago

Hi @cuducos, thanks so much for the remarks! I will try my best to highlight the points I would like to make in this issue.

Actually these 3 points are already covered. We have methods for that in serenata_toolbox.Datasets

I totally agree with you that all the methods to download a given quantity of reimbursements or full datasets are already implemented, as are the methods to upload them using proper credentials.

Therefore, the focus of this issue is not to reinvent the wheel, but to integrate these curated reimbursements and files into the serenata_toolbox.

  1. Let's suppose we have a classifier to deal with handwritten reimbursements, and someone would like to test Rosie using these curated files. Since internet connection is always a problem in some regions of Brazil, downloading 700 MB of images (or more) at once reduces the usability of this dataset. Therefore, after discussing with @jtemporal, we thought that a CSV file that allows smoothly reconstructing the structure of this dataset (training[positive, negative] | validation[positive, negative] | pos_validation[positive, negative]) could be an asset (see the sketch after this list).

  2. Backup of these files: previous experience showed that sometimes the Chamber of Deputies removes invalid reimbursements, so what will happen to the download links is a mystery. I'm okay with having PDF or PNG files stored. The only point is that converting PDF to PNG is time-consuming too; since we can have a backup of the transformed files, IMHO it is better to have them as PNG.

  3. Training phase: since the model for this dataset hardly changes, training Rosie on every new run to produce the same model is also time-consuming. On my machine it took 3 hours. I guess we could have pre-built models that are updated once a year.
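
Regarding point 1, a minimal sketch of rebuilding that folder structure from the CSV (the column names split and label are assumptions about the reference file):

import os
import shutil
import pandas as pd

# Assumed columns: document_id, split ('training', 'validation', 'pos_validation')
# and label ('positive', 'negative'); the real reference CSV may differ
reference = pd.read_csv('generalization_reference.csv', dtype={'document_id': str})

for _, row in reference.iterrows():
    folder = os.path.join('dataset', row['split'], row['label'])
    os.makedirs(folder, exist_ok=True)
    image = '{}.png'.format(row['document_id'])
    shutil.copy(os.path.join('images', image), os.path.join(folder, image))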

Also, I don't think that for now we need a special directory, method, or class for supervised learning files.

I agree that right now it could be too early for that. We only have this curated dataset and one classifier. However, in case more people build such models, the future architecture of the serenata-toolbox can smoothly deal with this demand. Furthermore, it could open the opportunity to receive new contributions to Rosie.

That said, the focus of this issue is simply to foster the development of a script/method able to generate this file (i.e., port it from the other repository/notebook to the toolbox, in case anyone willing to reproduce our steps wants to start from scratch). Basically, going from .pdf to .png.

As I mentioned in point 2, and from my experience building these curated files, the removal of files by the chamber is always a surprise. While I was publishing the CSV on Google Docs, Felipe Cabral and I saw some reimbursements without the link to the PDF, even though I had all these files on my computer because I had downloaded them one month before.

So, more than being unable to reproduce the experiments, we are losing curated reimbursements. I mean, we only have the link, not the PDF or image to show to the chamber and ask for clarifications.

Maybe we could discuss this and restructure it better.

cuducos commented 7 years ago

Ok… first a minor comment:

The only point is that converting PDF to PNG is time-consuming too; since we can have a backup of the transformed files, IMHO it is better to have them as PNG.

I agree that we could have these receipts converted to PNG and stored. And we can upload anything we want to some storage like S3. The point is that anything uploaded there must be produced from scripts available in our repos ; )

That said, the issue is very straightforward IMHO:

  1. Implement the code that trains Rosie as part of the toolbox (after all we need to allow most avid collaborators to generate their training models for whatever reason)
  2. Maybe rename serenata_toolbox.Datasets to something like serenata_toolbox.Storage
  3. Generate a version of the images and of the trained model to upload to the storage (S3)

Am I missing anything?

Many thanks for all the effort and clarification, @silviodc — let's tackle it!

silviodc commented 7 years ago

Continuing...

The point is that anything uploaded there must be produced from scripts available in our repos ; )

Do you mean you would like to run this workflow on our machines in order to upload the result to storage? First workflow (a condensed sketch of steps 1-3 follows the list):

  1. take the CSV file with references,
  2. download the PDFs,
  3. transform the PDFs to PNG,
  4. generate the folder structure (training[positive, negative] | validation[positive, negative]),
  5. finally, upload to storage?
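
Something like this, where the file names are hypothetical and pdf2image is the same assumption as in the sketch above:

import pandas as pd
from urllib.request import urlretrieve
from pdf2image import convert_from_path

URL = 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/%s/%s/%s.pdf'

reference = pd.read_csv('generalization_reference.csv', dtype={'document_id': str})
for _, row in reference.iterrows():
    pdf = '%s.pdf' % row['document_id']
    urlretrieve(URL % (row['applicant_id'], row['year'], row['document_id']), pdf)  # step 2
    convert_from_path(pdf, dpi=200)[0].save(pdf.replace('.pdf', '.png'), 'PNG')     # step 3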

Second workflow:

  1. Download the previous reference
  2. Train Rosie and upload the best generated model (upload sketched below).
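
For the upload step, a minimal boto3 sketch (the bucket name and key are assumptions, and only maintainers holding the project's AWS credentials could actually run it, as noted above):

import boto3

# Assumed bucket and key; credentials come from the standard AWS configuration
s3 = boto3.client('s3')
s3.upload_file('weights.hdf5', 'serenata-de-amor-data', 'models/generalization.hdf5')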

I'm okay with it. Even if some receipts cannot be downloaded anymore, we are still left with a representative quantity of data.

Implement the code that trains Rosie as part of the toolbox (after all we need to allow most avid collaborators to generate their training models for whatever reason)

I didn't get your point here. Could you explain it better? I mean, once I have the reference data, generating a new model on my machine is easy.

The point I mentioned is: we should avoid the overhead of training from the references every time, since the model hardly changes. I guess we have to allow people to download Rosie's official models. I provided mine as a base; however, we can generate others.

Maybe rename serenata_toolbox.Datasets to something like serenata_toolbox.Storage

It sounds much better now.

Generate a version of the images and of the trained model to upload to the storage (S3)

Version control of the trained models is perfectly fine. However, if we change the images we also have to change the reference, which means curating parts of our gold standard again. It is a laborious task to ask people to curate again something they have done before.

I think we could include new references, but never change the previously curated ones.

What do you think?

silviodc commented 7 years ago

Just clarifying the last part of my previous comment.

However, if we change the images we also have to change the reference, which means curating parts of our gold standard again. It is a laborious task to ask people to curate again something they have done before. I think we could include new references, but never change the previously curated ones.

Considering this part, take a look at this suspicious reimbursement found by Rosie.

Probably, at the beginning, we would build the references classifying it as a positive (suspicious) example. We would have a stored image and our models would use it during training, which is fine.

However, as you can see, the chamber changed the receipt. The deputy included a new page clarifying that this receipt is not suspicious. That is completely fine for the chamber's API. However, for our robot and contributors it can bring some problems in the future if we always download the PDFs or use the CSV reference alone.

That said, I also think we could provide the images in small blocks, e.g., 100 MB each (see the packing sketch below), to solve this:

Since internet connection is always a problem in some regions of Brazil, downloading 700 MB of images (or more) at once reduces the usability of this dataset.
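
A sketch of how the PNGs could be packed into roughly 100 MB blocks (the archive names and directory layout are illustrative):

import os
import tarfile

def pack_in_blocks(image_dir, block_size=100 * 1024 ** 2):
    """Pack the PNG images into numbered tar.gz archives of about block_size bytes."""
    block, size, counter = None, 0, 0
    for name in sorted(os.listdir(image_dir)):
        if block is None or size >= block_size:
            if block is not None:
                block.close()
            counter += 1
            block = tarfile.open('images-block-%03d.tar.gz' % counter, 'w:gz')
            size = 0
        path = os.path.join(image_dir, name)
        block.add(path, arcname=name)
        size += os.path.getsize(path)
    if block is not None:
        block.close()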

Furthermore, we have to state that these images reflect the CSV reference files, not whether a reimbursement is currently marked suspicious in the chamber. That will avoid future surprises caused by unclear terminology/claims versus the live data from the chamber's API.

We could go through a second phase of reclassification of these references once a year and then update the entire reference. IMHO this avoids inconsistency problems that would affect our training/predictions for supervised classifiers.

willianpaixao commented 5 years ago

@silviodc Any news on this issue? Would you mind publishing your latest results and writing up what's left so someone else could pick it up? CC @cuducos

silviodc commented 5 years ago

Hi @willianpaixao,

I suggest you have a look at these 3 pull requests:

https://github.com/okfn-brasil/serenata-de-amor/pull/238

https://github.com/okfn-brasil/serenata-de-amor/pull/286

https://github.com/okfn-brasil/rosie/pull/66

I think they are a good start for implementing what we discussed here: https://docs.google.com/document/d/1qjYHKr9FLAaDwI4VJeHbcWj4LKrAWBeHqrB6VqBLXi8/edit#heading=h.6nco566ujoz0 (see page 2).

Let me know if you have any questions,

Best, Silvio