opendataam / opendatam-tasks

Public tasks for volunteers, hackathons and contests
Creative Commons Zero v1.0 Universal

[EN] Convert Yerevan budget documents from PDF to CSV/XML/JSON machine readable data #5

Open ivbeg opened 1 year ago

ivbeg commented 1 year ago

Goal

The goal is to create a dataset of the Yerevan city budget for further analysis and visualization. This is possible now because the budget is being published as a set of PDF documents.

Tasks

The Yerevan city budget for 2023 and the report on the execution of the 2022 budget will be published on the city website https://www.yerevan.am/hy/finance/ as a set of archives with PDF documents inside.

These documents have a text layer that can be processed to extract tables.

These tables look like the one on page 49 of the 2022 budget report: https://www.yerevan.am/uploads/media/default/0002/19/b257858f7a9940c75efc4a98acb88e949dd6e554.pdf

  1. These tables should be extracted and converted to Excel, CSV, or JSON files, one file per table.
  2. It would be great if the table headers were translated from Armenian to English. For example, Եկամտատեսակները in the Excel or CSV file should be written as "income".
  3. It would be even better if you could convert past budgets too: the city budgets for 2018-2022 are available as sub-pages at the same link.
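As a rough illustration of tasks 1 and 2, here is a minimal Python sketch that takes one already extracted table (a list of rows, first row being the header), translates known headers, and writes one CSV and one JSON file per table. It assumes the extraction itself is done by some other tool. The function names and the `HEADER_TRANSLATIONS` dictionary are hypothetical; only the Եկամտատեսակները → "income" pair comes from this task, and the rest of the mapping would have to be filled in by hand.

```python
import csv
import json
from pathlib import Path

# Hypothetical mapping. Only this first pair is taken from the task text;
# other Armenian headers must be added manually.
HEADER_TRANSLATIONS = {
    "Եկամտատեսակները": "income",
}


def translate_headers(headers, mapping=HEADER_TRANSLATIONS):
    """Replace known Armenian headers with English ones; keep unknown ones as-is."""
    return [mapping.get(h.strip(), h.strip()) for h in headers]


def save_table(rows, out_dir, name):
    """Write one table (first row = header) to <name>.csv and <name>.json."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    headers = translate_headers(rows[0])
    body = rows[1:]

    csv_path = out_dir / f"{name}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(body)

    # JSON output: one object per data row, keyed by the translated headers.
    json_path = out_dir / f"{name}.json"
    records = [dict(zip(headers, row)) for row in body]
    json_path.write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return csv_path, json_path
```

Headers that are missing from the mapping are kept in Armenian, so untranslated columns stay visible in the output instead of being silently dropped.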

Context

The Yerevan budget was published as a set of Armenian-only text/PDF documents, without any machine-readable version or even an Excel file.

To convert the PDF files to Excel or CSV/JSON, you could use ABBYY FineReader, Tabula, or any other tool that helps.

Requirements

Wishes

Please write your code so that it is reusable and can be run by someone else later, since we may need to update this dataset.
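One possible shape for such reusable code is a small batch converter that walks a folder of PDFs and writes one CSV per table. The sketch below is only an assumption about how this could be organized: `convert_folder`, `parse_args`, and the `extract_tables` callable are hypothetical names, and `extract_tables` stands in for whatever extraction tool is actually used (it takes a PDF path and returns a list of tables, each a list of rows).

```python
import argparse
import csv
from pathlib import Path


def convert_folder(input_dir, output_dir, extract_tables):
    """Run extract_tables on every PDF in input_dir and write one CSV per table.

    extract_tables is a hypothetical callable supplied by the caller:
    it takes a PDF path and returns a list of tables (lists of rows).
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        for index, table in enumerate(extract_tables(pdf_path), start=1):
            out_path = output_dir / f"{pdf_path.stem}_table{index:03d}.csv"
            with out_path.open("w", newline="", encoding="utf-8") as f:
                csv.writer(f).writerows(table)
            written.append(out_path)
    return written


def parse_args(argv=None):
    """Command-line arguments: where the PDFs are and where the CSVs go."""
    parser = argparse.ArgumentParser(
        description="Convert budget PDFs to one CSV file per table"
    )
    parser.add_argument("input_dir", help="folder with the unpacked PDF documents")
    parser.add_argument("output_dir", help="folder for the generated CSV files")
    return parser.parse_args(argv)
```

Keeping the extractor as a parameter means the same script can be rerun on next year's archive, or with a different extraction tool, without changing the file handling.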

Resources

Prepared by

The Open Data Armenia team prepared this task

dkagramanyan commented 1 year ago

I can help with converting the PDF files to CSV/JSON/Excel, but only in about 1.5 weeks. I need to complete several study projects before exams.

dkagramanyan commented 1 year ago

@ivbeg Hi! I processed one pdf document. Is everything okay? If not, please tell me where there is a mistake. Tables are available on google drive and on my repository

ivbeg commented 1 year ago

@dkagramanyan Hi! Yes, it looks great! P.S. repository looks private, so I've checked only google drive documents

dkagramanyan commented 1 year ago

Added 30 new tables. About 50 tables left to process. Loaded new data to the same gdrive

ansakoy commented 1 year ago

@dkagramanyan thanks a lot! If you could possibly share the code that did the trick it would be just perfect (the repo you referred to above is private, as @ivbeg pointed out). Meanwhile, best of luck with your exams.

dkagramanyan commented 1 year ago

@ansakoy in fact, I didn't use any code to parse those tables. I converted the PDF files with FineReader and then made corrections manually. But I think this method can't be used for the remaining 50 tables, as it is very time-consuming. I will have to come up with some kind of automatic method.

P.S. now repository is public

dkagramanyan commented 1 year ago

Hi! I have successfully completed parsing of Yerevan budget 2023. The data is available in my repository and on gdrive. Links are in my previous comments

dkagramanyan commented 1 year ago

@ivbeg Hi! Is everything OK with the data?

ansakoy commented 1 year ago

@dkagramanyan Thanks a lot, David, this was really useful. The data look fine to me. Ivan has been away on business, so he could not respond promptly, but he will as soon as possible.

ivbeg commented 1 year ago

@dkagramanyan Hi David! Sorry, I was on a business trip for some time and was unable to answer. Yes, it looks great! Thanks a lot.