[EN] Convert Yerevan budget documents from PDF to machine-readable CSV/XML/JSON data

ivbeg commented 1 year ago

Goal

The goal is to create a dataset of the Yerevan city budget for further analysis and visualization. This is possible now because the budget is published as a set of PDF documents.

Tasks

The Yerevan city budget for 2023 and the report on the execution of the 2022 budget will be published on the city website https://www.yerevan.am/hy/finance/ as a set of archives with PDF documents inside.
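A minimal sketch of unpacking the downloaded archives, assuming they are ZIP files (the issue does not state the archive format). The function name `unpack_pdfs` and the flat output layout are illustrative choices, not part of the task description.

```python
import zipfile
from pathlib import Path


def unpack_pdfs(archive_path, out_dir):
    """Extract every PDF from a ZIP archive into out_dir and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():
            # Skip folders and anything that is not a PDF.
            if info.is_dir() or not info.filename.lower().endswith(".pdf"):
                continue
            # Flatten nested folders: keep only the file name.
            # Files with the same name in different folders would overwrite each other.
            target = out_dir / Path(info.filename).name
            target.write_bytes(archive.read(info))
            extracted.append(target)
    return sorted(extracted)
```

If the archives turn out to be in another format (for example RAR), this step would need a different library.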

These documents have a text layer that can be processed to extract tables.

For an example of such a table, see page 49 of the 2022 budget report: https://www.yerevan.am/uploads/media/default/0002/19/b257858f7a9940c75efc4a98acb88e949dd6e554.pdf
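One possible way to read tables from the text layer is sketched below. It assumes the third-party `pdfplumber` library, which the issue does not mention; it is just one of the tools that could fit. The helper names `clean_table` and `extract_tables` are hypothetical, and real budget tables may need extra handling for merged cells or multi-line headers.

```python
def clean_table(rows):
    """Normalise one extracted table: collapse whitespace, turn None into '', drop blank rows."""
    cleaned = []
    for row in rows:
        cells = [" ".join((cell or "").split()) for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def extract_tables(pdf_path, page_number):
    """Return the cleaned tables found on a 1-based page of the PDF."""
    # Third-party dependency (assumption); imported lazily so clean_table works without it.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]  # pdf.pages is 0-indexed
        return [clean_table(table) for table in page.extract_tables()]
```

For the example above, the call would be `extract_tables("report_2022.pdf", 49)`, where `report_2022.pdf` is a placeholder name for the downloaded file.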

These tables should be extracted and converted to Excel, CSV, or JSON files, with one file per table.
It would be great if the table headers were translated from Armenian to English. For example, Եկամտատեսակները should be written as "income" in the Excel or CSV file.
It would be even better if you could also convert past budgets: the city budgets for 2018-2022 are available as sub-pages at the same link.
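The export and header translation steps could look roughly like this. It is a sketch using only the Python standard library, and it assumes each table arrives as a list of rows whose first row holds the headers. The glossary contains only the one translation given in this issue; the names `HEADER_TRANSLATIONS`, `translate_headers`, and `write_table` are made up for illustration.

```python
import csv
import json
from pathlib import Path

# Glossary of header translations. Only this entry comes from the issue text;
# the rest would have to be filled in by someone who reads Armenian.
HEADER_TRANSLATIONS = {
    "Եկամտատեսակները": "income",
}


def translate_headers(headers, glossary=HEADER_TRANSLATIONS):
    """Map Armenian headers to English, leaving unknown headers unchanged."""
    return [glossary.get(h.strip(), h.strip()) for h in headers]


def write_table(rows, out_dir, name):
    """Write one table (first row = headers) as <name>.csv and <name>.json."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    headers = translate_headers(rows[0])
    body = rows[1:]

    csv_path = out_dir / f"{name}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(headers)
        writer.writerows(body)

    json_path = out_dir / f"{name}.json"
    records = [dict(zip(headers, row)) for row in body]
    # ensure_ascii=False keeps any remaining Armenian text readable in the file.
    json_path.write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return csv_path, json_path
```

Calling `write_table` once per extracted table, with a distinct `name` each time, gives one file per table. Leaving unknown headers untranslated makes gaps in the glossary easy to spot in the output.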

Context

The Yerevan budget was published only as a set of Armenian-language text/PDF documents, without any machine-readable version or even an Excel file.

To convert the PDF files to Excel or CSV/JSON, you could use ABBYY FineReader, Tabula, or any other suitable tool.

Requirements

Create a public GitHub repository to store the code and data under a free and open license, such as a Creative Commons license or the MIT license.

Wishes

Please write reusable code that someone else can run later, since we may need to update this dataset.
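One way to keep the code rerunnable is a small command-line entry point, sketched here with the standard `argparse` module. The argument names, the `--year` option, and the output naming scheme are assumptions made for illustration; the issue does not prescribe any of them.

```python
import argparse
from pathlib import Path


def build_parser():
    """Command-line interface for a hypothetical conversion script."""
    parser = argparse.ArgumentParser(
        description="Convert budget PDFs to CSV/JSON tables."
    )
    parser.add_argument("input_dir", type=Path, help="folder with downloaded PDF files")
    parser.add_argument("output_dir", type=Path, help="folder for the generated tables")
    parser.add_argument("--year", type=int, help="budget year, used in output file names")
    return parser


def plan_outputs(input_dir, output_dir, year=None):
    """List (pdf, output prefix) pairs without reading or writing any files."""
    prefix = f"{year}_" if year else ""
    return [
        (pdf, output_dir / f"{prefix}{pdf.stem}")
        for pdf in sorted(input_dir.glob("*.pdf"))
    ]
```

A script built on this could be run as `python convert.py pdfs/ data/ --year 2023`, where `convert.py` is a placeholder name; updating the dataset for a new year would then mean downloading the new PDFs and rerunning the same command.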

Resources

Yerevan city budget page https://www.yerevan.am/hy/finance/