Open ivbeg opened 1 year ago
I can help with converting the PDF files to CSV/JSON/Excel, but only in about 1.5 weeks. I need to finish several study projects before exams.
@ivbeg Hi! I processed one PDF document. Is everything okay? If not, please tell me where the mistake is. The tables are available on Google Drive and in my repository.
@dkagramanyan Hi! Yes, it looks great! P.S. repository looks private, so I've checked only google drive documents
Added 30 new tables. About 50 tables are left to process. Uploaded the new data to the same gdrive.
@dkagramanyan thanks a lot! If you could possibly share the code that did the trick it would be just perfect (the repo you referred to above is private, as @ivbeg pointed out). Meanwhile, best of luck with your exams.
@ansakoy In fact, I didn't use any code to parse those tables. I converted the PDF files with FineReader and then corrected them manually. But I think this method can't be used for the remaining 50 tables, as it is very time-consuming. I will have to come up with some kind of automatic method.
P.S. The repository is now public.
Hi! I have successfully completed parsing of Yerevan budget 2023. The data is available in my repository and on gdrive. Links are in my previous comments
@ivbeg Hi! Is everything OK with the data?
@dkagramanyan Thanks a lot, David, this was really useful. The data look fine to me. Ivan has been away on business, so he could not respond promptly, but he will as soon as possible.
@dkagramanyan Hi David! Sorry, I was on a business trip for some time and was unable to answer. Yes, it looks great! Thanks a lot.
Goal
The goal is to create a dataset of the Yerevan city budget for further analysis and visualization. This can be done now, since the budget is published as a set of PDF documents.
Tasks
The Yerevan city budget for 2023 and the report on the execution of the 2022 budget will be published on the city website https://www.yerevan.am/hy/finance/ as a set of archives with PDF documents inside.
These documents have a text layer that can be processed to extract tables.
These tables look like page 49 of the 2022 budget report: https://www.yerevan.am/uploads/media/default/0002/19/b257858f7a9940c75efc4a98acb88e949dd6e554.pdf
Context
The Yerevan budget was published as a set of Armenian-only text/PDF documents, without any machine-readable version or even an Excel file.
To convert the PDF files to Excel or CSV/JSON, you could use ABBYY FineReader, Tabula, or any other tool that helps.
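As a rough illustration of the "any other tool" route, here is a minimal Python sketch. It assumes the third-party pdfplumber library (not one of the tools named above) and a PDF with a usable text layer. The function names and the output file naming are hypothetical, and real budget tables will probably need extra cleanup for merged cells and multi-line headers.

```python
import csv
from pathlib import Path


def clean_rows(table):
    """Normalize one extracted table: None -> "", collapse whitespace, drop empty rows."""
    cleaned = []
    for row in table:
        cells = [" ".join((cell or "").split()) for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def extract_tables(pdf_path):
    """Yield (page_number, table_index, rows) for every table pdfplumber detects."""
    # Third-party dependency, assumed to be installed: pip install pdfplumber
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for table_index, table in enumerate(page.extract_tables(), start=1):
                rows = clean_rows(table)
                if rows:
                    yield page_number, table_index, rows


def write_csv(rows, out_path):
    """Write rows to a UTF-8 CSV file so Armenian text survives intact."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


def pdf_to_csv(pdf_path, out_dir):
    """Save each detected table as <pdf stem>_p<page>_t<index>.csv and return the paths."""
    written = []
    stem = Path(pdf_path).stem
    for page_number, table_index, rows in extract_tables(pdf_path):
        out_path = Path(out_dir) / f"{stem}_p{page_number}_t{table_index}.csv"
        write_csv(rows, out_path)
        written.append(out_path)
    return written
```

Calling `pdf_to_csv("report.pdf", "out")` (both names are placeholders) would write one CSV per detected table. Keeping one file per page and table makes it easier to check each result against the original PDF page by hand.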
Requirements
Wishes
Please write your code so that it is reusable and can be run by someone else later, since we may need to update this dataset.
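One possible shape for such reusable code is a small batch runner that takes an input folder and an output folder, so a future update only needs a new set of PDFs. This is a sketch under assumptions, not a required layout: all names here are hypothetical, and `placeholder_converter` stands in for whatever extraction step is actually used.

```python
import argparse
from pathlib import Path


def convert_all(input_dir, output_dir, convert_one):
    """Run convert_one(pdf_path, output_dir) on every .pdf file under input_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    # Sorted order keeps repeated runs reproducible.
    for pdf_path in sorted(input_dir.rglob("*.pdf")):
        convert_one(pdf_path, output_dir)
        processed.append(pdf_path)
    return processed


def placeholder_converter(pdf_path, output_dir):
    """Stand-in converter: only reports the file; replace with real extraction."""
    print(f"would convert {pdf_path} into {output_dir}")


def parse_args(argv=None):
    """Parse the two folder arguments from argv (defaults to the command line)."""
    parser = argparse.ArgumentParser(description="Convert budget PDFs to tables.")
    parser.add_argument("input_dir", help="folder with the unpacked PDF documents")
    parser.add_argument("output_dir", help="folder for the generated files")
    return parser.parse_args(argv)


def main(argv=None):
    """Command-line entry point; returns the number of PDFs processed."""
    args = parse_args(argv)
    processed = convert_all(args.input_dir, args.output_dir, placeholder_converter)
    return len(processed)
```

To use this as a script, call `main()` from an `if __name__ == "__main__":` guard. Passing the converter in as a function keeps the folder handling separate from the extraction tool, so the tool can be swapped without touching the rest.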
Resources
Prepared by
The Open Data Armenia team prepared this task