okfnepal / nepal-budget-2074

Opening data of Nepal Budget 2074.

Suggestion: Use tabula-py #2

Closed amitness closed 7 years ago

amitness commented 7 years ago

If you have experience with Python and Pandas, you could try tabula-py instead of the GUI, if that is what you are using currently.

Here's an example:

import tabula

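# Get the table on page 128 of the PDF as a dataframe.
# Note: tabula-py 2.0+ returns a list of dataframes by default, so take the first one.
df = tabula.read_pdf('budget.pdf', pages=128)[0]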

# Post-processing
# Optional: this step can also be done in R or Excel
#---------------------------------------------------
# Remove total row
df.drop([17], axis=0, inplace=True)

# Remove blank column of total
df.drop(df.columns[1], axis=1, inplace=True)

# Remove the leading number from the 'STG Code' column
df['STG Code'] = df['STG Code'].apply(lambda x: x.split(' ', 1)[1])

#-----------------------------------------------------

# Export table to CSV
df.to_csv('data.csv', index=False)
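
One caveat on the 'STG Code' step: `x.split(' ', 1)[1]` raises an IndexError for any cell that has no space in it, and an AttributeError for empty cells that pandas reads as NaN. Here's a minimal sketch of a more forgiving helper. The name `strip_leading_number` and the sample strings are made up for illustration; they are not taken from the budget PDF.

```python
def strip_leading_number(value):
    """Drop the first space-separated token from a cell, if there is one."""
    # Non-string cells (e.g. NaN from an empty cell) are returned unchanged.
    if not isinstance(value, str):
        return value
    text = value.strip()
    parts = text.split(' ', 1)
    # Only drop the first token when something follows it.
    return parts[1].strip() if len(parts) == 2 else text


# Illustrative values only
print(strip_leading_number('12 Sample label'))  # Sample label
print(strip_leading_number('Total'))            # Total
```

It should drop into the snippet above as `df['STG Code'] = df['STG Code'].apply(strip_leading_number)`.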
kshitizkhanal7 commented 7 years ago

Thank you @amitness for the suggestion! I will try this. If this makes it easier to clean the data, I am going to use it. Cleaning the tables in the PDF is a hassle.