openspending / cameroon.openspending.org

Website for "Cameroon Budget Inquirer"
http://cameroon.openspending.org/
7 stars 10 forks source link

Problems with the council budgets' datasets #40

Open vitorbaptista opened 10 years ago

vitorbaptista commented 10 years ago

I'm creating this issue just to document any problems we find in the datasets

vitorbaptista commented 10 years ago

I found some odd and repeated head-accounts and sub-accounts that even though have the same ID, have different descriptions. They're:

Head-account Description
110 Reserves
110 Uncategorized
7.20 Proceeds from exploitation of council properties and services

(The last one is just an example of a head-account that doesn't follow the pattern of a 3-digit number.)

Sub-account Description
670.112 Cotisations professionnelles aux synidcats (2009 only)
670.112 Fonctionnemen commisse de Passation des marche (2008 only)
713.118 (only 2008) Other indirect council taxes
713.118 (only 2009) "Proceeds from taxes on on mineral exploitation, fishery ressources etc."
713.119 (only 2009) Other indirect council taxes
Total Sub-Head 6.12
vitorbaptista commented 10 years ago

I'm trying to find a set of columns that uniquely identify each row. These are the columns in the dataset:

Head-account Head-Account Description Sub-account Sub-account Description Year Reporting Type Amount Revenue/Expenditure Recurrent/Investment Expense Type Region Council Division type
... ... ... ... ... ... ... ... ... ... ... ... ...

Given that we're talking about annual budgets in each account, I would expect to, at least, be able to not have the Amount in the key (as there shouldn't be multiple budgets for the same account in the same place in the same year).

Testing with Tignere, Yaoundé 2, and Yaoundé 6's data, I could find many lines that are identical, which I assume we can safely ignore and upload just one. Adding all columns to the primary key except the Amount makes us loose one row:

Head-account Head-Account Description Sub-account Sub-account Description Year Reporting Type Amount Revenue/Expenditure Recurrent/Investment Expense Type Region Council Division type
612 Other services used Total Sub-Head 6.12 2011 Budget 0.0 EXPENDITURE RECURRENT Centre Yaoundé 6 COUNCIL
612 Other services used Total Sub-Head 6.12 2011 Budget 0 EXPENDITURE RECURRENT Centre Yaoundé 6 COUNCIL

The problem is that the Amount in one row is 0.0, and in the other it's 0. We should be safe ignoring this.

So, the unique columns are all columns except Amount. I'm not sure how it'll work for Head-account and Sub-account, as they're measures (i.e. they have both an id and a label). I'm not sure if, when using them as part of the primary key, OpenSpending will take in account both id and label, or just the id. I'll check.

vitorbaptista commented 10 years ago

There're many amounts that are in scientific notation, as 4.789109E7 and 2.0E7. They need to be converted to regular integers or floats before loading into OpenSpending.

vitorbaptista commented 10 years ago

Answering my previous question, OpenSpending expects all measures with the same id to be the same, so when adding it as part of the unique key, it won't take into account its label, but only its id. This being the case, we end up missing a few rows, as there're head-accounts and sub-accounts with the same id but different labels. They're:

Head-account Head-Account Description Sub-account Sub-account Description Year Reporting Type Amount Revenue/Expenditure Recurrent/Investment Expense Type Region Council Division type
670 Other expenses and sundry loses 670.112 Cotisations professionnelles aux synidcats (2009 only) 2008 Actual 0.0 EXPENDITURE RECURRENT Adamaoua Tignere COUNCIL
670 Other expenses and sundry loses 670.112 Cotisations professionnelles aux synidcats (2009 only) 2008 Budget 0.0 EXPENDITURE RECURRENT Adamaoua Tignere COUNCIL
670 Other expenses and sundry loses 670.112 Cotisations professionnelles aux synidcats (2009 only) 2009 Actual 185000.0 EXPENDITURE RECURRENT Adamaoua Tignere COUNCIL
670 Other expenses and sundry loses 670.112 Cotisations professionnelles aux synidcats (2009 only) 2009 Budget 0.0 EXPENDITURE RECURRENT Centre Yaoundé 2 COUNCIL
670 Other expenses and sundry loses 670.112 Cotisations professionnelles aux synidcats (2009 only) 2009 Budget 0.0 EXPENDITURE RECURRENT Centre Yaoundé 6 COUNCIL
670 Other expenses and sundry loses 670.112 Cotisations professionnelles aux synidcats (2009 only) 2009 Budget 2400000.0 EXPENDITURE RECURRENT Adamaoua Tignere COUNCIL
670 Other expenses and sundry loses 670.112 Cotisations professionnelles aux synidcats (2009 only) 2010 Budget 0.0 EXPENDITURE RECURRENT Centre Yaoundé 2 COUNCIL
670 Other expenses and sundry loses 670.112 Cotisations professionnelles aux synidcats (2009 only) 2010 Budget 0.0 EXPENDITURE RECURRENT Centre Yaoundé 6 COUNCIL
670 Other expenses and sundry loses 670.112 Cotisations professionnelles aux synidcats (2009 only) 2011 Budget 0.0 EXPENDITURE RECURRENT Centre Yaoundé 6 COUNCIL
670 Other expenses and sundry loses 670.112 Cotisations professionnelles aux synidcats (2009 only) 2012 Budget 0.0 EXPENDITURE RECURRENT Centre Yaoundé 2 COUNCIL
670 Other expenses and sundry loses 670.112 Cotisations professionnelles aux synidcats (2009 only) 2012 Budget 0.0 EXPENDITURE RECURRENT Centre Yaoundé 6 COUNCIL
670 Other expenses and sundry loses 670.112 Fonctionnemen commisse de Passation des marche (2008 only) 2008 Actual 424000.0 EXPENDITURE RECURRENT Adamaoua Tignere COUNCIL
670 Other expenses and sundry loses 670.112 Fonctionnemen commisse de Passation des marche (2008 only) 2008 Budget 1800000.0 EXPENDITURE RECURRENT Adamaoua Tignere COUNCIL
670 Other expenses and sundry loses 670.112 Fonctionnemen commisse de Passation des marche (2008 only) 2009 Actual 0.0 EXPENDITURE RECURRENT Adamaoua Tignere COUNCIL
670 Other expenses and sundry loses 670.112 Fonctionnemen commisse de Passation des marche (2008 only) 2009 Budget 0.0 EXPENDITURE RECURRENT Adamaoua Tignere COUNCIL
670 Other expenses and sundry loses 670.112 Fonctionnemen commisse de Passation des marche (2008 only) 2009 Budget 0.0 EXPENDITURE RECURRENT Centre Yaoundé 2 COUNCIL
670 Other expenses and sundry loses 670.112 Fonctionnemen commisse de Passation des marche (2008 only) 2009 Budget 0.0 EXPENDITURE RECURRENT Centre Yaoundé 6 COUNCIL
670 Other expenses and sundry loses 670.112 Fonctionnemen commisse de Passation des marche (2008 only) 2010 Budget 0.0 EXPENDITURE RECURRENT Centre Yaoundé 2 COUNCIL
670 Other expenses and sundry loses 670.112 Fonctionnemen commisse de Passation des marche (2008 only) 2010 Budget 0.0 EXPENDITURE RECURRENT Centre Yaoundé 6 COUNCIL
670 Other expenses and sundry loses 670.112 Fonctionnemen commisse de Passation des marche (2008 only) 2011 Budget 0.0 EXPENDITURE RECURRENT Centre Yaoundé 6 COUNCIL
670 Other expenses and sundry loses 670.112 Fonctionnemen commisse de Passation des marche (2008 only) 2012 Budget 0.0 EXPENDITURE RECURRENT Centre Yaoundé 2 COUNCIL
670 Other expenses and sundry loses 670.112 Fonctionnemen commisse de Passation des marche (2008 only) 2012 Budget 0.0 EXPENDITURE RECURRENT Centre Yaoundé 6 COUNCIL

The issue seems that in 2008, the sub-account 670.112 was Fonctionnemen commisse de Passation des marche, and in 2009 it turned into Cotisations professionnelles aux synidcats. This would be kind of OK, but I might be wrong because both sub-accounts have rows in years from 2008 to 2012. It might mean some other thing, then.

Another option would be to treat the account's ids and their description as two separate attributes, which is a odd but shouldn't have any disadvantage other than making our model a bit confusing.

vitorbaptista commented 10 years ago

So, to solve the problem with different head and sub-accounts having the same id, I created the attributes headaccount and subaccount as a measure (with their id and description), but also another headaccount_description and subaccount_description attributes with just their descriptions. I prefered to keep the headaccount and subaccount as measures because that's what they really are (but to be honest I'm not sure about other implications). These *_description attributes should (ideally) be used only on the unique key.

vitorbaptista commented 10 years ago

There're repeated rows both when you take all columns in account, and then you take all except Amount. I'm not sure if it's a problem with the PDF extraction, or if it's related to the data itself.

@pzwsk Could you check it?

vitorbaptista commented 10 years ago

This will be used as input for the guide created at #51. We can close it when we finish that.