okfn-brasil / serenata-de-amor

🕵 Artificial Intelligence for social control of public administration | **This repository does not receive frequent updates. Check out the README**
https://serenata.ai/en
MIT License

Inconsistency in `meal price outlier classifier` #489

Open sergiomario opened 5 years ago

sergiomario commented 5 years ago

What is the problem? In some tweets posted by Rosie's Twitter account, the value flagged as suspicious is within the standard deviation established by the classifier.

As can be verified in the following suspicion (links: Tweet, Jarbas, Document), in this case the value is only 34.50 BRL.

How can this be addressed? I think it is necessary to adjust the classifier rules or improve the training set.

cuducos commented 5 years ago

Arguably this example does not show what you claim it does.

1. Average value for this venue is R$ 13

Using Jarbas's shell_plus, we can see that claims for this venue are low in total value; on average, each claim amounts to R$ 13.22:

In [1]: import statistics

In [2]: values = tuple(r.total_net_value for r in Reimbursement.objects.filter(cnpj_cpf='05467695000130'))

In [3]: sum(values) / len(values)
Out[3]: Decimal('13.22262773722627737226277372')

2. Standard deviation is R$ 6

Also, the standard deviation is around R$ 6.66 (hey devil 😈):

In [4]: statistics.stdev(values)
Out[4]: Decimal('6.655636628195527505758211117')

3. Thus R$ 34 happens to be above the threshold

Thus the threshold (the mean plus three standard deviations) is R$ 33.19, below the example value of R$ 34.05:

In [5]: (sum(values) / len(values)) + (3 * statistics.stdev(values))
Out[5]: Decimal('33.18953762181285988953740707')
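The same check can be sketched outside the Django shell. This is only an illustration of the mean-plus-three-standard-deviations rule computed above, not Rosie's actual classifier code; the function names `outlier_threshold` and `is_suspicious` and the sample values are hypothetical.

```python
from decimal import Decimal
import statistics


def outlier_threshold(values, k=3):
    """Return mean + k * sample standard deviation of the given values."""
    mean = statistics.mean(values)
    return mean + k * statistics.stdev(values)


def is_suspicious(value, values, k=3):
    """Flag a value that lies above the threshold for its venue."""
    return value > outlier_threshold(values, k)


# Hypothetical claim values for one venue (not the real data from Jarbas).
sample_values = [Decimal('10'), Decimal('12'), Decimal('14')]

print(outlier_threshold(sample_values))
print(is_suspicious(Decimal('34.05'), sample_values))
```

With a low average and a small spread, the threshold is also low, so even a modest claim can end up above it.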

cuducos commented 5 years ago

Also we can check the (arguably) low values in Jarbas: https://jarbas.serenata.ai/dashboard/chamber_of_deputies/reimbursement/?q=05467695000130

cuducos commented 5 years ago

What I mean is… Rosie has good accuracy, but 100% is impossible. This example seems more like a false positive than a bug. Sure, we can learn from this example and improve the classifier, let's say, by asking it to only consider venues with averages greater than a certain minimum limit ; )
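One way the suggested minimum limit might look is sketched below. It is an assumption about how such a rule could work, not a change that exists in Rosie; the name `is_suspicious_with_floor`, the `min_mean` parameter, and its default of R$ 20 are all hypothetical.

```python
from decimal import Decimal
import statistics


def is_suspicious_with_floor(value, values, k=3, min_mean=Decimal('20')):
    """Apply the mean + k * stdev rule only to venues whose average claim
    is at least min_mean (a hypothetical cut-off); cheaper venues are skipped."""
    mean = statistics.mean(values)
    if mean < min_mean:
        # Venue is too cheap on average: do not flag anything.
        return False
    return value > mean + k * statistics.stdev(values)


# Hypothetical claim values for a low-priced venue (not real data).
cheap_venue = [Decimal('10'), Decimal('12'), Decimal('14')]

print(is_suspicious_with_floor(Decimal('34.05'), cheap_venue))
```

The trade-off is that a floor like this would also hide genuine outliers at cheap venues, so the cut-off would need to be chosen with care.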