Open sergiomario opened 5 years ago
Arguably this example does not show what you claim it does.
1. Average value for this venue is R$ 13
Using Jarbas's shell_plus, we can see that the claims for this venue are low in terms of total value/price; on average each claim sums to R$ 13.22:
In [1]: import statistics
In [2]: values = tuple(r.total_net_value for r in Reimbursement.objects.filter(cnpj_cpf='05467695000130'))
In [3]: sum(values) / len(values)
Out[3]: Decimal('13.22262773722627737226277372')
2. Standard deviation is R$ 6
Also, the standard deviation is around R$ 6.66 (hey devil 😈):
In [4]: statistics.stdev(values)
Out[4]: Decimal('6.655636628195527505758211117')
3. Thus R$ 34 happens to be above the threshold
Thus the threshold (the mean plus three standard deviations) is R$ 33.19, which is below the example value of R$ 34.05:
In [5]: (sum(values) / len(values)) + (3 * statistics.stdev(values))
Out[5]: Decimal('33.18953762181285988953740707')
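To make the rule easier to try outside Jarbas, here is a minimal, self-contained sketch of the same calculation (mean plus three sample standard deviations). The function name `outlier_threshold` and the sample values are made up for illustration; they are not Rosie's actual code nor the real reimbursements for this CNPJ.

```python
import statistics
from decimal import Decimal


def outlier_threshold(values, n_stdevs=3):
    """Return mean + n_stdevs * sample standard deviation of the values."""
    mean = sum(values) / len(values)
    return mean + n_stdevs * statistics.stdev(values)


# Made-up claims for a cheap venue: twenty small values plus one larger claim.
sample = [Decimal(v) for v in (10, 12, 14, 16)] * 5 + [Decimal("34.05")]

threshold = outlier_threshold(sample)

# Claims above the threshold are the ones this rule would flag.
flagged = [v for v in sample if v > threshold]
print(threshold, flagged)
```

With this made-up sample the threshold lands a bit above 29, so only the 34.05 claim is flagged, even though it is a small amount in absolute terms.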
Also we can check the (arguably) low values in Jarbas: https://jarbas.serenata.ai/dashboard/chamber_of_deputies/reimbursement/?q=05467695000130
What I mean is… Rosie has good accuracy, but 100% is impossible. This example seems more like a false positive than a bug. Sure, we can learn from this example and improve the classifier, let's say, by asking it to only consider venues with averages greater than a certain minimum limit ; )
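Just to illustrate the idea of a minimum limit, here is a rough sketch, not a patch for Rosie. The name `is_suspicious` and the cutoff `MIN_VENUE_MEAN` (R$ 50) are hypothetical and picked arbitrarily; the real value would need to be chosen from the data.

```python
import statistics
from decimal import Decimal

# Hypothetical cutoff: venues whose average claim is below this are ignored.
MIN_VENUE_MEAN = Decimal("50")


def is_suspicious(value, venue_values, min_mean=MIN_VENUE_MEAN, n_stdevs=3):
    """Flag value only if the venue is expensive enough and value is an outlier.

    venue_values must contain at least two claims, since the sample
    standard deviation is undefined for a single value.
    """
    mean = sum(venue_values) / len(venue_values)
    if mean < min_mean:
        # Cheap venue: skip it, even if the value is far from the mean.
        return False
    return value > mean + n_stdevs * statistics.stdev(venue_values)


# Made-up data: a cheap venue and an expensive one with the same shape.
cheap = [Decimal(v) for v in (10, 12, 14, 16)] * 5 + [Decimal("34.05")]
expensive = [Decimal(v) for v in (100, 120, 140, 160)] * 5 + [Decimal("340.50")]

print(is_suspicious(Decimal("34.05"), cheap))
print(is_suspicious(Decimal("340.50"), expensive))
```

In this sketch the cheap venue is never flagged because its average is below the cutoff, while the expensive venue still goes through the usual mean plus three standard deviations check.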
What is the problem? In some of the posts published by Rosie's Twitter account, the value identified as suspect is within the standard deviation established by the classifier.
As can be verified in the following suspicion (links: Suspicions, Tweet, Jarbas, Document). In this case the value is only 34.50 BRL.
How can this be addressed? I think it is necessary to adjust the classifier rules or improve the training set.