Problem
When analysing the nutritional values of a category, we often find outliers: products whose nutritional values do not fit the category. This can have several causes: the values were transcribed incorrectly, the product belongs to another category, the values are given per serving, or the values are simply unrealistic.
Proposed solution
These outliers are easy to spot visually in a scatterplot of two nutritional values, where an outlier lies far from the other products. It would however be nice to detect them automatically, which can be done using the statistics of the category. From the 10th and 90th percentiles of the category we can define minimum and maximum outlier limits. Together these limits define the nutritional envelope of the category. Any product with a value below the minimum or above the maximum should be looked at in detail and, if possible, repaired (either its values or its category).
Outlier limits are usually based on the quartiles, but the same approach works with other percentiles. First calculate the distance between the 90th and 10th percentiles. This interpercentile range (IPR) is then used to define the lower envelope, which is the 10th percentile minus the IPR (or zero if that is negative), and the upper envelope, which is the 90th percentile plus the IPR (or 100 if that is exceeded). This outlier detection should be automated and result in a data quality error. The product can then be repaired or flagged (folksonomy).
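The envelope calculation described above could be sketched as follows. This is only an illustration, not an existing implementation: the function names `nutritional_envelope` and `find_outliers`, the `floor` and `ceiling` parameters, and the sample values are all made up for the example. The choice of the "inclusive" percentile method is also an assumption, and the ceiling of 100 only makes sense for nutrients expressed in g per 100 g.

```python
from statistics import quantiles


def nutritional_envelope(values, floor=0.0, ceiling=100.0):
    """Return the (lower, upper) outlier limits for one nutrient of one category."""
    # Deciles of the category; the first is the 10th percentile, the last the 90th.
    deciles = quantiles(values, n=10, method="inclusive")
    p10, p90 = deciles[0], deciles[-1]
    ipr = p90 - p10  # interpercentile range
    lower = max(floor, p10 - ipr)    # never below zero
    upper = min(ceiling, p90 + ipr)  # never above 100
    return lower, upper


def find_outliers(values, floor=0.0, ceiling=100.0):
    """Return the values that fall outside the nutritional envelope."""
    lower, upper = nutritional_envelope(values, floor, ceiling)
    return [v for v in values if v < lower or v > upper]


# Hypothetical fat values (g per 100 g) for a single category.
sample = [3.0, 3.2, 3.5, 3.5, 3.6, 3.8, 4.0, 4.1, 4.2, 35.0]
print(find_outliers(sample))  # [35.0]
```

In a real check, each value would need to stay linked to its product so that the data quality error can be attached to the right entry.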
Additional context
Applying this to the entire database at once is not practical: it would result in far too many errors. It is better to start with selected, cleaned-up categories. These cleaned-up categories could be marked with a flag in the taxonomy.