micronutrientsupport / database-architecture

The Postgres database code for the MAPS tool
3 stars 0 forks source link

Data Precision #353

Open rbroth opened 1 year ago

rbroth commented 1 year ago

Teams discussion

@Fuhji which fields have you had problems with?

rbroth commented 1 year ago

GitLab MR: https://kwvmxgit.ad.nerc.ac.uk/bmgf-maps/data/db-test-data/-/merge_requests/78

rbroth commented 1 year ago

@LouiseAnder @kpitten

Importing excel data (the intervention costings data) in the database has introduced false precision due to the way computers store and calculate numbers and the way the libraries that connect our code to excel work. As a results, we have a whole bunch of numbers with a bajillion numbers after the decimal in the database. This is obviously not correct.

To fix it, there's two approaches:

  1. Just round all numbers to two decimal points. Takes 20 minutes to do (I've already got a merge request in for it). Doesn't fully reflect Katie's intention. We can improve it later if it's not good enough.
  2. Dig into the code of the library, figure out how numbers and decimal points are handled, write code to retrieve excel formatting information and convert it to python, etc. Will likely take at least hours of time.

I'm inclined to go with solution 1 for the time being and procrastinate on doing solution 2.

Yay/Nay?

kpitten commented 1 year ago

Hi Roman,

Hmm…most numbers in the cost spreadsheets should have no decimals (so as not to appear more precise than they are), and a few really do need decimals (e.g., fortification levels for vitamin B12, which would never be anything larger than 0.xxx, full time equivalents, etc). Would it be possible to round everything to the nearest whole number and then manually change specific fields to have decimals? Louise, what do you think?

Best, Katie

rbroth commented 1 year ago

Would it be possible to round everything to the nearest whole number and then manually change specific fields to have decimals?

Very possible, though would obviously also be lots of work. And we wouldn't be able to do it until we release, since we're re-importing the data from the spreadsheets continuously.

Beyond decimals, there are other instances of too many significant figures e.g. the total 10-year cost is reported as $3,908,368,096 in the excel sheet (maybe we can get a $96 discount somewhere and make it an even $3,908,368,000 😉).

I figured that, at least for version 1, having a few too many decimals might not be too bad as long as it allows us to get rid of numbers like 3424.3184146701915