open-sdg / sdg-build

Python package to convert SDG-related data and metadata between formats
MIT License

Some source data is getting rounded #171

Open LucyGwilliamAdmin opened 3 years ago

LucyGwilliamAdmin commented 3 years ago

This csv data file has been uploaded: https://github.com/ONSdigital/sdg-data/blob/16-1-4_lucytest/data/indicator_16-1-4.csv

But then some of the numbers are being rounded during the build: http://sdgdev-813006012.eu-west-1.elb.amazonaws.com/data/16-1-4_lucytest/en/data/16-1-4.csv

This means that when a user downloads the source CSV, it's different to the CSV that was uploaded.

LucyGwilliamAdmin commented 3 years ago

@brockfanning ok - I've looked into this a little more

Now, comparing the CSV that was uploaded with the CSV that's downloaded from the platform, there are some values where the last digit has changed. For example, in the 10th row of data (the 11th row if counting the header), the value in the uploaded CSV is 70.2624289702269 but the value in the downloaded CSV is 70.2624289702268.

I have attached a file with two value columns (the value that was uploaded and the value in the downloaded file), along with a column showing whether they match: 16.1.4-test.xlsx
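The same comparison could also be scripted rather than done in a spreadsheet. This is only a minimal sketch and not part of sdg-build: `find_mismatches` is a hypothetical helper, the `Year` and `Value` column names are assumed, and the inline CSV text is made up for illustration apart from the two values quoted above. Values are compared as raw strings so that no number parsing can hide a difference.

```python
import csv
import io


def find_mismatches(uploaded_text, downloaded_text, column="Value"):
    """Return (row_number, uploaded, downloaded) for each differing value.

    Hypothetical helper: compares the raw strings in `column`, row by row.
    """
    uploaded = csv.DictReader(io.StringIO(uploaded_text))
    downloaded = csv.DictReader(io.StringIO(downloaded_text))
    mismatches = []
    # start=2 so that row numbers count the header line as row 1
    for row_number, (up, down) in enumerate(zip(uploaded, downloaded), start=2):
        if up[column] != down[column]:
            mismatches.append((row_number, up[column], down[column]))
    return mismatches


# Illustrative data only; the column names and the second row are assumptions.
uploaded_csv = "Year,Value\n2015,70.2624289702269\n2016,71.5\n"
downloaded_csv = "Year,Value\n2015,70.2624289702268\n2016,71.5\n"

mismatches = find_mismatches(uploaded_csv, downloaded_csv)
for row_number, up, down in mismatches:
    print(f"row {row_number}: uploaded {up}, downloaded {down}")
```

In practice the two strings would come from reading the uploaded file and the built file from disk.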

jwestw commented 3 years ago

Old issue I know, but I just thought I would comment.

We can look into the code behind this and make sure that the statistics are being faithfully reported. These checks should also be occurring during unit tests.

Rounding can differ between languages and computation methods. So it may be that the data came from Excel and is being put on the site using Python, and the last few digits come out different as a result, or something of that nature.
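To illustrate the general idea (not to claim this is what the build actually does), here is a small Python sketch using the value from the example above. It shows that a float does not store the typed decimal exactly, and that writing the number back out with fewer digits than the source changes the text. The `".12f"` format is an arbitrary choice for illustration, not something taken from sdg-build.

```python
from decimal import Decimal

text = "70.2624289702269"  # value quoted earlier in this thread
as_float = float(text)

# The exact binary value a float stores is not the decimal that was typed.
stored_exactly = Decimal(as_float)
differs_from_text = stored_exactly != Decimal(text)

# Python's repr() is designed to round-trip: parsing it gives the same float.
round_trips = float(repr(as_float)) == as_float

# Writing with fewer digits than the source changes the text that comes out.
# (".12f" is an arbitrary format, assumed here purely for illustration.)
shortened = format(as_float, ".12f")

print(stored_exactly)
print(differs_from_text, round_trips, shortened)
```

Which step, if any, alters the last digit in the build would still need to be confirmed by looking at how the CSV is read and written.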

It's also worth thinking about whether the difference between the numbers matters. Here the data refers to the percentage of people across the whole country. So the difference of 0.0000000000001% in the example accounts for much less than 1 person out of 60.18 million at the time (roughly 0.00000006 of a person).

LucyGwilliamAdmin commented 3 years ago

@jwestw thanks - I have considered some of these points.

It might be true that the values are changing because of different languages, but that's exactly the thing we want to prevent.

I understand your point: the difference in figures probably isn't much in the grand scheme of things, but it's important that the "Source" data is the source data.
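One way to express that requirement as a check is sketched below, under the assumption that values can be carried through as text rather than parsed into floats. `rewrite_csv_preserving_values` is a hypothetical function, not an sdg-build API, and the sample CSV uses assumed column names.

```python
import csv
import io


def rewrite_csv_preserving_values(source_text):
    """Read a CSV and write it back without converting any cell to a number.

    Hypothetical sketch: every cell stays a string, so digits cannot change.
    """
    reader = csv.reader(io.StringIO(source_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Assumed column names; the value is the one quoted in this thread.
source_csv = "Year,Value\n2015,70.2624289702269\n"
rebuilt_csv = rewrite_csv_preserving_values(source_csv)

# The kind of assertion a unit test could make about the "Source" download.
assert rebuilt_csv == source_csv
```

A test along these lines would fail as soon as any step in the build altered a digit of the source values.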