nptscot / npt

Data processing code, also use this repo for issue tracking for the Network Planning Tool. See https://nptscot.github.io for development version
https://www.npt.scot/
GNU Affero General Public License v3.0

Broadband data uses 0 for voids rather than out-of-range value #385

Closed mvl22 closed 4 months ago

mvl22 commented 5 months ago

Currently the absence of broadband data results in a data value of 0, but this is within the valid range of 0 to 100, so it cannot be distinguished from real data, and the real data has to use 0.01:

https://github.com/nptscot/nptscot.github.io/blob/7a1f11928f0b296037644f11cf0a27ecafb6ebd2/js/layer_control.js#L206-L207

This should be changed so that voids use a known out-of-range value, e.g. -9999 or whatever you wish to standardise on.

Currently the website code has to have special handling to deal with this inconsistency.

(This issue became more overt as a result of the refactoring work.)
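To illustrate the request, here is a minimal sketch of how an out-of-range void value would let front-end code tell "no data" apart from a genuine 0% without special handling. It is not the actual website code: the constant `VOID_VALUE`, the helper `describeBroadband`, and the value -9999 itself are assumptions for illustration only.

```javascript
// Hypothetical sentinel for "no data"; -9999 is only the example value
// suggested above, not an agreed standard.
const VOID_VALUE = -9999;

// Hypothetical helper: turn a broadband percentage into a display label.
// Real values are assumed to lie in the range 0 to 100 inclusive.
function describeBroadband(value) {
  if (value === VOID_VALUE) {
    return 'No data';
  }
  if (typeof value !== 'number' || Number.isNaN(value) || value < 0 || value > 100) {
    throw new RangeError(`Unexpected broadband value: ${value}`);
  }
  return `${value}%`;
}

console.log(describeBroadband(0));          // prints "0%"
console.log(describeBroadband(53.8));       // prints "53.8%"
console.log(describeBroadband(VOID_VALUE)); // prints "No data"
```

With a sentinel outside 0 to 100, a real 0 no longer needs a stand-in such as 0.01, and the same void check could be shared across datasets.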

mvl22 commented 5 months ago

Just to flag up that this issue is holding up various refactoring work, as I can't generalise the value lists yet.

@Robinlovelace assigned mvl22 2 weeks ago

I don't think this assignment to me is correct - this is an upstream data issue that someone in the data team needs to address.

Robinlovelace commented 5 months ago

Re-assigned. @mem48 do you know where in the codebase these values come from?

mem48 commented 5 months ago

It's from the SIMD data

Robinlovelace commented 5 months ago

@mvl22 how do you know that "the absence of broadband data results in a data value 0"?

These are the raw values:

table(zones$broadband)

  0%   1%  10% 100%  11%  12%  13%  14%  15%  16%  17%  18%  19%   2%  20%  21% 
4451  393   54    6   43   53   52   32   39   33   33   39   34  217   28   31 
 22%  23%  24%  25%  26%  27%  28%  29%   3%  30%  31%  32%  33%  34%  35%  36% 
  28   25   16   29   22   21   17   17  131   19   16   16   23   13   23   25 
 37%  38%  39%   4%  40%  41%  42%  43%  44%  45%  46%  47%  48%  49%   5%  50% 
  12   16   21  121   21   17   12   15   17   16   10   15   19   12   85    7 
 51%  52%  53%  54%  55%  56%  57%  58%  59%   6%  60%  61%  62%  63%  64%  65% 
  11    7    8   19   10   12    9   10    9   69    8   11    6    6    6   10 
 66%  67%  68%  69%   7%  70%  71%  72%  73%  74%  75%  76%  77%  78%  79%   8% 
   6   10    8   14   67   10    5    9    9    8    5    4    6    5    8   65 
 80%  81%  82%  83%  84%  85%  86%  87%  88%  89%   9%  90%  91%  93%  94%  95% 
   2    7   11    6    2    7    6    7    5    2   55    2    7    1    1    2 
 96%  97%  98% 
   3    2    1

Robinlovelace commented 5 months ago

As shown here, the results do look a bit strange, but according to @mem48 there's no evidence that they are wrong:

image

Robinlovelace commented 5 months ago

Reprex and closing:

u = "https://statistics.gov.scot/downloads/cube-table?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Fscottish-index-of-multiple-deprivation---broadband-access-indicator"
b = readr::read_csv(u)
#> Rows: 6976 Columns: 7
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (5): FeatureCode, FeatureName, FeatureType, Measurement, Units
#> dbl (2): DateCode, Value
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(b)
#> # A tibble: 6 × 7
#>   FeatureCode FeatureName   FeatureType    DateCode Measurement Units      Value
#>   <chr>       <chr>         <chr>             <dbl> <chr>       <chr>      <dbl>
#> 1 S01006511   Culter - 06   2011 Data Zone     2019 Percent     Percent o…  5.3 
#> 2 S01006512   Culter - 07   2011 Data Zone     2019 Percent     Percent o… 53.8 
#> 3 S01006506   Culter - 01   2011 Data Zone     2019 Percent     Percent o… 10.5 
#> 4 S01006507   Culter - 02   2011 Data Zone     2019 Percent     Percent o…  1.36
#> 5 S01006510   Culter - 05   2011 Data Zone     2019 Percent     Percent o…  0.31
#> 6 S01006528   Garthdee - 03 2011 Data Zone     2019 Percent     Percent o…  1.78
summary(b)
#>  FeatureCode        FeatureName        FeatureType           DateCode   
#>  Length:6976        Length:6976        Length:6976        Min.   :2019  
#>  Class :character   Class :character   Class :character   1st Qu.:2019  
#>  Mode  :character   Mode  :character   Mode  :character   Median :2019  
#>                                                           Mean   :2019  
#>                                                           3rd Qu.:2019  
#>                                                           Max.   :2019  
#>  Measurement           Units               Value        
#>  Length:6976        Length:6976        Min.   :  0.000  
#>  Class :character   Class :character   1st Qu.:  0.000  
#>  Mode  :character   Mode  :character   Median :  0.000  
#>                                        Mean   :  7.441  
#>                                        3rd Qu.:  3.803  
#>                                        Max.   :100.000
sum(b$Value == 0)
#> [1] 3971

Created on 2024-02-01 with reprex v2.1.0

mvl22 commented 4 months ago

What is the null value now?

Is 0 now in place meaning genuinely none?

Please let me know when this version is reflected in a build uploaded to the tile server somewhere (and the URL), and I will update the definitions in the refactor code branch to reflect this change.

Robinlovelace commented 4 months ago

There are no null values, as per the content above. 0 means, and has always meant, 0 as far as I can tell. Why did you think 0 meant NA, @mvl22?

mvl22 commented 4 months ago

Why did you think 0 meant NA?

I thought this came up in the opening meeting we had. Malcolm, according to my recollection, was explaining that the value of 0.01 was having to be used in the code here because the real zero meant void data:

https://github.com/nptscot/nptscot.github.io/blob/dev/src/datasets.js#L302

Robinlovelace commented 4 months ago

I'm not sure what the issue is on the backend. As far as I can see we're not modifying anything: this is faithful to the raw data. Happy to re-open if this is a confirmed issue, but having looked at the data I cannot see any problem with what is provided to the front-end, so I'm closing for now until there's a clearly articulated ask. My ask: what change are you hoping to see, and in which file, so we can confirm when this is done?