warint / statcanR

R Package to connect to Statistics Canada's open data portal
https://warint.github.io/statcanR/

Parsing error warning #4

Open dmurdoch opened 11 months ago

dmurdoch commented 11 months ago

When I download the CPI table, I get a warning about a parsing issue:

statcanR::statcan_data("18-10-0006-01", "eng")
#> statcanR: downloading remote table.
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
#> Rows: 47 Columns: 10
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (10): Cube Title, Product Id, CANSIM Id, URL, Cube Notes, Archive Status...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#>         REF_DATE    GEO          DGUID
#>    1: 1992-01-01 Canada 2016A000011124
#>    2: 1992-01-01 Canada 2016A000011124
#>    3: 1992-01-01 Canada 2016A000011124
#>    4: 1992-01-01 Canada 2016A000011124
#>    5: 1992-01-01 Canada 2016A000011124
#>   ---                                 
#> 4209: 2023-11-01 Canada 2016A000011124
#> 4210: 2023-11-01 Canada 2016A000011124
#> 4211: 2023-11-01 Canada 2016A000011124
#> 4212: 2023-11-01 Canada 2016A000011124
#> 4213: 2023-11-01 Canada 2016A000011124
#>                                           Products and product groups      UOM
#>    1:                                                       All-items 2002=100
#>    2:                                                            Food 2002=100
#>    3:                                                         Shelter 2002=100
#>    4:                 Household operations, furnishings and equipment 2002=100
#>    5:                                           Clothing and footwear 2002=100
#>   ---                                                                         
#> 4209:                                        Health and personal care 2002=100
#> 4210:                               Recreation, education and reading 2002=100
#> 4211: Alcoholic beverages, tobacco products and recreational cannabis 2002=100
#> 4212:                                        All-items excluding food 2002=100
#> 4213:                             All-items excluding food and energy 2002=100
#>       UOM_ID SCALAR_FACTOR SCALAR_ID    VECTOR COORDINATE VALUE STATUS SYMBOL
#>    1:     17         units         0 v41690914        1.1  83.1     NA     NA
#>    2:     17         units         0 v41690915        1.2  82.0     NA     NA
#>    3:     17         units         0 v41690916        1.3  87.6     NA     NA
#>    4:     17         units         0 v41690917        1.4  87.7     NA     NA
#>    5:     17         units         0 v41690918        1.5  94.1     NA     NA
#>   ---                                                                        
#> 4209:     17         units         0 v41690920        1.7 147.3     NA     NA
#> 4210:     17         units         0 v41690921        1.8 129.0     NA     NA
#> 4211:     17         units         0 v41690922        1.9 193.3     NA     NA
#> 4212:     17         units         0 v41690923        1.1 154.0     NA     NA
#> 4213:     17         units         0 v41690924       1.11 149.4     NA     NA
#>       TERMINATED DECIMALS                                          INDICATOR
#>    1:         NA        1 Consumer Price Index, monthly, seasonally adjusted
#>    2:         NA        1 Consumer Price Index, monthly, seasonally adjusted
#>    3:         NA        1 Consumer Price Index, monthly, seasonally adjusted
#>    4:         NA        1 Consumer Price Index, monthly, seasonally adjusted
#>    5:         NA        1 Consumer Price Index, monthly, seasonally adjusted
#>   ---                                                                       
#> 4209:         NA        1 Consumer Price Index, monthly, seasonally adjusted
#> 4210:         NA        1 Consumer Price Index, monthly, seasonally adjusted
#> 4211:         NA        1 Consumer Price Index, monthly, seasonally adjusted
#> 4212:         NA        1 Consumer Price Index, monthly, seasonally adjusted
#> 4213:         NA        1 Consumer Price Index, monthly, seasonally adjusted

Created on 2024-01-01 with reprex v2.0.2

The vroom advice to "call problems() on your data frame for details" doesn't work, because those details have been removed by the time the dataset is returned. I also don't see a way to follow the advice to "Specify the column types or set show_col_types = FALSE to quiet this message."

dmurdoch commented 11 months ago

I've taken a closer look, and I see this in the metadata file being read here:

"Cube Title","Product Id","CANSIM Id",URL,"Cube Notes","Archive Status",Frequency,"Start Reference Period","End Reference Period","Total number of dimensions"
"Consumer Price Index, monthly, seasonally adjusted","18100006","326-0022","https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1810000601",1;2;3;4;6;7;10,"CURRENT - a cube available to the public and that is current","Monthly","1992-01-01","2023-11-01","2",

"Dimension ID","Dimension name","Dimension Notes","Dimension Definitions"
"1","Geography",,""
"2","Products and product groups",10,""

followed by more lines defining other things. I think there are two issues here that cause the warning:

  1. The statcan_data function only uses the first two lines at this point, and shouldn't be reading the rest of the file. This can be fixed by setting n_max = 1 in the read_csv call.
  2. The metadata has 10 fields in the header on line 1, and 10 fields followed by a comma on line 2, so read_csv sees it as 11 fields.
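The field-count mismatch in point 2 can be checked without readr. The following is a minimal sketch using base R's count.fields() on the two metadata lines quoted above; the variable names meta_lines and n_fields are illustrative and not part of statcanR.

```r
# The first two lines of the metadata file, copied from the excerpt above.
meta_lines <- c(
  '"Cube Title","Product Id","CANSIM Id",URL,"Cube Notes","Archive Status",Frequency,"Start Reference Period","End Reference Period","Total number of dimensions"',
  '"Consumer Price Index, monthly, seasonally adjusted","18100006","326-0022","https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1810000601",1;2;3;4;6;7;10,"CURRENT - a cube available to the public and that is current","Monthly","1992-01-01","2023-11-01","2",'
)

# Count comma-separated fields per line; commas inside double quotes are
# not treated as separators.
con <- textConnection(meta_lines)
n_fields <- count.fields(con, sep = ",", quote = "\"")
close(con)

print(n_fields)
```

If the diagnosis is right, the second count should be one larger than the first, because the trailing comma adds an empty field at the end of the data line.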

Problem 2 is harder to deal with. The User Guide https://www.statcan.gc.ca/en/developers/csv/user-guide is unclear about whether the trailing comma is normal or an error at StatCan. It describes two kinds of metadata, one for non-census cubes and one for census cubes, with different numbers of fields (10 vs 12), so reading exactly 10 fields would mess up census cubes.
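One possible way around hard-coding the number of fields, offered as a sketch and not as the package's fix: strip a single trailing comma from the first two lines before parsing, so the same code handles 10-field and 12-field metadata. The helper name read_cube_metadata is hypothetical, and base R's read.csv() stands in for read_csv here to keep the example dependency-free. It assumes the trailing comma never represents a genuinely empty last field, which the User Guide does not confirm.

```r
# Hypothetical helper: parse only the header and the first record of a
# StatCan metadata file, dropping one trailing comma per line if present.
read_cube_metadata <- function(lines) {
  first_two <- sub(",$", "", lines[1:2])  # remove a single trailing comma
  read.csv(
    text = paste(first_two, collapse = "\n"),
    nrows = 1,                 # analogous to n_max = 1 in readr::read_csv
    colClasses = "character",  # keep IDs such as "18100006" as text
    check.names = FALSE        # preserve names such as "Cube Title"
  )
}

# The two metadata lines quoted earlier in this issue.
meta_lines <- c(
  '"Cube Title","Product Id","CANSIM Id",URL,"Cube Notes","Archive Status",Frequency,"Start Reference Period","End Reference Period","Total number of dimensions"',
  '"Consumer Price Index, monthly, seasonally adjusted","18100006","326-0022","https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1810000601",1;2;3;4;6;7;10,"CURRENT - a cube available to the public and that is current","Monthly","1992-01-01","2023-11-01","2",'
)

meta <- read_cube_metadata(meta_lines)
print(names(meta))
```

Because the helper never states how many columns to expect, the header line alone determines the column count.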