ropensci / rfishbase

R interface to the fishbase.org database
https://docs.ropensci.org/rfishbase

species() function returns numeric variables as character vectors #219

Closed wmorgan485 closed 1 year ago

wmorgan485 commented 3 years ago

When using the species() function today, I noticed that some numeric variables are coming in as character <chr> vectors. This can also be seen in examples of the current README document. For example, under the "Getting Data" section, the command species(trout$Species) returned a tibble where all of the variables are <chr>.

My own results are similar but not identical: when I ran the command species(), most of the numeric variables came in correctly, but some (such as SpecCode, DepthRangeShallow, and LongevityWildRef) came in as <chr> vectors.

Thanks for your efforts! Bill

Session Info

```r
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] dplyr_1.0.7     rfishbase_3.1.9

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7        pillar_1.6.2      compiler_4.1.0    dbplyr_2.1.1      prettyunits_1.1.1
 [6] progress_1.2.2    tools_4.1.0       bit_4.0.4         digest_0.6.27     RSQLite_2.2.7
[11] jsonlite_1.7.2    evaluate_0.14     memoise_2.0.0     lifecycle_1.0.0   tibble_3.1.3
[16] pkgconfig_2.0.3   rlang_0.4.11      rstudioapi_0.13   DBI_1.1.1         cli_3.0.1
[21] curl_4.3.2        yaml_2.2.1        xfun_0.24         fastmap_1.1.0     withr_2.4.2
[26] stringr_1.4.0     arkdb_0.0.12      httr_1.4.2        knitr_1.33        hms_1.1.0
[31] generics_0.1.0    vctrs_0.3.8       bit64_4.0.5       tidyselect_1.1.1  glue_1.4.2
[36] R6_2.5.0          gh_1.3.0          fansi_0.5.0       rmarkdown_2.9     bookdown_0.22.3
[41] blob_1.2.2        tzdb_0.1.2        readr_2.0.0       purrr_0.3.4       magrittr_2.0.1
[46] ellipsis_0.3.2    htmltools_0.5.1.1 assertthat_0.2.1  utf8_1.2.2        stringi_1.7.3
[51] cachem_1.0.5      crayon_1.4.1
```

cboettig commented 3 years ago

Yes, apologies. Letting the parser guess types from the database works poorly for sparse data: if readr sees only NA at the top of a column it assumes logical type, and coercion then turns any numeric or chr data further down into NA. So the current fallback mechanism defaults somewhat aggressively to character vectors, since this is lossless and thus easy for the user to fix.
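
For anyone who wants to undo the character fallback on their side, here is a minimal sketch using only base R. It assumes a table in which every column arrived as character; the toy data frame `raw` and its values are invented for illustration and are not real rfishbase output. `utils::type.convert()` re-guesses each column from all of its values, so leading NAs do not mislead it.

```r
# Toy stand-in (invented values) for a sparse table read entirely as character
raw <- data.frame(
  SpecCode          = c("2", "3", "4"),
  DepthRangeShallow = c(NA, NA, "10"),   # sparse: missing at the top
  Species           = c("niloticus", "aureus", "mykiss"),
  stringsAsFactors  = FALSE
)

# Re-guess the type of every column from its full contents;
# as.is = TRUE keeps text columns as character instead of factor
typed <- utils::type.convert(raw, as.is = TRUE)

str(typed)
```

Because the starting point is character, nothing is lost if a guess turns out to be wrong: the column can simply be converted again by hand.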

The next major release will probably move from a tsv backend to a parquet backend, allowing us to preserve types more accurately.

jaseeverett commented 3 years ago

Thanks @cboettig and @wmorgan485

I came across this problem as well. Unfortunately, Sealifebase and RFishbase end up with different column types, so merging them on the fly is also difficult. I wrote a quick function to convert the required columns to numeric. It might help others. (Note that the list of columns may not be exhaustive, but it works for what I need.)

 library(dplyr)

 # Convert columns that should be numeric but arrive as character.
 # The affected columns differ between the two databases.
 fix_species_type <- function(df, server = c("fishbase", "sealifebase")){
    server <- match.arg(server) # Stop early on an unknown server name
    nm <- switch(server,
      fishbase = c("SpecCode", "DepthRangeShallow", "CommonLength", "CommonLengthF", "LongevityWildRef", "MaxLengthRef", "DangerousRef"),
      sealifebase = c("SpecCode", "SpeciesRefNo", "GenCode", "DepthRangeRef", "LongevityWildRef", "Weight")
    )
    # any_of() skips names in `nm` that are absent from `df`
    df %>% 
      mutate(across(any_of(nm), as.numeric)) # Convert `nm` variables to numeric and return the result
  }

  df <- rfishbase::species(server = "fishbase") %>% 
    fix_species_type(server = "fishbase") # Warnings are for converting "NA" to NA

  df <- rfishbase::species(server = "sealifebase") %>% 
    fix_species_type(server = "sealifebase") # Warnings are for converting "NA" to NA
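
The merging problem can be reproduced without touching either database. The following is a rough sketch that assumes dplyr 1.0 or later; the tables `fb_toy` and `slb_toy`, their values, and the helper `to_num()` are all hypothetical and only mimic the mismatch described above.

```r
library(dplyr)

# Toy stand-ins (invented values): the same columns arrive with different types
fb_toy  <- tibble(SpecCode = c("2", "3"), Weight = c(1.5, 2.0))
slb_toy <- tibble(SpecCode = c(10, 11),   Weight = c("0.2", "NA"))

# bind_rows() refuses to combine a character column with a double column
mismatch <- tryCatch(bind_rows(fb_toy, slb_toy), error = function(e) e)
inherits(mismatch, "error")

# Hypothetical helper: coerce the named columns (if present) to numeric
to_num <- function(df, cols) mutate(df, across(any_of(cols), as.numeric))

# After coercion the two tables stack cleanly; the string "NA" becomes NA,
# which is the source of the coercion warning suppressed here
shared <- c("SpecCode", "Weight")
merged <- bind_rows(
  to_num(fb_toy, shared),
  suppressWarnings(to_num(slb_toy, shared))
)
```

Coercing both tables against one shared column list, rather than a separate list per server, is one way to keep the types aligned before merging.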