ncss-tech / soilDB

soilDB: Simplified Access to National Cooperative Soil Survey Databases
http://ncss-tech.github.io/soilDB/
GNU General Public License v3.0
83 stars 19 forks source link

stringsAsFactors default is FALSE on R 4.0+ / not generic solution for factor levels #130

Closed dylanbeaudette closed 2 years ago

dylanbeaudette commented 4 years ago

We are passing around default.stringsAsFactors() to a lot of functions, each with its own arguments, flow-control, and documentation to manage. As far as I can tell (@smroecker please correct me if I am wrong), this is all to handle how uncode() converts or does not convert characters to factors, using the NASIS domains as the default levels. We have discussed at great length in #53, but some issues remain:

Functions to Fix

brownag commented 4 years ago

Saw this coming a couple months ago.

Relevant background: https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/index.html

smroecker commented 4 years ago

Yes I saw that blog also, and expected we would have to update soilDB shortly. uncode() only converts those columns that match the ColumnPhysicalNames from within NASIS's metadata domains into factors. All other character data columns, such as taxonname, are left as characters.

I'm not sure converting the stringsAsFactors argument into an option is the way to go, as that obscures the functionality from your typical user, but I'd be open to if that is the consensus. We should probably still document that functionality in each functions help file details section. If so it won't really reduce the documentation overhead.

I really think it's preferable to maintain parity with base R, therefore I suggest we change the argument name from stringsAsFactors to as.is. Although I think in issue #53 it was decided that the default should be TRUE.

As to the issue of the resulting factor ordering produced by uncode(). I think the best long-term solution would be to create a subroutine within uncode() to correct the ordering for those factors deemed to be 'out of order'. Rather than modify within each and every function where it occurs. The same solution might be used for those variables that should be excluded. I had considered the possibility of going to the source, and fixing the ordering in NASIS with the help of George or Kyle, but... decided that probably wasn't a good idea as it would destroy any backward compatibility with old data.

brownag commented 3 years ago

default.stringsAsFactors() returns FALSE, which means default result of get_cosoilmoist_from_NASISWebReport "status" column value is NULL / all NA. I fixed an example that was breaking in the CI since last week with this stack trace:

  Error: Assigned data `l` must be compatible with existing data.
  x Existing data has 119 rows.
  x Element 9 of assigned data has 0 rows.
  i Only vectors of size 1 are recycled.
  Backtrace:
       x
    1. +-soilDB::get_cosoilmoist_from_NASISWebReport("EVAL - MLRA 111A - Ross silt loam, 0 to 2 percent slopes, frequently flooded")
    2. | \-soilDB:::.cosoilmoist_prep(d.cosoilmoist, impute = impute, stringsAsFactors = stringsAsFactors)
    3. |   +-base::within(...)
    4. |   \-base::within.data.frame(...)
    5. |     +-base::`[<-`(...)
    6. |     \-tibble:::`[<-.tbl_df`(...)
    7. |       \-tibble:::tbl_subassign(x, i, j, value, i_arg, j_arg, substitute(value))
    8. |         \-tibble:::vectbl_recycle_rhs(...)
    9. |           +-base::withCallingHandlers(...)
   10. |           \-vctrs::vec_recycle(value[[j]], nrow)
   11. +-vctrs:::stop_recycle_incompatible_size(...)
   12. | \-vctrs:::stop_vctrs(...)
   13. |   \-rlang::abort(message, class = c(class, "vctrs_error"), ...)
   14. |     \-rlang:::signal_abort(cnd)
   15. |       \-base::signalCondition(cnd)
   16. \-(function (cnd) ...
  Execution halted

Not sure how it suddenly causes a problem, since stringsAsFactors has been FALSE by default since R 4+, and this just started testing bad last week.

Comparison toggling stringsAsFactors back to former default TRUE

daff::diff_data(
  soilDB::get_cosoilmoist_from_NASISWebReport(
    "EVAL - MLRA 111A - Ross silt loam, 0 to 2 percent slopes, frequently flooded",
    stringsAsFactors = FALSE
  ),

  soilDB::get_cosoilmoist_from_NASISWebReport(
    "EVAL - MLRA 111A - Ross silt loam, 0 to 2 percent slopes, frequently flooded",
    stringsAsFactors = TRUE
  )
)
#> Loading required namespace: rvest
#> Warning in eval(substitute(expr), e): NAs introduced by coercion

#> Warning in eval(substitute(expr), e): NAs introduced by coercion
#> Daff Comparison: 'soilDB::get_cosoilmoist_from_NASISWebReport("EVAL - MLRA 111A - Ross silt loam, 0 to 2 percent slopes, frequently flooded", ' '    stringsAsFactors = FALSE)' vs. 'soilDB::get_cosoilmoist_from_NASISWebReport("EVAL - MLRA 111A - Ross silt loam, 0 to 2 percent slopes, frequently flooded", ' '    stringsAsFactors = TRUE)' 
#>   First 6 and last 6 patch lines:
#>     nationalmusym muname                                                   
#> ->  5d8l          Ross silt loam                                           
#> ->  5d8l          Ross silt loam                                           
#> ->  5d8l          Ross silt loam                                           
#> ->  5d8l          Ross silt loam                                           
#> ->  5d8l          Ross silt loam                                           
#> ->  5d8l          Ross silt loam                                           
#> ... ...           ...                                                      
#> ->  2w56h         Ross silt loam, 0 to 2 percent slopes, frequently flooded
#> ->  2w56h         Ross silt loam, 0 to 2 percent slopes, frequently flooded
#> ->  2w56h         Ross silt loam, 0 to 2 percent slopes, frequently flooded
#> ->  2w56h         Ross silt loam, 0 to 2 percent slopes, frequently flooded
#> ->  2w56h         Ross silt loam, 0 to 2 percent slopes, frequently flooded
#> ->  2w56h         Ross silt loam, 0 to 2 percent slopes, frequently flooded
#>     dmuiid coiid   compname comppct_r drainagecl  month
#> ->  119696 238422  Ross     100       well        jan  
#> ->  119696 238422  Ross     100       well        feb  
#> ->  119696 238422  Ross     100       well        mar  
#> ->  119696 238422  Ross     100       well        apr  
#> ->  119696 238422  Ross     100       well        may  
#> ->  119696 238422  Ross     100       well        jun  
#> ... ...    ...     ...      ...       ...         ...  
#> ->  749825 2494482 Sloan    5         very poorly sep  
#> ->  749825 2494482 Sloan    5         very poorly oct  
#> ->  749825 2494482 Sloan    5         very poorly nov  
#> ->  749825 2494482 Sloan    5         very poorly nov  
#> ->  749825 2494482 Sloan    5         very poorly dec  
#> ->  749825 2494482 Sloan    5         very poorly dec  
#>     flodfreqcl               floddurcl  pondfreqcl                ponddurcl
#> ->  Not populated->frequent  very brief Not populated->none       <NA>     
#> ->  Not populated->frequent  very brief Not populated->none       <NA>     
#> ->  Not populated->frequent  very brief Not populated->none       <NA>     
#> ->  Not populated->frequent  very brief Not populated->none       <NA>     
#> ->  Not populated->frequent  very brief Not populated->none       <NA>     
#> ->  Not populated->very rare very brief Not populated->none       <NA>     
#> ... ...                      ...        ...                       ...      
#> ->  Not populated            <NA>       Not populated             <NA>     
#> ->  Not populated            <NA>       Not populated             <NA>     
#> ->  Not populated->frequent  brief      Not populated->occasional brief    
#> ->  Not populated->frequent  brief      Not populated->occasional brief    
#> ->  Not populated->frequent  brief      Not populated->occasional brief    
#> ->  Not populated->frequent  brief      Not populated->occasional brief    
#>     cosoilmoistiid dept_l  dept_r  dept_h  depb_l  depb_r  depb_h 
#> ->  NA             NA->201 NA->201 NA->201 NA->201 NA->201 NA->201
#> ->  NA             NA->201 NA->201 NA->201 NA->201 NA->201 NA->201
#> ->  NA             NA->201 NA->201 NA->201 NA->201 NA->201 NA->201
#> ->  NA             NA->201 NA->201 NA->201 NA->201 NA->201 NA->201
#> ->  NA             NA->201 NA->201 NA->201 NA->201 NA->201 NA->201
#> ->  NA             NA->201 NA->201 NA->201 NA->201 NA->201 NA->201
#> ... ...            ...     ...     ...     ...     ...     ...    
#> ->  23327632       0       0       0       200     200     200    
#> ->  23327616       0       0       0       200     200     200    
#> ->  23327627       0       0       0       0       7       15     
#> ->  23327628       0       7       15      200     200     200    
#> ->  23327630       0       0       0       0       7       15     
#> ->  23327629       0       7       15      200     200     200    
#>     status             
#> ->  NULL->Not populated
#> ->  NULL->Not populated
#> ->  NULL->Not populated
#> ->  NULL->Not populated
#> ->  NULL->Not populated
#> ->  NULL->Not populated
#> ... ...                
#> ->  NULL->moist        
#> ->  NULL->moist        
#> ->  NULL->moist        
#> ->  NULL->wet          
#> ->  NULL->moist        
#> ->  NULL->wet
smroecker commented 3 years ago

Thanks Andrew. I looked into and fixed. I'll try and examine how do deal with all of the other functions shortly.