slu-openGIS / postmastr

R package for Processing and Parsing Untidy Street Addresses
https://slu-opengis.github.io/postmastr/
GNU General Public License v3.0

pm_streetSuf_parse() fails to identify many street suffixes #23

Open aaronmams opened 2 years ago

aaronmams commented 2 years ago

Describe the bug

pm_streetSuf_parse() does not identify many street suffixes as illustrated in the vignette.

I suspect this failure is related to the package's current inability to identify unit numbers.

Specific example: the pm_streetSuf_parse() method does not identify the street suffix "Drive" in the address, "310 Westline Drive, APT. 201B"

Expected Behavior

I expected the street suffixes "DRIVE", "HWY", "RD", and "ROAD" in the example below to be identified and parsed.

I have verified that these string values are present in the street suffix dictionary.

To Reproduce

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(postmastr)
library(stringr)
eidl_addresses <- c("98-199 KAMEHAMEHA HWY. E1","8928 S. LACLEDE STATION RD.","9785 MACKENZIE ROAD, SUITE 100","29805 MARLIS ST",
                    "310 WESTLINE DRIVE, APT. 201B")
addresses <- data.frame(eidl_addresses)
addresses <- addresses %>% pm_identify(var = eidl_addresses)
addresses <- addresses %>% pm_prep(var=eidl_addresses,type="short")
addresses <- addresses %>% pm_house_parse()
addresses
#> # A tibble: 5 x 3
#>   pm.uid pm.address               pm.house
#>    <int> <chr>                    <chr>   
#> 1      1 KAMEHAMEHA HWY. E1       98-199  
#> 2      2 S. LACLEDE STATION RD.   8928    
#> 3      3 MACKENZIE ROAD SUITE 100 9785    
#> 4      4 MARLIS ST                29805   
#> 5      5 WESTLINE DRIVE APT. 201B 310
addresses <- addresses %>% pm_streetDir_parse()
addresses
#> # A tibble: 5 x 4
#>   pm.uid pm.address               pm.house pm.preDir
#>    <int> <chr>                    <chr>    <chr>    
#> 1      1 KAMEHAMEHA HWY. E1       98-199   <NA>     
#> 2      2 LACLEDE STATION RD.      8928     S        
#> 3      3 MACKENZIE ROAD SUITE 100 9785     <NA>     
#> 4      4 MARLIS ST                29805    <NA>     
#> 5      5 WESTLINE DRIVE APT. 201B 310      <NA>
addresses <- addresses %>% pm_streetSuf_parse()
addresses
#> # A tibble: 5 x 5
#>   pm.uid pm.address               pm.house pm.preDir pm.streetSuf
#>    <int> <chr>                    <chr>    <chr>     <chr>       
#> 1      1 KAMEHAMEHA HWY. E1       98-199   <NA>      <NA>        
#> 2      2 LACLEDE STATION RD.      8928     S         <NA>        
#> 3      3 MACKENZIE ROAD SUITE 100 9785     <NA>      <NA>        
#> 4      4 MARLIS                   29805    <NA>      St          
#> 5      5 WESTLINE DRIVE APT. 201B 310      <NA>      <NA>
addresses <- addresses %>% pm_street_parse()
#> Error in get(genname, envir = envir) : object 'testthat_print' not found
addresses
#> # A tibble: 5 x 5
#>   pm.uid pm.house pm.preDir pm.street                pm.streetSuf
#>    <int> <chr>    <chr>     <chr>                    <chr>       
#> 1      1 98-199   <NA>      Kamehameha Hwy E1        <NA>        
#> 2      2 8928     S         Laclede Station Rd       <NA>        
#> 3      3 9785     <NA>      Mackenzie Road Suite 100 <NA>        
#> 4      4 29805    <NA>      Marlis                   St          
#> 5      5 310      <NA>      Westline Drive Apt 201b  <NA>
pm_dictionary(type="suffix")[str_detect("HWY",pm_dictionary(type="suffix")$suf.input),]
#> # A tibble: 2 x 3
#>   suf.type suf.input suf.output
#>   <chr>    <chr>     <chr>     
#> 1 Highway  HWY       Hwy       
#> 2 Way      WY        Way
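
A side note on the dictionary check above: `str_detect()` takes the string first and the pattern second, so that call treats every `suf.input` value as a pattern and tests it against "HWY". That is why "WY" shows up alongside "HWY". The sketch below contrasts that with an exact comparison. It uses a small hand-built data frame (`suffix_dict`, a stand-in and not real `pm_dictionary()` output) so that it runs without postmastr installed; the column names are copied from the output above and the "Road" row is an illustrative assumption.

```r
library(stringr)

# Stand-in for pm_dictionary(type = "suffix"); hand-built for illustration only.
# The first two rows mirror the output printed above, the third is assumed.
suffix_dict <- data.frame(
  suf.type   = c("Highway", "Way", "Road"),
  suf.input  = c("HWY", "WY", "RD"),
  suf.output = c("Hwy", "Way", "Rd"),
  stringsAsFactors = FALSE
)

# string = "HWY", pattern = each suf.input:
# keeps every entry that occurs somewhere inside "HWY" (so "WY" matches too)
substring_hits <- suffix_dict[str_detect("HWY", suffix_dict$suf.input), ]

# Exact membership: keeps only the entry equal to "HWY"
exact_hits <- suffix_dict[suffix_dict$suf.input == "HWY", ]

print(substring_hits)
print(exact_hits)
```

Either form confirms that "HWY" is in the dictionary; the exact comparison just avoids the extra substring match.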

Note that if the apartment number is removed from the fifth entry, its street suffix is identified:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(postmastr)
library(stringr)
eidl_addresses <- c("98-199 KAMEHAMEHA HWY. E1","8928 S. LACLEDE STATION RD.","9785 MACKENZIE ROAD, SUITE 100","29805 MARLIS ST",
                    "310 WESTLINE DRIVE")
addresses <- data.frame(eidl_addresses)
addresses <- addresses %>% pm_identify(var = eidl_addresses)
addresses <- addresses %>% pm_prep(var=eidl_addresses,type="short")
addresses <- addresses %>% pm_house_parse()
addresses
#> # A tibble: 5 x 3
#>   pm.uid pm.address               pm.house
#>    <int> <chr>                    <chr>   
#> 1      1 KAMEHAMEHA HWY. E1       98-199  
#> 2      2 S. LACLEDE STATION RD.   8928    
#> 3      3 MACKENZIE ROAD SUITE 100 9785    
#> 4      4 MARLIS ST                29805   
#> 5      5 WESTLINE DRIVE           310
addresses <- addresses %>% pm_streetDir_parse()
addresses
#> # A tibble: 5 x 4
#>   pm.uid pm.address               pm.house pm.preDir
#>    <int> <chr>                    <chr>    <chr>    
#> 1      1 KAMEHAMEHA HWY. E1       98-199   <NA>     
#> 2      2 LACLEDE STATION RD.      8928     S        
#> 3      3 MACKENZIE ROAD SUITE 100 9785     <NA>     
#> 4      4 MARLIS ST                29805    <NA>     
#> 5      5 WESTLINE DRIVE           310      <NA>
addresses <- addresses %>% pm_streetSuf_parse()
addresses
#> # A tibble: 5 x 5
#>   pm.uid pm.address               pm.house pm.preDir pm.streetSuf
#>    <int> <chr>                    <chr>    <chr>     <chr>       
#> 1      1 KAMEHAMEHA HWY. E1       98-199   <NA>      <NA>        
#> 2      2 LACLEDE STATION RD.      8928     S         <NA>        
#> 3      3 MACKENZIE ROAD SUITE 100 9785     <NA>      <NA>        
#> 4      4 MARLIS                   29805    <NA>      St          
#> 5      5 WESTLINE                 310      <NA>      Dr
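
Since dropping the apartment number lets the suffix in row 5 parse, one possible interim workaround is to strip unit designators from the raw strings before they reach `pm_identify()`. The following is a minimal sketch using only stringr. `strip_units()` is a hypothetical helper, not part of postmastr, and the designator list (APT, SUITE, STE, UNIT) is an assumption that would need extending for real data. It also removes periods and commas, on the unconfirmed assumption that trailing punctuation such as "RD." contributes to the missed matches. I have not tested whether the cleaned strings parse correctly in postmastr beyond the row 5 case shown above.

```r
library(stringr)

eidl_addresses <- c(
  "98-199 KAMEHAMEHA HWY. E1",
  "8928 S. LACLEDE STATION RD.",
  "9785 MACKENZIE ROAD, SUITE 100",
  "29805 MARLIS ST",
  "310 WESTLINE DRIVE, APT. 201B"
)

# Hypothetical helper (not a postmastr function).
strip_units <- function(x) {
  # Drop a trailing "<designator> <value>" pair such as ", APT. 201B".
  # The designator list is an assumption and is deliberately short.
  unit_pattern <- regex(
    ",?\\s+(APT|SUITE|STE|UNIT)\\.?\\s+\\S+$",
    ignore_case = TRUE
  )
  x <- str_remove(x, unit_pattern)
  # Remove periods and commas (assumed, not confirmed, to interfere with matching).
  x <- str_remove_all(x, "[.,]")
  # Collapse any leftover repeated or trailing whitespace.
  str_squish(x)
}

cleaned <- strip_units(eidl_addresses)
print(cleaned)
```

Note that the bare unit "E1" in the first address has no designator, so this sketch leaves it in place, and the unit values themselves are discarded rather than kept in a separate column.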

Desktop

sessionInfo()
#> R version 4.0.2 (2020-06-22)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19043)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_4.0.2    magrittr_2.0.1    tools_4.0.2       htmltools_0.5.1.1
#>  [5] yaml_2.2.1        stringi_1.5.3     rmarkdown_2.4     highr_0.8        
#>  [9] knitr_1.30        stringr_1.4.0     xfun_0.18         digest_0.6.27    
#> [13] rlang_0.4.10      evaluate_0.14

chris-prener commented 2 years ago

Thanks for reaching out. I can confirm that the units workflow, which is still incomplete, would address this once it is finished. Unfortunately, I don't have a development timeline; this project got back-burnered due to the pandemic.