trinker / qdapRegex

qdapRegex is a collection of regular expression tools associated with the qdap package that may be useful outside of the context of discourse analysis.

rm_number does not remove numbers with comma decimal separator #26

Closed markvanderloo closed 2 years ago

markvanderloo commented 6 years ago
qdapRegex::rm_number("hello 12,5 world")
[1] "hello 12,5 world"

According to the help file it should recognize this:

 ‘rm_number’ - Remove/replace/extract number from a string (works
     on numbers with commas, decimals and negatives).

Here's the sessionInfo

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] qdapRegex_0.7.2

loaded via a namespace (and not attached):
[1] compiler_3.4.2  magrittr_1.5    tools_3.4.2     lubridate_1.6.0
[5] stringi_1.1.5   stringr_1.2.0  
> 
trinker commented 6 years ago

Hi Mark. Thanks for the issue. "Commas" in the help file refers to commas that separate digit groups (e.g., thousands, millions, billions), which is the U.S. convention. When I wrote qdapRegex I included a default U.S. dictionary, with room for growth through additional locale-specific dictionaries contributed by the community. In the README I have:

The functions in qdapRegex work on a dictionary system. The current implementation defaults to a United States flavor of canned regular expressions. Users may submit proposed region specific regular expression dictionaries that contain the same fields as the regex_usa data set or improvements to regular expressions in current dictionaries. Please submit proposed regional regular expression dictionaries via: https://github.com/trinker/qdapRegex/issues

I see there are all sorts of ways decimal marks can be represented. https://en.wikipedia.org/wiki/Decimal_mark

I would love it if you were willing to make a Netherlands-specific dictionary. I/we could blog/tweet about it and the community support, and hopefully get the ball rolling on other locale-specific dictionaries from the community. I'm guessing a lot of the Netherlands dictionary would be the same as the U.S. one I made (e.g., an IP address is a universal thing), while other entries would require only minor tweaks.

So, for your problem, we could take the current U.S. regex and swap the comma and the period using the textclean package's swap function:

library(qdapRegex)
library(textclean)

## make Netherlands pattern (comma and period swapped)
textclean::swap(qdapRegex::grab('rm_number'), ',', '.')
## "(?<=^| )[-,]*\\d+(?:\\,\\d+)?(?= |\\,?$)|\\d+(?:.\\d{3})+(\\,\\d+)*"

## make rm_number function for netherlands
rm_number2 <- rm_(pattern = textclean::swap(qdapRegex::grab('rm_number'), ',', '.'))

rm_number2("hello 12,5 world and another 1.234.567,89")
## [1] "hello world and another"
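
One thing worth noting about the swapped pattern printed above: the swap turns the U.S. thousands comma into a bare `.`, which in a regular expression matches any character, not just a literal period. The base R sketch below illustrates that and shows one possible stricter alternative. The `nl_number` pattern and the variable names are assumptions made for illustration; they are not part of qdapRegex or textclean, and the whitespace cleanup only approximates what the rm_ functions do.

```r
# Pattern copied from the swap() output shown above.
swapped <- "(?<=^| )[-,]*\\d+(?:\\,\\d+)?(?= |\\,?$)|\\d+(?:.\\d{3})+(\\,\\d+)*"

# The unescaped '.' accepts any character as a "thousands separator",
# so a token such as "1a234" is matched as if it were one number.
y <- "1a234"
loose_match <- regmatches(y, regexpr(swapped, y, perl = TRUE))

# Hypothetical stricter pattern (an assumption, not the package's own):
# optional minus sign, digits optionally grouped by literal periods,
# optional comma decimal part, and no digit/period/comma on either side.
nl_number <- "(?<![\\d.,])-?(?:\\d{1,3}(?:\\.\\d{3})+|\\d+)(?:,\\d+)?(?![\\d.,])"

x <- "hello 12,5 world and another 1.234.567,89"

# Extract the numbers.
found <- regmatches(x, gregexpr(nl_number, x, perl = TRUE))[[1]]

# Remove the numbers, then collapse and trim leftover whitespace.
removed <- trimws(gsub("\\s+", " ", gsub(nl_number, "", x, perl = TRUE)))

# With the stricter pattern, "1a234" is no longer matched as a single token.
strict_matches <- regmatches(y, gregexpr(nl_number, y, perl = TRUE))[[1]]

print(loose_match)
print(found)
print(removed)
print(strict_matches)
```

If a pattern along these lines proved useful, it could presumably be passed to rm_(pattern = ...) in the same way as the swapped one, though that combination is untested here.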