Closed markvanderloo closed 2 years ago
Hi Mark. Thanks for the issue. What is meant by commas is the comma separating the denominations (e.g., millions, billions, thousands, hundreds). This is the U.S. convention. When I wrote qdapRegex I included a default U.S. dictionary, with room for growth by adding other locale-specific dictionaries via community support. In the README I have:
The functions in qdapRegex work on a dictionary system. The current implementation defaults to a United States flavor of canned regular expressions. Users may submit proposed region specific regular expression dictionaries that contain the same fields as the regex_usa data set or improvements to regular expressions in current dictionaries. Please submit proposed regional regular expression dictionaries via: https://github.com/trinker/qdapRegex/issues
I see there are all sorts of ways decimal marks can be represented. https://en.wikipedia.org/wiki/Decimal_mark
I would love it if you were willing to make a Netherlands-specific dictionary. I/we could blog/tweet about it and the community support, and hopefully get the ball rolling on other locale-specific dictionaries from the community. I'm guessing a lot of the Netherlands dictionary would be the same as the U.S. one I made (e.g., an IP address is a universal thing), while other entries would require only minor tweaks.
So, for example, with your problem we could use the current regex for the U.S. and just swap out the comma and period using the textclean package's swap function:
library(qdapRegex)
library(textclean)
## make netherlands pattern
textclean::swap(qdapRegex::grab('rm_number'), ',', '.')
## "(?<=^| )[-,]*\\d+(?:\\,\\d+)?(?= |\\,?$)|\\d+(?:.\\d{3})+(\\,\\d+)*"
## make rm_number function for netherlands
rm_number2 <- rm_(pattern = textclean::swap(qdapRegex::grab('rm_number'), ',', '.'))
rm_number2("hello 12,5 world and another 1.234.567,89")
## [1] "hello world and another"
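For anyone who wants to see what the swap does without installing either package, here is a minimal base R sketch. It assumes the U.S. pattern is what you get by reversing the swap on the output shown above (it is not copied from the qdapRegex source), and it uses `chartr()` as a stand-in for `textclean::swap()`. The `loose` and `strict` fragments are hypothetical names used only to illustrate one caveat: after the swap, the thousands separator becomes a bare `.`, which in a regex matches any character.

```r
# Assumption: this U.S. pattern is reconstructed by reversing the swap on the
# output shown above; it is not taken from the qdapRegex source.
us_pattern <- "(?<=^| )[-.]*\\d+(?:\\.\\d+)?(?= |\\.?$)|\\d+(?:,\\d{3})+(\\.\\d+)*"

# chartr() maps "," -> "." and "." -> "," in a single pass (base R).
nl_pattern <- chartr(",.", ".,", us_pattern)

# The swapped pattern as printed earlier in this comment.
nl_expected <- "(?<=^| )[-,]*\\d+(?:\\,\\d+)?(?= |\\,?$)|\\d+(?:.\\d{3})+(\\,\\d+)*"
identical(nl_pattern, nl_expected)

# Caveat: the swapped thousands separator is an unescaped ".", so it matches
# any character, not only a literal period.
loose  <- "\\d+(?:.\\d{3})+"    # fragment as produced by the swap
strict <- "\\d+(?:\\.\\d{3})+"  # hypothetical stricter variant (escaped period)

grepl(loose,  "1x234", perl = TRUE)  # the loose fragment accepts "x" as a separator
grepl(strict, "1x234", perl = TRUE)  # the strict fragment does not
grepl(strict, "1.234", perl = TRUE)  # a real period still matches
```

If that looseness matters for a Netherlands dictionary, escaping the period in the thousands group is one possible tweak; whether it belongs in the dictionary itself is a design choice for the contributor.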
According to the help file it should recognize this:
Here's the sessionInfo