trinker / qdapRegex

qdapRegex is a collection of regular expression tools associated with the qdap package that may be useful outside of the context of discourse analysis.

Extract numbers with digit range #33

Closed bes827 closed 2 years ago

bes827 commented 4 years ago

Hello, this package has been very helpful and efficient; I am able to achieve a lot with much less code than with stringr or other packages. I am trying to perform the task below, which is a bit complex:

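I have a large number of text files that I converted to a dataset, and I am now trying to extract a particular number (a serial number; its unique pattern is the first occurrence of a number with 5 to 8 digits). I tried a couple of approaches, including the sapply function you posted in the previous question, but no luck so far. The issues I am running into are:

- unable to find the regex to define the number of digits ranging (5-8), the example below only includes the 5 digits.
- I believe that the code I use does not search in text after a new line, is there a way to fix that?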

thanks a lot

#create sample dataframe:
text = c("name: xyz, abc  age: 23, serial: 12345, dob: 1/1/2011, other: 0000000" , "name: aaa, bbb 
age: 21, serial: 123456, DOB: 1/2/1234", "name: ccc, ddd 
age:42
number: 1234567
dob: 1/1/111")

df <- data.frame(text)

# attempt to extract the number into a new variable:

library(qdapRegex)

df$serial = sapply(qdapRegex::rm_number(df$text, pattern = "(?<!\\d)\\d{5}(?!\\d)",  extract=TRUE) , `[`, 1)

df$serial

trinker commented 4 years ago

> unable to find the regex to define the number of digits ranging (5-8), the example below only includes the 5 digits.

\\d{5,8} says 5 to 8 digits

> I believe that the code I use does not search in text after a new line, is there a way to fix that?

Not sure what you mean. Can you provide an example that fails and your desired output?
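
For readers landing on this thread, here is a minimal sketch of how the `{5,8}` quantifier might be applied to the sample strings from the question. It uses base R (`regexpr` and `regmatches` with `perl = TRUE`) rather than qdapRegex so that it runs without extra packages; the variable names `pattern`, `m`, and `serial` are illustrative only. The same pattern string could presumably be passed as the `pattern` argument of `rm_number()` in the original attempt. The second and third strings contain newlines, which do not prevent the match here.

```r
# Sample strings from the question; "\n" stands in for the literal line breaks.
text <- c(
  "name: xyz, abc  age: 23, serial: 12345, dob: 1/1/2011, other: 0000000",
  "name: aaa, bbb \nage: 21, serial: 123456, DOB: 1/2/1234",
  "name: ccc, ddd \nage:42\nnumber: 1234567\ndob: 1/1/111"
)

# A run of 5 to 8 digits that is not part of a longer digit run.
pattern <- "(?<!\\d)\\d{5,8}(?!\\d)"

# regexpr() returns the position of the FIRST match in each string (-1 if none).
m <- regexpr(pattern, text, perl = TRUE)

# Fill in matches where found; leave NA where no 5-8 digit number exists.
serial <- rep(NA_character_, length(text))
serial[m != -1] <- regmatches(text, m)

serial
```

Because `regexpr()` only reports the first match per string, the trailing `0000000` in the first string is ignored, which matches the "first occurrence" requirement stated in the question.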

trinker commented 2 years ago

Closing because the OP never responded to the request for clarification.