kbenoit / quanteda.dictionaries

Dictionaries for text analysis

Does liwcalike() handle proper regular expressions? #31

Open cjvanlissa opened 4 years ago

cjvanlissa commented 4 years ago

Dear Dr. Benoit,

I tried to run the following:

txt <- c("The red-shirted lawyer gave her yellow-haired, red nose ex-boyfriend $300
            out of pity:(.")
dict <- quanteda::dictionary(list(lawyer = c("\\blawyer\\b", "law.er")))
liwcalike(txt, dict, what = "word", valuetype = "regex")

But the word lawyer is not matched:

  docname Segment WPS WC Sixltr Dic lawyer AllPunc Period Comma Colon SemiC QMark Exclam Dash Quote
1   text1       1  24 24   8.33   0      0   29.17   4.17  4.17  4.17     0     0      0 12.5     0
  Apostro Parenth OtherP
1       0       0   12.5

Is this expected behavior? To what extent are regular expressions supported by liwcalike() and, downstream, tokens_lookup.tokens()?

Thank you sincerely, Caspar

kbenoit commented 4 years ago

Currently, liwcalike() only takes "glob" dictionary patterns, but it would be a reasonable feature request to add valuetype to the function.

To get the equivalent patterns, you would use:

library("quanteda.dictionaries")

txt <- c("The red-shirted lawyer gave her yellow-haired, 
          red nose ex-boyfriend $300 out of pity:(.")
dict <- quanteda::dictionary(list(lawyer = c("lawyer", "law?er")))
liwcalike(txt, dict)
##   docname Segment WPS WC Sixltr  Dic lawyer AllPunc Period Comma Colon SemiC
## 1   text1       1  24 24   8.33 4.17   4.17   29.17   4.17  4.17  4.17     0
##   QMark Exclam Dash Quote Apostro Parenth OtherP
## 1     0      0 12.5     0       0       0   12.5
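
To make the glob/regex equivalence in the reply above concrete, here is a minimal sketch using only base R. It assumes that glob patterns are matched against whole tokens, and it uses a hand-split token vector (`toks`, a hypothetical stand-in) rather than quanteda's tokenizer, so it illustrates the pattern translation only, not what liwcalike() does internally.

```r
# Glob patterns taken from the reply above
globs <- c("lawyer", "law?er")

# utils::glob2rx() (base R) rewrites each glob as an anchored regex:
# "?" becomes "." and the pattern is wrapped in "^...$"
rx <- utils::glob2rx(globs)
print(rx)

# Hand-split sample tokens, for illustration only (not quanteda's tokenizer)
toks <- c("The", "red-shirted", "lawyer", "gave", "lawyers")

# A token is a hit if either anchored pattern matches it in full
hits <- grepl(paste(rx, collapse = "|"), toks)
print(toks[hits])
```

Because the translated patterns are anchored at both ends, "lawyers" is not matched here. An unanchored regex such as "law.er" would match it, so the two pattern styles are only equivalent when the regex is anchored to the whole token.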

cjvanlissa commented 4 years ago

Thank you for clarifying! I have a dictionary that makes extensive use of Perl regex, so indeed, I would like to put my name down for this feature request :)

Sincerely, Caspar

kbenoit commented 4 years ago

Noted! This will not be hard to add.