quanteda / quanteda

An R package for the Quantitative Analysis of Textual Data
https://quanteda.io
GNU General Public License v3.0

tokens_lookup() not working as expected? #1347

Closed KyleHaynes closed 6 years ago

KyleHaynes commented 6 years ago

Hi,

I'm wondering whether the behavior below is expected, or whether I'm doing or interpreting something incorrectly.

When a token has no match in tokens_lookup(), shouldn't "NA" be returned (rather than "CA") in the example below?

# example texts
txt <- c("12032 Musgrave rd red hill",
         "13 rad street windermore park queensland",
         "130 right road",
         "130 rtn road")
# tokenise txt
toks <- quanteda::tokens(txt)
# create named list of dictionary keys and their values
dic <- list(CR = c("rd", "red"), CB = c("street", "feet"), CA = c("parl", "dark"))
# create dictionary
dict <- quanteda::dictionary(dic)
# apply tokens_lookup; unmatched tokens should be replaced by the nomatch value
quanteda::tokens_lookup(toks, dict, levels = 1, exclusive = TRUE, nomatch = "NA")

tokens from 4 documents.
text1 :
[1] "CA" "CA" "CR" "CR" "CA"

text2 :
[1] "CA" "CA" "CB" "CA" "CA" "CA"

text3 :
[1] "CA" "CA" "CA"

text4 :
[1] "CA" "CA" "CA"
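
For comparison, the result expected from the call above can be sketched in base R without quanteda. This is only an illustration of the intended semantics, not quanteda's implementation: it assumes plain whitespace tokenisation instead of quanteda::tokens() and exact, case-insensitive matching instead of quanteda's pattern matching. The names value_to_key and lookup_sketch are hypothetical.

```r
# Base-R sketch of the expected lookup semantics (not quanteda's implementation).
txt <- c("12032 Musgrave rd red hill",
         "13 rad street windermore park queensland",
         "130 right road",
         "130 rtn road")
dic <- list(CR = c("rd", "red"), CB = c("street", "feet"), CA = c("parl", "dark"))

# Flatten the dictionary into a named vector mapping each value to its key.
value_to_key <- setNames(rep(names(dic), lengths(dic)),
                         unlist(dic, use.names = FALSE))

# Hypothetical helper: split on whitespace (an assumption), lower-case each
# token, and return the dictionary key or the nomatch value for each token.
lookup_sketch <- function(x, nomatch = "NA") {
  toks <- strsplit(x, "\\s+")
  lapply(toks, function(tok) {
    key <- unname(value_to_key[tolower(tok)])
    key[is.na(key)] <- nomatch  # tokens without a dictionary entry
    key
  })
}

expected <- lookup_sketch(txt)
print(expected)
```

Under these assumptions the first text comes out as "NA" "NA" "CR" "CR" "NA", and the last two texts consist only of "NA", since none of their tokens appear in the dictionary.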

I'm currently using quanteda 1.2.0 from CRAN with R 3.5.0.

Thanks,
Kyle

koheiw commented 6 years ago

@KyleHaynes Thanks, it's a bug. I have written a patch to fix this.

KyleHaynes commented 6 years ago

Patch seems to work. Thanks!