quanteda / tutorials.quanteda.io

Quanteda tutorials website
https://tutorials.quanteda.io
MIT License

Add short section on how to import pre-tokenized text #106

Open mrwunderbar666 opened 9 months ago

mrwunderbar666 commented 9 months ago

Hi,

Once in a while I get a dataset that is already pre-tokenized (a data frame with a column of tokens and a doc_id column). Every time that happens, I have to search for a long time to figure out how to coerce that format into something quanteda accepts.

Maybe it is already in the docs, but Google fails me when I try to search for it.

My solution is the following, but I am not sure whether it is the best way:

library(quanteda)
library(BTM)

# example data from the BTM package
data("brussels_reviews_anno")

# cast tokenized data to list
tmp_list <- aggregate(token ~ doc_id, data = brussels_reviews_anno, FUN = "list")

# unpack data and create named list
l <- tmp_list$token
names(l) <- tmp_list$doc_id

# transform to quanteda dfm
converted_corpus <- l |> quanteda::as.tokens() |> 
  quanteda::dfm()

koheiw commented 9 months ago

You should use split() to make the list. @stefan-mueller, let's add this to https://tutorials.quanteda.io/basic-operations/tokens/tokens/

library(quanteda)
#> Package version: 4.0.0
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.

data("brussels_reviews_anno", package = "udpipe")
lis <- split(brussels_reviews_anno$token, brussels_reviews_anno$doc_id)
toks <- as.tokens(lis)
head(toks)
#> Tokens consisting of 6 documents.
#> 10049756 :
#>  [1] "Muy"        "buena"      "estadia"    ","          "la"        
#>  [6] "habitacion" "donde"      "nos"        "hospedamos" "es"        
#> [11] "muy"        "amplia"    
#> [ ... and 189 more ]
#> 
#> 10061484 :
#>  [1] "Muy"             "buen"            "departamento,en" "una"            
#>  [5] "excelente"       "ubicacion"       "."               "Jacques"        
#>  [9] "es"              "un"              "buen"            "buen"           
#> [ ... and 34 more ]
#> 
#> 10066128 :
#>  [1] "Nous"        "avons"       "passe"       "un"          "excellent"  
#>  [6] "sejour"      "dans"        "ce"          "tres"        "joli"       
#> [11] "appartement" "."          
#> [ ... and 43 more ]
#> 
#> 10114635 :
#>  [1] "La"        "casa"      "es"        "muy"       "comoda"    ","        
#>  [7] "estaba"    "muy"       "limpia"    ","         "situacion" "perfecta" 
#> [ ... and 82 more ]
#> 
#> 10120339 :
#>  [1] "Sejour"      "parfait"     "chez"        "Olivier"     "."          
#>  [6] "Un"          "hote"        "attentionne" ","           "disponible" 
#> [11] "et"          "accueillant"
#> [ ... and 34 more ]
#> 
#> 10160362 :
#>  [1] "Apartamento" "que"         "esta"        "cerca"       "del"        
#>  [6] "centro"      "."           "Tiene"       "todo"        "lo"         
#> [11] "que"         "necesitas"  
#> [ ... and 124 more ]
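
From here, going on to a document-feature matrix works the same as for any other tokens object — a short sketch, continuing from the `toks` created above:

```r
# continue from the tokens object created by as.tokens(lis)
dfmat <- quanteda::dfm(toks)   # dfm() lowercases by default
dfmat
```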

kbenoit commented 9 months ago

Since the udpipe output has almost the same structure as that from spacyr::spacy_parse(), it can use the method as.tokens.spacyr_parsed(). If you want the POS tags appended, you have to slightly rename one column of the udpipe output.

library("quanteda")
#> Package version: 4.0.0
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.

# example data from the udpipe package
data("brussels_reviews_anno", package = "udpipe")

toks_plain <- brussels_reviews_anno |>
    getS3method("as.tokens", class = "spacyr_parsed")()
print(toks_plain, 3, 6)
#> Tokens consisting of 1,500 documents.
#> 32198807 :
#> [1] "Gwen"       "fue"        "una"        "magnifica"  "anfitriona"
#> [6] "."         
#> [ ... and 111 more ]
#> 
#> 12919832 :
#> [1] "Aurelie"      "fue"          "muy"          "atenta"       "y"           
#> [6] "comunicativa"
#> [ ... and 41 more ]
#> 
#> 23786310 :
#> [1] "La"        "estancia"  "fue"       "muy"       "agradable" "."        
#> [ ... and 60 more ]
#> 
#> [ reached max_ndoc ... 1,497 more documents ]

toks_pos <- dplyr::rename(brussels_reviews_anno, pos = upos) |>
    getS3method("as.tokens", class = "spacyr_parsed")(include_pos = "pos")
print(toks_pos, 3, 6)
#> Tokens consisting of 1,500 documents.
#> 32198807 :
#> [1] "Gwen/NOUN"      "fue/VERB"       "una/DET"        "magnifica/NOUN"
#> [5] "anfitriona/ADJ" "./PUNCT"       
#> [ ... and 111 more ]
#> 
#> 12919832 :
#> [1] "Aurelie/NOUN"     "fue/VERB"         "muy/ADV"          "atenta/ADJ"      
#> [5] "y/CONJ"           "comunicativa/ADJ"
#> [ ... and 41 more ]
#> 
#> 23786310 :
#> [1] "La/DET"        "estancia/NOUN" "fue/VERB"      "muy/ADV"      
#> [5] "agradable/ADJ" "./PUNCT"      
#> [ ... and 60 more ]
#> 
#> [ reached max_ndoc ... 1,497 more documents ]

Created on 2023-09-29 with reprex v2.0.2

koheiw commented 9 months ago

as.tokens.spacyr_parsed() should really just be as.tokens.data.frame(), so that people can use it more broadly.
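
A minimal sketch of what such a data-frame method could look like, built on the split() approach from above — the function name, signature, and argument names here are hypothetical, not part of quanteda:

```r
# hypothetical sketch, not part of quanteda: a data.frame method for as.tokens()
# assumes one row per token, with configurable document-id and token columns
as.tokens.data.frame <- function(x, docid_field = "doc_id",
                                 token_field = "token", ...) {
    lis <- split(x[[token_field]], x[[docid_field]])
    quanteda::as.tokens(lis, ...)
}

# usage with the udpipe example data:
# data("brussels_reviews_anno", package = "udpipe")
# toks <- as.tokens.data.frame(brussels_reviews_anno)
```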

kbenoit commented 9 months ago

I thought of that too; then we would just need the equivalents of docid_field = "doc_id", tokenid_field = "token", pos_field = "pos", etc. The udpipe output almost matches the spacyr_parsed column names, but not for the POS tag, hence the renaming in my code above.