mrwunderbar666 opened 1 year ago
You should use split() to make a list. @stefan-mueller let's add this to https://tutorials.quanteda.io/basic-operations/tokens/tokens/
library(quanteda)
#> Package version: 4.0.0
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
data("brussels_reviews_anno", package = "udpipe")
lis <- split(brussels_reviews_anno$token, brussels_reviews_anno$doc_id)
toks <- as.tokens(lis)
head(toks)
#> Tokens consisting of 6 documents.
#> 10049756 :
#> [1] "Muy" "buena" "estadia" "," "la"
#> [6] "habitacion" "donde" "nos" "hospedamos" "es"
#> [11] "muy" "amplia"
#> [ ... and 189 more ]
#>
#> 10061484 :
#> [1] "Muy" "buen" "departamento,en" "una"
#> [5] "excelente" "ubicacion" "." "Jacques"
#> [9] "es" "un" "buen" "buen"
#> [ ... and 34 more ]
#>
#> 10066128 :
#> [1] "Nous" "avons" "passe" "un" "excellent"
#> [6] "sejour" "dans" "ce" "tres" "joli"
#> [11] "appartement" "."
#> [ ... and 43 more ]
#>
#> 10114635 :
#> [1] "La" "casa" "es" "muy" "comoda" ","
#> [7] "estaba" "muy" "limpia" "," "situacion" "perfecta"
#> [ ... and 82 more ]
#>
#> 10120339 :
#> [1] "Sejour" "parfait" "chez" "Olivier" "."
#> [6] "Un" "hote" "attentionne" "," "disponible"
#> [11] "et" "accueillant"
#> [ ... and 34 more ]
#>
#> 10160362 :
#> [1] "Apartamento" "que" "esta" "cerca" "del"
#> [6] "centro" "." "Tiene" "todo" "lo"
#> [11] "que" "necesitas"
#> [ ... and 124 more ]
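In miniature, the split() step above does the following (toy data, base R only; as.tokens() then just coerces the resulting named list):

```r
# Toy pre-tokenized data frame in the same shape as brussels_reviews_anno:
# one row per token, with a doc_id column identifying the document.
anno <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2"),
  token  = c("Muy", "buena", "estadia", "Sejour", "parfait")
)

# split() groups the token vector by document id, giving the named list
# of character vectors that as.tokens() accepts.
lis <- split(anno$token, anno$doc_id)
str(lis)
#> List of 2
#>  $ d1: chr [1:3] "Muy" "buena" "estadia"
#>  $ d2: chr [1:2] "Sejour" "parfait"
```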
Since the udpipe output has almost the same structure as that of spacyr::spacy_parse(), it can use the method for as.tokens.spacyr_parsed(). If you want the POS tag appended, you have to slightly rename the udpipe output.
library("quanteda")
#> Package version: 4.0.0
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.
# example data from the udpipe package
data("brussels_reviews_anno", package = "udpipe")
toks_plain <- brussels_reviews_anno |>
  getS3method("as.tokens", class = "spacyr_parsed")()
print(toks_plain, 3, 6)
#> Tokens consisting of 1,500 documents.
#> 32198807 :
#> [1] "Gwen" "fue" "una" "magnifica" "anfitriona"
#> [6] "."
#> [ ... and 111 more ]
#>
#> 12919832 :
#> [1] "Aurelie" "fue" "muy" "atenta" "y"
#> [6] "comunicativa"
#> [ ... and 41 more ]
#>
#> 23786310 :
#> [1] "La" "estancia" "fue" "muy" "agradable" "."
#> [ ... and 60 more ]
#>
#> [ reached max_ndoc ... 1,497 more documents ]
toks_pos <- dplyr::rename(brussels_reviews_anno, pos = upos) |>
  getS3method("as.tokens", class = "spacyr_parsed")(include_pos = "pos")
print(toks_pos, 3, 6)
#> Tokens consisting of 1,500 documents.
#> 32198807 :
#> [1] "Gwen/NOUN" "fue/VERB" "una/DET" "magnifica/NOUN"
#> [5] "anfitriona/ADJ" "./PUNCT"
#> [ ... and 111 more ]
#>
#> 12919832 :
#> [1] "Aurelie/NOUN" "fue/VERB" "muy/ADV" "atenta/ADJ"
#> [5] "y/CONJ" "comunicativa/ADJ"
#> [ ... and 41 more ]
#>
#> 23786310 :
#> [1] "La/DET" "estancia/NOUN" "fue/VERB" "muy/ADV"
#> [5] "agradable/ADJ" "./PUNCT"
#> [ ... and 60 more ]
#>
#> [ reached max_ndoc ... 1,497 more documents ]
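For reference, the token/tag fusion produced by include_pos = "pos" amounts to pasting each token to its tag with a "/" separator before splitting by document. A base-R sketch of that step (toy data, not quanteda's actual internals):

```r
# Toy udpipe-style annotation: one row per token, with a upos tag column.
anno <- data.frame(
  doc_id = c("d1", "d1", "d2"),
  token  = c("Gwen", "fue", "La"),
  upos   = c("PROPN", "VERB", "DET")
)

# Fuse each token with its POS tag, then split by document as before.
tagged <- paste(anno$token, anno$upos, sep = "/")
lis <- split(tagged, anno$doc_id)
lis$d1
#> [1] "Gwen/PROPN" "fue/VERB"
```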
Created on 2023-09-29 with reprex v2.0.2
as.tokens.spacyr_parsed() should just be as.tokens.data.frame() so that people can use it more broadly.
I thought of that too; then we would just need the equivalents of docid_field = "doc_id", tokenid_field = "token", pos_field = "pos", etc. The udpipe output almost matches the spacyr_parsed column names, but not for the POS tag, hence the renaming in my code above.
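A generic coercion along those lines could look roughly like this. The function name and the docid_field/tokenid_field/pos_field arguments are hypothetical, taken from the discussion above; the sketch only builds the named list that quanteda::as.tokens() already accepts, and is not quanteda's actual implementation:

```r
# Hypothetical generic: coerce any pre-tokenized data frame (one row per
# token) to the named-list form that quanteda::as.tokens() accepts, with
# configurable column names as suggested above.
df_to_token_list <- function(x,
                             docid_field   = "doc_id",
                             tokenid_field = "token",
                             pos_field     = NULL) {
  toks <- x[[tokenid_field]]
  # Optionally append the POS tag, as include_pos = "pos" does for spacyr.
  if (!is.null(pos_field)) {
    toks <- paste(toks, x[[pos_field]], sep = "/")
  }
  split(toks, x[[docid_field]])
}

# Usage with udpipe-style columns (the POS tag lives in upos):
anno <- data.frame(
  doc_id = c("d1", "d1", "d2"),
  token  = c("La", "casa", "Sejour"),
  upos   = c("DET", "NOUN", "NOUN")
)
df_to_token_list(anno, pos_field = "upos")$d1
#> [1] "La/DET"    "casa/NOUN"
```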
Hi,
Once in a while I get a dataset that is already pre-tokenized (a data frame with columns for tokens and doc_id). Every time that happens I have to search forever to figure out how to coerce that format into something quanteda likes.
Maybe it is already in the docs, but Google fails me when I search for it.
My solution is this one, but I am not sure whether it is the best way: