Can you provide some details please?
What do you mean by "get the urls instead of just the count of them"? Can you give an example?
A character vector of the URLs, like c("https://github.com","https://google.com").
This works already:
library("quanteda")
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")
txt <- "This is some sample text with URLs such as https://github.com and https://google.com."
# correctly counts the URLs
tokens(txt) |>
  textstat_summary()
#>   document chars sents tokens types puncts numbers symbols urls tags emojis
#> 1    text1    NA    NA     12    12      0       0       0    2    0      0
# extract the URLs
tokens(txt) |>
  tokens_select(pattern = "http*") |>
  as.character()
#> [1] "https://github.com" "https://google.com."
Created on 2023-10-05 with reprex v2.0.2
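Note that the second URL in the output above keeps the sentence-final period. A minimal sketch of one way to strip it in base R, assuming the URLs never legitimately end in one of these punctuation characters. The vector is copied from the output above, and urls_clean is just an illustrative name:

```r
# Character vector as returned by the as.character() call above
urls <- c("https://github.com", "https://google.com.")

# Drop trailing sentence punctuation; "/" is deliberately not in the set,
# so a URL ending in a slash is left untouched
urls_clean <- sub("[.,;:!?)]+$", "", urls)

urls_clean
#> [1] "https://github.com" "https://google.com"
```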
The textstat_summary() documentation does not mention URLs.
In addition, it counts repeated URLs.
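If repeated URLs should be counted only once, one option is to deduplicate the extracted character vector in base R. A sketch, using a made-up vector with one repeat (all variable names here are illustrative):

```r
# Hypothetical extraction result in which one URL appears twice
urls <- c("https://github.com", "https://google.com", "https://github.com")

# Distinct URLs, in order of first appearance
unique_urls <- unique(urls)

# Number of distinct URLs versus the raw count that includes repeats
n_unique <- length(unique_urls)
n_total <- length(urls)

# How often each URL occurs
url_counts <- table(urls)
```

Here n_total corresponds to the kind of raw count that includes repeats, while n_unique ignores them.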
Is there a function to get the URLs themselves instead of just the count of them?