quanteda / quanteda.textstats

Textual statistics for quanteda
GNU General Public License v3.0

textstat_summary documentation does not mention URL #57

Closed: ran88dom99 closed this issue 1 year ago

ran88dom99 commented 1 year ago

The textstat_summary documentation does not mention URLs.

In addition, it counts repeated URLs.

Is there a function to get the URLs themselves instead of just a count of them?

kbenoit commented 1 year ago

Can you provide some details please?

What do you mean by "get the urls instead of just the count of them"? Can you give an example?

ran88dom99 commented 1 year ago

A character vector of the URLs, like c("https://github.com", "https://google.com").

kbenoit commented 1 year ago

This works already:

library("quanteda")
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")

txt <- "This is some sample text with URLs such as https://github.com and https://google.com."

# correctly counts the URLs
tokens(txt) |>
    textstat_summary()
#>   document chars sents tokens types puncts numbers symbols urls tags emojis
#> 1    text1    NA    NA     12    12      0       0       0    2    0      0

# extract the URLs: keep only tokens matching the glob pattern "http*"
tokens(txt) |>
    tokens_select(pattern = "http*") |>
    as.character()
#> [1] "https://github.com"  "https://google.com."

Created on 2023-10-05 with reprex v2.0.2
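
Two details in the output above may matter in practice: the second URL keeps the sentence-final period ("https://google.com."), and a URL that appears more than once in a text would be returned more than once. A minimal sketch of one way to tidy the extracted vector, using base R only. The variable names and the repeated URL are made up for illustration, and the set of punctuation to strip is an assumption that should be adjusted to the data:

```r
# Hypothetical input: the vector returned by the tokens_select() call above,
# with one repeated URL added for illustration.
urls <- c("https://github.com", "https://google.com.", "https://github.com")

# Strip trailing sentence punctuation that the tokenizer kept on the URL.
# Assumption: none of these characters is a meaningful final character of a URL here.
urls_clean <- sub("[.,;:!?]+$", "", urls)

# Drop repeats so each URL appears once, in order of first appearance.
urls_unique <- unique(urls_clean)
urls_unique
#> [1] "https://github.com" "https://google.com"
```

If the number of occurrences of each URL is wanted rather than the distinct set, table(urls_clean) gives the counts.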