Any chance we can get parallel processing for n-grams for example? #201

jaymon0703 commented 2 years ago

Thanks for a great package by the way

juliasilge commented 2 years ago

I'm not sure that identifying n-grams is parallelizable in a straightforward way, since we need to slide along the text to find the overlapping tokens. You could do something like this using furrr, if you wanted to find n-grams for separate documents using parallel processing.

library(tidyverse)
library(tidytext)
library(furrr)
#> Loading required package: future

## nest by document
nested_austen <- janeaustenr::austen_books() %>% 
  mutate(title = book) %>%
  nest(data = c(title, text))

nested_austen
#> # A tibble: 6 × 2
#>   book                data                 
#>   <fct>               <list>               
#> 1 Sense & Sensibility <tibble [12,624 × 2]>
#> 2 Pride & Prejudice   <tibble [13,030 × 2]>
#> 3 Mansfield Park      <tibble [15,349 × 2]>
#> 4 Emma                <tibble [16,235 × 2]>
#> 5 Northanger Abbey    <tibble [7,856 × 2]> 
#> 6 Persuasion          <tibble [8,328 × 2]>

plan(multisession, workers = 2)

tokenized <- 
  nested_austen %>%
  mutate(tokens = future_map(
    data,
    ~ unnest_tokens(., bigram, text, collapse = "title", token = "ngrams", n = 2)
  ))

tokenized %>%
  select(tokens) %>%
  unnest(tokens)
#> # A tibble: 725,049 × 2
#>    title               bigram         
#>    <fct>               <chr>          
#>  1 Sense & Sensibility sense and      
#>  2 Sense & Sensibility and sensibility
#>  3 Sense & Sensibility sensibility by 
#>  4 Sense & Sensibility by jane        
#>  5 Sense & Sensibility jane austen    
#>  6 Sense & Sensibility austen 1811    
#>  7 Sense & Sensibility 1811 chapter   
#>  8 Sense & Sensibility chapter 1      
#>  9 Sense & Sensibility 1 the          
#> 10 Sense & Sensibility the family     
#> # … with 725,039 more rows

In the general case, there is a fair amount of complexity in specifying what chunks of text should go to parallel workers.
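To make the chunking problem concrete, here is a minimal base R sketch (not tidytext's implementation). The `bigrams()` helper and the six-word example text are hypothetical, introduced only for illustration. It shows that naively splitting a text into chunks and finding bigrams per chunk drops the bigram that spans the chunk boundary.

```r
# Hypothetical helper (not part of tidytext): adjacent word pairs
# from a character vector of words.
bigrams <- function(words) {
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

words <- c("it", "is", "a", "truth", "universally", "acknowledged")

# Bigrams found by sliding over the whole text.
whole <- bigrams(words)

# Naive split into two chunks of three words each, as if each chunk
# were sent to a separate worker.
chunks <- split(words, rep(1:2, each = 3))
chunked <- unlist(lapply(chunks, bigrams), use.names = FALSE)

# The bigram that spans the chunk boundary is missing from the chunked result.
setdiff(whole, chunked)
#> [1] "a truth"
```

Splitting by document, as in the furrr example, avoids this because no n-gram is expected to cross a document boundary. Splitting within a document would need overlapping chunks plus de-duplication, which is where the complexity comes from.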

jaymon0703 commented 2 years ago

Thanks Julia. You are right. This may be more effort than it is worth. I would appreciate others' thoughts before making a decision on whether or not to close the issue.

juliasilge commented 2 years ago

Let me know if you have further questions!

