juba / rainette

R implementation of the Reinert text clustering method
https://juba.github.io/rainette/

Importing separate text files in iramuteq format #26

Closed: gabrielparriaux closed this issue 1 year ago

gabrielparriaux commented 1 year ago

Hello,

Is it possible to import more than one file in iramuteq format?

Can we pass the path to a directory, instead of a single file, as an argument to the import_corpus_iramuteq() function? Or do we have to work with one single file containing all the content?

I have hundreds of text files and I’m not sure it would be practical to put all that content in one single .txt file…

Thanks a lot for your opinion and your help!

juba commented 1 year ago

Hi,

You can't do that directly with import_corpus_iramuteq, but you can do it with R and quanteda. Here are a few ways to do it:

library(tidyverse)
library(rainette)

# List of files to be imported
# (pattern is a regular expression, not a shell glob)
files <- list.files("/tmp/test", pattern = "\\.txt$", full.names = TRUE)

# Method 1: Read all files and concatenate them before importing
txt <- files |>
    map_chr(read_file) |>
    paste(collapse = "\n\n")
import_corpus_iramuteq(textConnection(txt))

# Method 2: Import all files as a list of corpora and combine them
corpora <- files |>
    map(import_corpus_iramuteq)
do.call(c, corpora)

# Method 3: A slightly more involved method that avoids the potential error
# "Cannot combine corpora with duplicated document names" in method 2
# by prefixing each document name with the index of its file (.y)
corpora <- files |>
    imap(~{
        corpus <- import_corpus_iramuteq(.x)
        names(corpus) <- paste0(.y, "_", names(corpus))
        corpus
    })
do.call(c, corpora)
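
As a variation on method 3, the prefix can be the file's base name instead of its index, so that document names stay traceable to their source file. This is only a minimal sketch: the file paths are hypothetical, and a small quanteda corpus stands in for the import_corpus_iramuteq(.x) call so that the example runs without any files on disk.

```r
library(purrr)
library(quanteda)

# Hypothetical file paths, used only for illustration
files <- c("/tmp/test/interview_01.txt", "/tmp/test/interview_02.txt")

# Name each path by its base name without extension,
# so that imap() passes that name as .y
named_files <- set_names(files, tools::file_path_sans_ext(basename(files)))

corpora <- imap(named_files, ~{
    # Stand-in for import_corpus_iramuteq(.x): a two-document corpus
    # with the default document names
    corp <- corpus(c("First segment.", "Second segment."))
    docnames(corp) <- paste0(.y, "_", docnames(corp))
    corp
})

# Drop the list names before combining so c() receives unnamed arguments
combined <- do.call(c, unname(corpora))
docnames(combined)
```

In real use, the stand-in line would be replaced by the actual import call on .x.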

Let me know if it is not working for you.

gabrielparriaux commented 1 year ago

Thanks a lot, everything works as you describe: methods 1 and 3 work perfectly. Method 2 gives me the error "Error: Cannot combine corpora with duplicated document names". That's quite strange, because I’m sure I don't have duplicated document names in my corpus… Anyway, I could import all my files properly, thanks a lot!

juba commented 1 year ago

Glad it's working! The duplicated names error occurs because, by default (if you don't supply an id_var argument to import_corpus_iramuteq), the documents in every imported corpus are given the same default names text1, text2...
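
The collision can be reproduced with quanteda alone. The sketch below uses two small hand-built corpora as stand-ins for two imported files (the texts are made up for illustration) and shows that both carry the same default names, which is what makes c() refuse to combine them.

```r
library(quanteda)

# Two corpora built without explicit document names,
# standing in for two files imported without id_var
corp_a <- corpus(c("First text of file A.", "Second text of file A."))
corp_b <- corpus(c("First text of file B.", "Second text of file B."))

# Names shared by both corpora
shared <- intersect(docnames(corp_a), docnames(corp_b))
print(shared)

# Combining them as-is raises an error; capture its message instead of stopping
msg <- tryCatch(
    {
        c(corp_a, corp_b)
        "no error"
    },
    error = function(e) conditionMessage(e)
)
print(msg)
```

Checking intersect() on the document names before combining is one way to spot the problem early.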