juba / rainette

R implementation of the Reinert text clustering method
https://juba.github.io/rainette/

How to deal with a character (chr) object in memory #2

Closed wilcar closed 4 years ago

wilcar commented 4 years ago

First stage: I am working with Gallica OCR output and importing raw text from URLs (I don't want to work with txt files).

    library(htm2txt) # a useful package to import raw text from an HTML page
    url <- 'https://gallica.bnf.fr/ark:/12148/bpt6k567105k.texteBrut'
    text <- gettxt(url)  # function from the htm2txt package

Second stage: adding the top and bottom star lines (Iramuteq syntax).

    stars <- "****"
    text2 <- paste(stars, text, stars, sep = "\n")
    cat(text2)

Third stage, which fails with an error:

    library(rainette)
    library(quanteda)
    corpus <- import_corpus_iramuteq(text2)

    Error in readLines(f) : cannot open the connection

I also tried calling split_segments directly:

    corpus <- split_segments(text2, segment_size = 40)

but I get this error when calling the dfm function:

    dtm <- dfm(corpus, remove = stopwords("en"), tolower = TRUE, remove_punct = TRUE)
    dtm <- dfm_wordstem(dtm, language = "english")
    dtm <- dfm_trim(dtm, min_termfreq = 3)

    Error in dfm.default(corpus, remove = stopwords("en"), tolower = TRUE, : dfm() only works on character, corpus, dfm, tokens objects.

Thank you for your help (in French if you prefer).

juba commented 4 years ago

Yep, text2 is a character vector, not a connection. Here is how to do it:

    library(rainette)
    texte <- "****\nBah blah blah\n****\nFoo bar"
    con <- textConnection(texte)
    corpus <- import_corpus_iramuteq(con)
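
To illustrate why the connection matters, here is a minimal base R sketch that does not need rainette. It assumes, based on the `Error in readLines(f)` message above, that the import function ends up calling `readLines()` on its argument. The variable names `direct` and `lines` are only illustrative.

```r
# A plain string passed to readLines() is interpreted as a file path.
texte <- "****\nBlah blah blah\n****\nFoo bar"

# Reading the string directly fails: no file with that name exists.
# The error message is captured instead of stopping the script.
direct <- tryCatch(
  suppressWarnings(readLines(texte)),
  error = function(e) conditionMessage(e)
)

# Wrapping the string in a text connection lets readLines() read it
# line by line, as if it came from a file.
con <- textConnection(texte)
lines <- readLines(con)
close(con)

print(direct)
print(lines)
```

The same wrapping should apply to text fetched from a URL: build the string in memory, then pass `textConnection(text2)` rather than `text2` itself.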

wilcar commented 4 years ago

That's exactly it! I didn't know about the textConnection function!