unDocUMeantIt / koRpus

An R Package for Text Analysis
GNU General Public License v3.0

Moving from tm object to koRpus object and vice versa #6

Closed giorjet closed 7 years ago

giorjet commented 7 years ago

I have a problem moving from a tm object to a koRpus object. I have to normalize a corpus with tm tools, lemmatize the results with koRpus, and return to tm to categorize the results. To do this I have to convert the tm object into an R data frame, export that to an Excel file, save it as a text file, and finally import it as a koRpus object. This is the code:

#from VCORPUS to DATAFRAME 
dataframeD610P<-data.frame(text=unlist(sapply(Corpus.TotPOS, `[`, "content")), stringsAsFactors=F)

#from DATAFRAME to XLSX 
#library(xlsx)
write.xlsx(dataframeD610P$text, ".\\mycorpus.xlsx")

#open with excel 
#save in csv (UTF-8)

#import in KORPUS and lemmatization with KORPUS/TREETAGGER 

tagged.results <- treetag(".\\mycorpus.csv", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
                          TT.options=list(path="C:/TreeTagger", preset="it-utf8", no.unknown=T)) 

Then I need to do it all backwards to get back to tm. This is the code:

#from KORPUS to TXT 
write.table(tagged.results@TT.res$lemma, ".\\mycorpusLEMMATIZED.txt")

#open with a text editor and formatting of the text

#from TXT to R
Lemma1.POS<- readLines(".\\mycorpusLEMMATIZEDfrasi.txt", encoding = "UTF-8")

#from R object to DATAFRAME
Lemma2.POS<-as.data.frame(Lemma1.POS, encoding = "UTF-8")

#from DATAFRAME to CORPUS
CorpusPOSlemmaFINAL = Corpus(VectorSource(Lemma2.POS$Lemma1.POS))

Is there a more elegant solution to do this without leaving R? I’d really appreciate any help or feedback.

unDocUMeantIt commented 7 years ago

i have started working on a compatibility package: https://github.com/unDocUMeantIt/tm.plugin.koRpus/tree/develop

the actual migration between koRpus and tm objects is not well tested at the moment, i myself am using the package mostly to call koRpus methods on full corpora instead of single texts. but i think that package would be a good place to start. feel free to report issues and feature requests. i can't promise anything, especially in the near future, but i'll sure try. koRpus and tm both have a totally different philosophy with regards to text/object handling, and different technical solutions as well (S4 vs. S3), so it's not really a trivial task getting them to communicate with each other.

you will have to update koRpus to a more recent version (>= 0.07-1) to be able to use it. but i recommend that anyway, because there are tons of improvements (i haven't had the time to go through the CRAN release procedure yet, but you can find up-to-date releases in my own repository: https://reaktanz.de/R/ )

giorjet commented 7 years ago

Thank you so much. I'll definitely try it.

giorjet commented 7 years ago

Hi, I don't understand how to import from a tm corpus and how to export into a tm corpus... Any syntax suggestion? Thank you

unDocUMeantIt commented 7 years ago

there currently is a stub function called kRpSource() that was supposed to turn a koRpus text object into a tm Source object. however, kRpSource() seems to be defunct for the time being. but you can use the following to achieve something similar:

kRp2VCorpus <- function(obj){
  thisText <- VCorpus(
    VectorSource(
      kRp.text.paste(obj)
    ),
    readerControl=list(language=language(obj))
  )
  return(thisText)
}

# then use the function like this on a tagged text object:
tmCorpusObject <- kRp2VCorpus(koRpusTaggedTextObject)

for the other way around, you could try to use the text "content" of tm corpus objects, e.g. treetag(content(tmCorpusObject[["1"]]), format="obj"). in the mid term, i'm planning to write a wrapper that does this internally so you can use tm methods on koRpus objects intuitively.

giorjet commented 7 years ago

Great! kRp2VCorpus works. Thanks so much. But now I have another problem. I tried:

tmCorpusObject1<-treetag(content(tmCorpusObject0[["1"]]), format="obj", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
                         TT.options=list(path="C:/TreeTagger", preset="it", no.unknown=T))

but this is the answer:

Error in paste(TT.splitter, "perl ", TT.tokenizer, TT.tknz.opts, TT.call.file,  : 
  object 'TT.call.file' not found

and now my syntax doesn't work also using as input a csv file:

tagged.korpus <- treetag(".\\TotPOS16.csv", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
                              TT.options=list(path="C:/TreeTagger", preset="it-utf8", no.unknown=T)) 

Are there any syntax changes with the 0.07-2 update of koRpus? (With version 0.06-5 it worked.)

Thank You

giorjet commented 7 years ago

...another thing.. the function:

kRp2VCorpus <- function(obj){
  thisText <- VCorpus(
    VectorSource(
      kRp.text.paste(obj)
    ),
    readerControl=list(language=language(obj))
  )
  return(thisText)
}

tmCorpusObject.TotPOS16 <- kRp2VCorpus(tagged.TotPOS16results)

works very well, but it removes some blanks before or after punctuation or other characters like "-", which causes problems for my analysis. Do you have any solutions? Thank you

unDocUMeantIt commented 7 years ago

right, there's a bug that was introduced with 0.07-1 only to the windows version of koRpus. it slipped through with changes needed to support portuguese, was discovered in january and is fixed in the develop branch.

i'll release a fixed version 0.10-1 as soon as i get roxygen2 running again (i have problems with roxygen2 6.0.1). see here how you can install the develop version directly from github: https://github.com/unDocUMeantIt/koRpus/tree/develop#installation-via-github

i'm sorry for all the trouble -- i don't use windows and most windows users only run the CRAN versions of the package, OS specific bugs are sometimes hard to see.

unDocUMeantIt commented 7 years ago

the main problem with regards to kRp.text.paste() is this: when you give a text to TreeTagger, what you get back is a table with three columns, where the first column is the vector of all tokens in the original text. during this step, you lose information about spaces, paragraphs etc. -- have a look at taggedText(tagged.TotPOS16results).

kRp.text.paste() tries to recreate the original text from that vector of tokens, which of course can't be perfect because it doesn't know how many spaces there were. i have not yet found a better solution for this.
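To illustrate why the original spacing cannot be recovered from a token vector, here is a minimal base R sketch. The tokenizer below is a crude stand-in written for this example (it is not TreeTagger's algorithm), and both sample texts are made up.

```r
# Two hypothetical source texts that differ only in spacing.
text_a <- "ottimo rapporto qualita-prezzo, consigliato!"
text_b <- "ottimo  rapporto qualita-prezzo , consigliato !"

# Crude stand-in for a tokenizer (NOT TreeTagger's algorithm): put spaces
# around "," and "!", then split on runs of whitespace.
crude_tokens <- function(txt) {
  spaced <- gsub("([,!])", " \\1 ", txt)
  strsplit(trimws(spaced), "[[:space:]]+")[[1]]
}

tokens_a <- crude_tokens(text_a)
tokens_b <- crude_tokens(text_b)

# Both texts collapse to the same token vector, so the spacing is gone.
identical(tokens_a, tokens_b)

# Any reconstruction has to guess; joining with single spaces is one guess,
# and it matches neither of the two originals exactly.
rebuilt <- paste(tokens_a, collapse = " ")
rebuilt
```

Since two different inputs map to the same tokens, no function working on the tokens alone can tell which spacing was the original one.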

giorjet commented 7 years ago

Thank you, I understand. Now kRp2VCorpus works! I just have a problem due to my inexperience with R code: with

tmCorpusObject0<-treetag(content(Corpus.TotPOS16[["1"]]), format="obj", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
                         TT.options=list(path="C:/TreeTagger", preset="it", no.unknown=T))

only the first document gets tagged. How can I change (Corpus.TotPOS16[["1"]]) so that all documents in the corpus are tagged?

As for the problem with spaces, it would be enough for me to put a space between each lemma (including punctuation) and leave expressions such as "value-for-money" intact, without any break between the words. But if that's impossible I will continue to do it with a text editor.

giorjet commented 7 years ago

I noticed another problem with treetag(content(Corpus.TotPOS16[["1"]])... It does not seem to handle the Italian accented characters (à, è, ...), which become something like "rapporto-qualit�-prezzo".
If I use treetag(".\\TotPOS16.csv",... the problem does not occur.

Thanks in advance

unDocUMeantIt commented 7 years ago

have you tried looping through the tm object with lapply()? that should get you a list of results, e.g.

myList <- lapply(Corpus.TotPOS16, function(x){
  return(treetag(content(x), format="obj"))
})
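The looping idea can be tried out without tm or TreeTagger installed. The following is only a sketch on plain R data: `docs`, `fake_tagger()` and `tag_all()` are hypothetical names introduced for this example, and the real tagging call would take the place of the placeholder function.

```r
# Stand-in for a tm corpus: a named list of document texts (hypothetical data).
docs <- list("1" = "primo documento .", "2" = "secondo documento .")

# Placeholder for the real tagging call; in this thread it would be
# something like function(txt) treetag(txt, format = "obj", ...).
fake_tagger <- function(txt) {
  data.frame(token = strsplit(txt, " ", fixed = TRUE)[[1]],
             stringsAsFactors = FALSE)
}

# Apply the tagger to every document. tryCatch() keeps one failing
# document from aborting the whole run and stores the error instead.
tag_all <- function(docs, tagger) {
  lapply(docs, function(txt) {
    tryCatch(tagger(txt), error = function(e) e)
  })
}

results <- tag_all(docs, fake_tagger)
names(results)          # document ids are preserved by lapply()
nrow(results[["2"]])    # number of tokens in the second document
```

Keeping the names means each result can be traced back to its document, and storing errors rather than stopping makes it easier to see which documents cause trouble (for example, ones with encoding problems).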

as for the encoding issue, you will have to try to find the exact step where special characters are being messed up.

giorjet commented 7 years ago

Thank you for the tip, I'll give it a try. Re the encoding issue, it happens when I use "treetag" command with a corpus object (not with a csv file).

unDocUMeantIt commented 7 years ago

> it happens when I use "treetag" command with a corpus object (not with a csv file).

yes, but the question remains when exactly the character errors occur. e.g., are the characters already corrupted in the tm corpus? if so, what about the material used to make that object? and so on. at some point, things go wrong. we must find that specific point first, or we have little chance of fixing it.

giorjet commented 7 years ago

OK, now I understand. The characters in tm were fine. In fact, the csv file was produced from the tm corpus object with:

#from VCORPUS to DATAFRAME 
dataframeD610P<-data.frame(text=unlist(sapply(Corpus.TotPOS, `[`, "content")), stringsAsFactors=F)

#from DATAFRAME to XLSX 
#library(xlsx)
write.xlsx(dataframeD610P$text, ".\\mycorpus.xlsx")

#open with excel 
#save in csv (UTF-8)

#import in KORPUS and lemmatization with KORPUS/TREETAGGER 

tagged.results <- treetag(".\\mycorpus.csv", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
                          TT.options=list(path="C:/TreeTagger", preset="it-utf8", no.unknown=T))

but if I use directly treetag with tm object

tmCorpusObject0<-treetag(content(Corpus.TotPOS16[["1"]]), format="obj", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
                         TT.options=list(path="C:/TreeTagger", preset="it", no.unknown=T))

the problem occurs

unDocUMeantIt commented 7 years ago

ok, i then suspect the internal workflow of treetag() to be the reason for the character glitches. a problem the function has to deal with is that TreeTagger can't use R character vectors directly. it needs a file to do the analysis. therefore what treetag(..., format="obj") does is first write the text to a temporary file, let TreeTagger analyse the file, and remove the temp file again. the "write text to file" part could be the problem here, if input and output encoding don't match.

does it change anything if you use enc2utf8(content(Corpus.TotPOS16[["1"]])) instead of just content(Corpus.TotPOS16[["1"]]), to force the text input into UTF-8?
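The "write text to file" step can be reproduced in isolation to check whether the encoding survives a round trip. This is a base R sketch with made-up sample text; it does not show how treetag() writes its temporary file internally.

```r
# Hypothetical sample text with accented characters.
txt <- c("qualit\u00e0 scarsa", "perch\u00e9 no")

tmp <- tempfile(fileext = ".txt")

# Write the strings as UTF-8 bytes regardless of the session's native
# encoding: enc2utf8() converts them, and useBytes = TRUE stops
# writeLines() from re-encoding them on the way out.
con <- file(tmp, open = "wb")
writeLines(enc2utf8(txt), con, useBytes = TRUE)
close(con)

# Read the file back, declaring its encoding explicitly.
back <- readLines(tmp, encoding = "UTF-8")
unlink(tmp)

all(back == txt)
```

If the comparison comes out FALSE on a given machine, the damage happens while writing or reading the file and not in the tagger.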

giorjet commented 7 years ago

no changes :( ...

tmCorpusObject0@TT.res$lemma
[1] "qualit�"      "scarso"         "qualit�"      "disinteressare" "pericoloso"     "."

unDocUMeantIt commented 7 years ago

i've changed the way temp files are written a bit in the develop branch. could you please try the following:

  1. with your current installation, does it help explicitly using treetag(..., encoding="UTF-8")? it shouldn't have that effect, but i want to make sure that is the case.
  2. install the current develop version: devtools::install_github("unDocUMeantIt/koRpus", ref="develop") (restart R afterwards to ensure you're using the new version)
  3. try with the new treetag(), both with encoding="UTF-8" and without.

does this at least change anything, if not fix it?

what i've tried here is to force writing the temporary files with UTF-8 encoding if no other encoding is set. so using encoding="UTF-8" shouldn't really have an effect (but should you see different results, i'll have to check the code again...).

you could then also set debug=TRUE, which prevents the tempfile from being deleted automatically, so you can inspect it -- is it UTF-8 what you find in that file?

giorjet commented 7 years ago

with the standard version of koRpus, adding encoding="UTF-8":

tmCorpusObject0<-treetag(content(Corpus.TotPOS16[["2355"]]), format="obj", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"), encoding="UTF-8",
                         TT.options=list(path="C:/TreeTagger", preset="it", no.unknown=T))

doesn't work and results in this error:

Error in nchar(txt) : invalid multibyte string, element 1

with the dev version, adding encoding="UTF-8" works and it seems to recognize accented letters:

tmCorpusObject0@TT.res$lemma
  [1] "spesso"       "alcuni"       "del"          "prodotto"     "migliore"     "non"          "venire"       "più"         "riassortiti" 
 [10] "e"            "si"           "faticare"     "a"            "trovare"      "di"           "simile"       "per"          "colore"      
 [19] "e"            "o"            "qualità "     ","            "alcun"        "colore"       "vistare"      "da"           "catalogo"    
 [28] "differire"    "dal"          "prodotto"     "reale"        ","            "a"            "volta"        "per"          "la"          
 [37] "non"          "curanza"      "del"          "imballaggio"  "e"            "o"            "del"          "corriere"     "arrivare"    
 [46] "prodotto"     "con"          "la"           "scatola"      "rovinare"     "e"            "se"           "essere"       "regale"      
 [55] "per"          "altro"        "persona"      "non"          "essere"       "molto"        "presentabile" ","            "parlare"     
 [64] "anche"        "del"          "prodotto"     "mancare"      "che"          "a"            "volta"        "non"          "arrivare"    
 [73] "perché"      "esaurito"     "o"            "arrivare"     "in"           "un"           "secondo"      "momento"      "perché"     
 [82] "al"           "momento"      "non"          "disponbili"   "in"           "magazzino"    "se"           "servire"      "con"         
 [91] "urgenza"      "bisgona"      "sempre"       "preparare"    "un"           "piano"        "b"            "."            "."           
[100] "INTERRUPTw"   "."           
>

the accented letters are reprinted with combinations of characters but they should be right

Ã¹ = ù
Ã  = à
Ã© = é

(However, the result is the same even if not added encoding="UTF-8")

but now the function

kRp2VCorpus <- function(obj){
  thisText <- VCorpus(
    VectorSource(
      kRp.text.paste(obj)
    ),
    readerControl=list(language=language(obj))
  )
  return(thisText)
}

# then use the function like this on a tagged text object:
tmCorpusObject1 <- kRp2VCorpus(tmCorpusObject0)

does not return the lemma but the token

lapply(tmCorpusObject1[1], as.character)
$`1`
[1] "spesso alcuni dei prodotti migliori non vengono più riassortiti e si fatica a trovarne di simili per colore e o qualità , alcuni colori visti da catalogo differiscono dal prodotto reale, a volte per la non curanza degli imballaggi e o del corriere arrivano prodotti con le scatole rovinate e se sono regali per altre persone non è molto presentabile, parlando anche dei prodotti mancanti che a volte non arrivano perché esauriti o arrivano in un secondo momento perché al momento non disponbili in magazzino se servono con urgenza bisgona sempre prepararsi un piano b. . INTERRUPTw. "

unDocUMeantIt commented 7 years ago

> the accented letters are reprinted with combinations of characters but they should be right

does this mean they look funny here on gitHub, or even in your R session? if R doesn't show them correctly, i'm afraid i'm not finished fixing this ;-) could be i've now fixed the output file, but that on windows, getting the tagged input back into koRpus is still broken.
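One accented letter showing up as a pair of characters is the typical symptom of UTF-8 bytes being decoded as Latin-1. Whether that is what happens when the tagged output is read back on Windows is not confirmed in this thread; the base R sketch below only reproduces the symptom and shows one way to reverse it, using a made-up sample word.

```r
x <- "qualit\u00e0"                      # correct string, marked as UTF-8

# Simulate the mistake: take the UTF-8 bytes and decode them as Latin-1.
raw_bytes <- charToRaw(enc2utf8(x))
garbled <- iconv(rawToChar(raw_bytes), from = "latin1", to = "UTF-8")
# 'garbled' now ends in two characters (U+00C3, U+00A0) instead of one.

# Undo it: encode back to Latin-1 bytes, then declare those bytes as UTF-8.
repaired <- iconv(garbled, from = "UTF-8", to = "latin1")
Encoding(repaired) <- "UTF-8"

nchar(x)         # characters in the correct string
nchar(garbled)   # one more, because the accented letter became two
repaired == x
```

The reversal only works cleanly if the wrong decoding really was Latin-1. Windows sessions often use a CP1252 code page instead, which differs for some bytes, so the `from`/`to` arguments would be an assumption to verify on the affected machine.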

> (However, the result is the same even if not added encoding="UTF-8")

yes, that's the way it should be.

> but now the function [...] does not return the lemma but the token

hm, i suppose it always has. because kRp.text.paste() always returns tokens (and i haven't touched that function or any object classes). if you only want the lemmata back, you could replace kRp.text.paste() with something like taggedText(obj)[["lemma"]] or paste(taggedText(obj)[["lemma"]]).

giorjet commented 7 years ago

> the accented letters are reprinted with combinations of characters but they should be right

even in my R session.. but I think the accents have been kept because if I transform the kRp.tagged object into a txt file: write.table(tmCorpusObject0@TT.res$lemma, ".\\tmCorpusObject.txt") I get:

"x"
"1" "spesso"
"2" "alcuni"
"3" "del"
"4" "prodotto"
"5" "migliore"
"6" "non"
"7" "venire"
"8" "più"
"9" "riassortiti"
"10" "e"
"11" "si"
"12" "faticare"
"13" "a"
"14" "trovare"
"15" "di"
"16" "simile"
"17" "per"
"18" "colore"
"19" "e"
"20" "o"
"21" "qualità"

with the right accented letters

> but now the function [...] does not return the lemma but the token
>
> hm, i suppose it always has

You are right. and now with:

kRp3VCorpus <- function(obj){
  thisText <- VCorpus(
    VectorSource(
      paste(taggedText(obj)[["lemma"]])
    ),
    readerControl=list(language=language(obj))
  )
  return(thisText)
}

I have the lemmas, but still with combinations of characters in place of accented letters, and every token is a document. (Is it possible to separate the sentences, given that I added the word "INTERRUPTw" at the end of each sentence?)
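A likely reason why every token ends up as its own document is that paste() without a collapse argument returns one string per element, and tm's VectorSource() treats each element of a vector as a separate document. A minimal base R sketch, with a made-up lemma vector standing in for taggedText(obj)[["lemma"]]:

```r
# Hypothetical lemma vector, shaped like taggedText(obj)[["lemma"]].
lemma <- c("spesso", "alcuni", "del", "prodotto",
           "rapporto-qualit\u00e0-prezzo", ".")

# paste() without 'collapse' returns one string per element, so a
# vector source built from it would contain six separate documents.
length(paste(lemma))

# With collapse = " " the lemmas are joined into a single string:
# one space between all lemmas (punctuation included), and a hyphenated
# expression stays in one piece as long as it is a single element.
one_doc <- paste(lemma, collapse = " ")
length(one_doc)
one_doc
```

Using paste(taggedText(obj)[["lemma"]], collapse = " ") inside the wrapper function should therefore give one document per tagged text, under the assumption that the rest of the function stays as it is.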

unDocUMeantIt commented 7 years ago

sorry i didn't reply earlier!

when you're using tokenize() or treetag(), you shouldn't have to mark sentences manually. you can use the POS tags indicating sentence ending punctuation for that (try kRp.POS.tags("it", tags="sentc") or kRp.POS.tags("it", tags="sentc", list.tags=TRUE) to get the tags you need for this). adding your own token for that will probably only invalidate all statistics for the text, because it is counted as a word belonging to the next sentence.
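Splitting on sentence-ending tags can be sketched in base R as follows. The table is made up; its column names mirror those of taggedText(), and "SENT" is assumed here to be the sentence-ending tag (the actual tags should be taken from the kRp.POS.tags() call above).

```r
# Hypothetical tagged-text table; tags are illustrative only.
tagged <- data.frame(
  token = c("Buono", "prodotto", ".", "Arriva", "tardi", "!"),
  tag   = c("ADJ", "NOM", "SENT", "VER:pres", "ADV", "SENT"),
  lemma = c("buono", "prodotto", ".", "arrivare", "tardi", "!"),
  stringsAsFactors = FALSE
)

# Assumed sentence-ending tag(s); replace with the real ones.
sentence_end <- tagged$tag %in% "SENT"

# Sentence id: count the sentence ends seen *before* each token, so the
# closing punctuation stays with its own sentence.
sent_id <- cumsum(c(FALSE, head(sentence_end, -1))) + 1

# Join the lemmas of each sentence with single spaces.
sentences <- unname(vapply(
  split(tagged$lemma, sent_id),
  paste, character(1), collapse = " "
))
sentences
```

The resulting character vector has one element per sentence, so no manual marker token is needed in the text itself.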

but this seems to be a different issue than the one this started off with. can we close this ticket?

giorjet commented 7 years ago

Yes of course. Thanks