i have started working on a compatibility package: https://github.com/unDocUMeantIt/tm.plugin.koRpus/tree/develop
the actual migration between koRpus and tm objects is not well tested at the moment; i myself am using the package mostly to call koRpus methods on full corpora instead of single texts. but i think that package would be a good place to start. feel free to report issues and feature requests. i can't promise anything, especially in the near future, but i'll sure try. koRpus and tm both have a totally different philosophy with regards to text/object handling, and different technical solutions as well (S4 vs. S3), so it's not really a trivial task getting them to communicate with each other.
you will have to update koRpus to a more recent version (>= 0.07-1) to be able to use it. but i recommend that anyway, because there are tons of improvements (i haven't had the time to go through the CRAN release procedure yet, but you can find up-to-date releases in my own repository: https://reaktanz.de/R/ )
Thank you so much. I'll definitely try it.
Hi, I don't understand how to import from a tm corpus and how to export into a tm corpus... Any syntax suggestion? Thank you
there currently is a stub function called kRpSource() that was supposed to turn a koRpus text object into a tm Source object. however, kRpSource() seems to be defunct for the time being. but you can use the following to achieve something similar:
kRp2VCorpus <- function(obj){
  thisText <- VCorpus(
    VectorSource(
      kRp.text.paste(obj)
    ),
    readerControl=list(language=language(obj))
  )
  return(thisText)
}
# then use the function like this on a tagged text object:
tmCorpusObject <- kRp2VCorpus(koRpusTaggedTextObject)
for the other way around, you could try to use the text "content" of tm corpus objects, e.g. treetag(content(tmCorpusObject[["1"]]), format="obj"). in the mid term, i'm planning to write a wrapper that does this internally so you can use tm methods on koRpus objects intuitively.
Great! kRp2VCorpus works.
Thanks so much
But now I have another problem:
I tried:
tmCorpusObject1<-treetag(content(tmCorpusObject0[["1"]]), format="obj", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
TT.options=list(path="C:/TreeTagger", preset="it", no.unknown=T))
but this is the error I get:
Error in paste(TT.splitter, "perl ", TT.tokenizer, TT.tknz.opts, TT.call.file, :
object 'TT.call.file' not found
and now my syntax no longer works even when using a csv file as input:
tagged.korpus <- treetag(".\\TotPOS16.csv", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
TT.options=list(path="C:/TreeTagger", preset="it-utf8", no.unknown=T))
Are there any syntax changes with the 0.07-2 update of koRpus? (with version 0.06-5 it worked)
Thank You
...another thing.. the function:
kRp2VCorpus <- function(obj){
  thisText <- VCorpus(
    VectorSource(
      kRp.text.paste(obj)
    ),
    readerControl=list(language=language(obj))
  )
  return(thisText)
}
tmCorpusObject.TotPOS16 <- kRp2VCorpus(tagged.TotPOS16results)
works very well, but it eliminates some blanks before or after punctuation and other characters like "-", which causes some problems for my analysis. Do you have any solutions? Thank you
right, there's a bug that was introduced with 0.07-1, only in the windows version of koRpus. it slipped through with changes needed to support portuguese, was discovered in january and is fixed in the develop branch.
i'll release a fixed version 0.10-1 as soon as i get roxygen2 running again (i have problems with roxygen2 6.0.1). see here how you can install the develop version directly from github: https://github.com/unDocUMeantIt/koRpus/tree/develop#installation-via-github
i'm sorry for all the trouble -- i don't use windows, and since most windows users only run the CRAN versions of the package, OS-specific bugs are sometimes hard to see.
the main problem with regards to kRp.text.paste() is this: when you give a text to TreeTagger, what you get back is a table with three columns, where the first column is the vector of all tokens in the original text. during this step, you lose information about spaces, paragraphs etc. -- have a look at taggedText(tagged.TotPOS16results).
kRp.text.paste() tries to recreate the original text from that vector of tokens, which of course can't be perfect because it doesn't know how many spaces there were. i have not yet found a better solution for this.
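for illustration, something along these lines (untested; the column names are an assumption based on current versions):
head(taggedText(tagged.TotPOS16results)[, c("token", "tag", "lemma")])  # one row per token, no whitespace information left
paste(taggedText(tagged.TotPOS16results)[["token"]], collapse=" ")  # a naive re-join; kRp.text.paste() is a bit smarter about punctuation, but the principle is the same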
Thank you, I understand.
Now kRp2VCorpus works!
I have just a problem due to my inexperience with R code:
with
tmCorpusObject0<-treetag(content(Corpus.TotPOS16[["1"]]), format="obj", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
TT.options=list(path="C:/TreeTagger", preset="it", no.unknown=T))
I have only the first document tagged.
how can I change (Corpus.TotPOS16[["1"]]) to have all the corpus documents tagged?
As for the problem of spaces, for me it would be enough to put a space between each lemma (including punctuation) and leave expressions such as "value-for-money" intact, without any break between the words. But if that's impossible I will continue to do it with a text editor.
I noticed another problem with
treetag(content(Corpus.TotPOS16[["1"]])...
It does not seem able to handle the Italian accented characters (à, è, ...), which become something like "rapporto-qualit�-prezzo"
If I use
treetag(".\\TotPOS16.csv",...
the problem does not exist
Thanks in advance
have you tried looping through the tm object with lapply()? that should get you a list of results, e.g.
myList <- lapply(Corpus.TotPOS16, function(x){
  return(treetag(content(x), format="obj"))
})
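note: untested, but for your setup the treetag() call inside the function will presumably need the same manual TreeTagger options as your single-document call, i.e. something along the lines of
treetag(content(x), format="obj", treetagger="manual", lang="it", sentc.end=c(".", "!", "?", ";", ":"), TT.options=list(path="C:/TreeTagger", preset="it", no.unknown=TRUE))
each element of myList should then be a regular tagged object.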
as for the encoding issue, you will have to try to find the exact step where special characters are being messed up.
Thank you for the tip, I'll give it a try. Re the encoding issue, it happens when I use "treetag" command with a corpus object (not with a csv file).
> it happens when I use "treetag" command with a corpus object (not with a csv file).
yes, but the question remains when exactly the character errors occur. e.g., are the characters already corrupted in the tm corpus? if so, what about the material used to make that object? and so on. at some point, things go wrong. we must find that specific point first, or we have little chance of fixing it.
Ok, now I understand.
The characters in tm were ok. Indeed, the csv file was produced starting from the tm corpus object with:
# from VCorpus to data.frame
dataframeD610P <- data.frame(text=unlist(sapply(Corpus.TotPOS, `[`, "content")), stringsAsFactors=F)
# from data.frame to xlsx
library(xlsx)
write.xlsx(dataframeD610P$text, ".\\mycorpus.xlsx")
# open with Excel
# save as csv (UTF-8)
# import into koRpus and lemmatize with koRpus/TreeTagger
tagged.results <- treetag(".\\mycorpus.csv", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
  TT.options=list(path="C:/TreeTagger", preset="it-utf8", no.unknown=T))
but if I use treetag directly with the tm object
tmCorpusObject0<-treetag(content(Corpus.TotPOS16[["1"]]), format="obj", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"),
TT.options=list(path="C:/TreeTagger", preset="it", no.unknown=T))
the problem occurs
ok, i then suspect the internal workflow of treetag() to be the reason for the character glitches. a problem the function has to deal with is that TreeTagger can't use R character vectors directly, it needs a file to do the analysis. therefore, what treetag(..., format="obj") does is first write the text to a temporary file, let TreeTagger analyse the file, and remove the temp file again. the "write text to file" part could be the problem here, if input and output encoding don't match.
does it change anything if you use enc2utf8(content(Corpus.TotPOS16[["1"]])) instead of just content(Corpus.TotPOS16[["1"]]), to force the text input into UTF-8?
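i.e. your call from above, just with the input forced to UTF-8 first (untested, just a guess at where the problem could be):
tmCorpusObject0 <- treetag(enc2utf8(content(Corpus.TotPOS16[["1"]])), format="obj", treetagger="manual", lang="it",
  sentc.end = c(".", "!", "?", ";", ":"),
  TT.options=list(path="C:/TreeTagger", preset="it", no.unknown=TRUE))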
no changes :( ...
tmCorpusObject0@TT.res$lemma
[1] "qualit�" "scarso" "qualit�" "disinteressare" "pericoloso" "."
i've changed the way temp files are written a bit in the develop branch. could you please try the following:
1. devtools::install_github("unDocUMeantIt/koRpus", ref="develop") (restart R afterwards to ensure you're using the new version)
2. re-run treetag(), both with encoding="UTF-8" and without.
does this at least change anything, if not fix it?
what i've tried here is to force writing the temporary files with UTF-8 encoding if no other encoding is set, so treetag(..., encoding="UTF-8") shouldn't really have an effect -- but i want to make sure that is the case (should you see different results, i'll have to check the code again...).
you could then also set debug=TRUE, which prevents the tempfile from being deleted automatically, so you can inspect it -- is it UTF-8 what you find in that file?
with the standard version of koRpus, the addition of encoding="UTF-8":
tmCorpusObject0<-treetag(content(Corpus.TotPOS16[["2355"]]), format="obj", treetagger="manual", lang="it", sentc.end = c(".", "!", "?", ";", ":"), encoding="UTF-8",
TT.options=list(path="C:/TreeTagger", preset="it", no.unknown=T))
doesn't work, resulting in this error:
Error in nchar(txt) : invalid multibyte string, element 1
with the dev version, the addition of encoding="UTF-8" works and it seems to recognize accented letters:
tmCorpusObject0@TT.res$lemma
[1] "spesso" "alcuni" "del" "prodotto" "migliore" "non" "venire" "più" "riassortiti"
[10] "e" "si" "faticare" "a" "trovare" "di" "simile" "per" "colore"
[19] "e" "o" "qualità " "," "alcun" "colore" "vistare" "da" "catalogo"
[28] "differire" "dal" "prodotto" "reale" "," "a" "volta" "per" "la"
[37] "non" "curanza" "del" "imballaggio" "e" "o" "del" "corriere" "arrivare"
[46] "prodotto" "con" "la" "scatola" "rovinare" "e" "se" "essere" "regale"
[55] "per" "altro" "persona" "non" "essere" "molto" "presentabile" "," "parlare"
[64] "anche" "del" "prodotto" "mancare" "che" "a" "volta" "non" "arrivare"
[73] "perché" "esaurito" "o" "arrivare" "in" "un" "secondo" "momento" "perché"
[82] "al" "momento" "non" "disponbili" "in" "magazzino" "se" "servire" "con"
[91] "urgenza" "bisgona" "sempre" "preparare" "un" "piano" "b" "." "."
[100] "INTERRUPTw" "."
>
the accented letters are printed as combinations of characters, but they should be right:
ù = ù
à = à
é = é
(However, the result is the same even if encoding="UTF-8" is not added)
but now the function
kRp2VCorpus <- function(obj){
  thisText <- VCorpus(
    VectorSource(
      kRp.text.paste(obj)
    ),
    readerControl=list(language=language(obj))
  )
  return(thisText)
}
# then use the function like this on a tagged text object:
tmCorpusObject1 <- kRp2VCorpus(tmCorpusObject0)
does not return the lemmas but the tokens
lapply(tmCorpusObject1[1], as.character)
$1
[1] "spesso alcuni dei prodotti migliori non vengono più riassortiti e si fatica a trovarne di simili per colore e o qualità , alcuni colori visti da catalogo differiscono dal prodotto reale, a volte per la non curanza degli imballaggi e o del corriere arrivano prodotti con le scatole rovinate e se sono regali per altre persone non è molto presentabile, parlando anche dei prodotti mancanti che a volte non arrivano perché esauriti o arrivano in un secondo momento perché al momento non disponbili in magazzino se servono con urgenza bisgona sempre prepararsi un piano b. . INTERRUPTw. "
> the accented letters are printed as combinations of characters, but they should be right
does this mean they look funny here on gitHub, or even in your R session? if R doesn't show them correctly, i'm afraid i'm not finished fixing this ;-) could be i've now fixed the output file, but that on windows, getting the tagged input back into koRpus is still broken.
> (However, the result is the same even if encoding="UTF-8" is not added)
yes, that's the way it should be.
> but now the function [...] does not return the lemmas but the tokens
hm, i suppose it always has, because kRp.text.paste() always returns tokens (and i haven't touched that function or any object classes). if you only want the lemmata back, you could replace kRp.text.paste() with something like taggedText(obj)[["lemma"]] or paste(taggedText(obj)[["lemma"]]).
> the accented letters are printed as combinations of characters, but they should be right
even in my R session..
but I think the accents have been kept because if I transform the kRp.tagged object into a txt file:
write.table(tmCorpusObject0@TT.res$lemma, ".\\tmCorpusObject.txt")
I get:
"x"
"1" "spesso"
"2" "alcuni"
"3" "del"
"4" "prodotto"
"5" "migliore"
"6" "non"
"7" "venire"
"8" "più"
"9" "riassortiti"
"10" "e"
"11" "si"
"12" "faticare"
"13" "a"
"14" "trovare"
"15" "di"
"16" "simile"
"17" "per"
"18" "colore"
"19" "e"
"20" "o"
"21" "qualità"
with the right accented letters
> but now the function [...] does not return the lemmas but the tokens
> hm, i suppose it always has
You are right. and now with:
kRp3VCorpus <- function(obj){
  thisText <- VCorpus(
    VectorSource(
      paste(taggedText(obj)[["lemma"]])
    ),
    readerControl=list(language=language(obj))
  )
  return(thisText)
}
I have the lemmas... but still with combinations of characters in place of accented letters, and every token is a document (is it possible to separate the sentences, knowing that at the end of each sentence I added the word "INTERRUPTw"?)
sorry i didn't reply earlier!
when you're using tokenize() or treetag(), you shouldn't have to mark sentences manually. you can use the POS tags indicating sentence ending punctuation for that (try kRp.POS.tags("it", tags="sentc") or kRp.POS.tags("it", tags="sentc", list.tags=TRUE) to get the tags you need for this). adding your own token for that will probably only invalidate all statistics for the text, because it is counted as a word belonging to the next sentence.
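an untested sketch of how that could be combined with the lemma approach from above (the "tag" column name and the returned tag set are assumptions on my side, so treat it as a starting point only):
lemmaSentences <- function(obj, lang="it"){
  tt <- taggedText(obj)                                      # one row per token: token, tag, lemma, ...
  sentcTags <- kRp.POS.tags(lang, tags="sentc", list.tags=TRUE)
  sentIdx <- cumsum(tt[["tag"]] %in% sentcTags)              # count sentence-ending tags
  sentIdx <- c(0, head(sentIdx, -1))                         # keep the punctuation with its own sentence
  unlist(lapply(split(tt[["lemma"]], sentIdx), paste, collapse=" "))
}
# one lemmatised sentence per document; collapse=" " also avoids the
# "every token becomes its own document" effect of paste() without collapse
sentCorpus <- VCorpus(VectorSource(lemmaSentences(tmCorpusObject0)))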
but this seems to be a different issue than the one this started off with. can we close this ticket?
Yes of course. Thanks
I have a problem moving from a tm object to a koRpus object. I have to normalize a corpus with tm tools, lemmatize the results with koRpus, and return to tm to categorize the results. In order to do this I have to transform the tm object into an R data frame, which I then transform into an Excel file, then into a txt file, and finally into a koRpus object. This is the code:
Then I need to do it all backwards to get back to tm. This is the code:
Is there a more elegant solution to do this without leaving R? I'd really appreciate any help or feedback.
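A possible all-in-R route, pulled together from the suggestions earlier in this thread (untested sketch; the paths, presets and object names are the ones used above, and newer koRpus releases may additionally need a language package such as koRpus.lang.it):
library(tm)
library(koRpus)
# tag every document of the tm corpus directly, no xlsx/csv detour
taggedList <- lapply(Corpus.TotPOS16, function(x){
  treetag(enc2utf8(content(x)), format="obj", treetagger="manual", lang="it",
    sentc.end = c(".", "!", "?", ";", ":"), encoding="UTF-8",
    TT.options=list(path="C:/TreeTagger", preset="it", no.unknown=TRUE))
})
# back to tm: one lemmatised document per original document
lemmaCorpus <- VCorpus(VectorSource(
  sapply(taggedList, function(obj) paste(taggedText(obj)[["lemma"]], collapse=" "))
))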