unDocUMeantIt / koRpus

An R Package for Text Analysis
GNU General Public License v3.0

TreeTagger working with a dataset in R #32

Closed olga-chemod closed 3 years ago

olga-chemod commented 3 years ago

Hello! I have a dataset which contains tweets in Russian and the IDs of those tweets. I want to lemmatize the text of the tweets and get a dataset with the tweet IDs and the lemmas of these tweets. The problem is that, if I understand it correctly, TreeTagger works only with files, not with R datasets. So I exported my dataset as a txt file with only the text of the tweets, not their IDs, because if I export the dataset with both text and IDs, nothing works. With the code below I got the result shown in the picture. With this result, I cannot identify the ID of a tweet. What can I do?

# perform POS tagging
set.kRp.env(
    TT.cmd="C://TreeTagger/bin/tag-russian.bat",
    lang="ru",
    encoding="UTF-8"
)
postagged <- treetag(
    "C:/Users/Ольга/Documents/new16_20.txt",
    treetagger="manual",
    lang="ru",
    TT.options=list(
        path=file.path("C://TreeTagger"),
        preset="ru"
    )
)
data <- postagged@tokens

unDocUMeantIt commented 3 years ago

if i understand you correctly, you would like doc_id to show you the ID of each tweet, but the document new16_20.txt includes all tweets in one file, right?

indeed, treetag() currently does not support data frames directly; the focus of koRpus is on single texts only. check out the extension package tm.plugin.koRpus, which allows you to import multiple texts as one corpus. you'd then either have to split your tweets into one file per tweet and name each file after the ID you want to use as doc_id, or make sure your data frame is TIF compliant (i.e., two columns named doc_id and text); then you can use tm.plugin.koRpus::readCorpus() to import all tweets with one single call (that function does support text in data frame format).
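for the one-file-per-tweet route, a minimal base-R sketch (tweets_df, its example rows, and out_dir are hypothetical; only the column names doc_id and text follow the TIF convention):

```r
# hypothetical TIF-style data frame: one row per tweet
tweets_df <- data.frame(
    doc_id = c("1001", "1002", "1003"),
    text   = c("first tweet", "second tweet", "third tweet"),
    stringsAsFactors = FALSE
)

# write each tweet to its own file, named after its ID,
# so the file name can later serve as the doc_id
out_dir <- file.path(tempdir(), "tweets")
dir.create(out_dir, showWarnings = FALSE)
for (i in seq_len(nrow(tweets_df))) {
    writeLines(
        tweets_df[i, "text"],
        file.path(out_dir, paste0(tweets_df[i, "doc_id"], ".txt"))
    )
}
```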

i'd also recommend using the presets if possible:

set.kRp.env(
    TT.cmd="manual",
    TT.options=list(
        path="C://TreeTagger",
        preset="ru"
    ),
    lang="ru"
)

encoding is ignored by set.kRp.env().

you should also use the getter methods in scripts, e.g., taggedText(postagged) instead of postagged@tokens, so your code does not depend on the inner structure of the object (it has changed from time to time).
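a quick illustration of the difference (postagged is the object from the code above; this sketch assumes koRpus is loaded and a tagged object exists):

```r
# fragile: relies on an internal slot name, which may change between versions
tokens_df <- postagged@tokens

# robust: the public getter method keeps working across versions
tokens_df <- taggedText(postagged)
```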

olga-chemod commented 3 years ago

Thanks a lot, it worked. I also wonder if it is possible to manually assign a class and fix the lemmas of words which have the class 'unknown'. For example, "коронавирус" (coronavirus) has the wrong lemma.

unDocUMeantIt commented 3 years ago

if by class you mean the column wclass, there is no special method to change these values directly. that is because wclass is defined in the language package in conjunction with the POS tags. however, the method koRpus::correct.tag() can be used to fix both wrong POS tags and lemmas, and thereby also alter wclass indirectly.
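a hedged sketch of what that might look like (the row number, tag value, and lemma below are made up for illustration; look up the actual row in taggedText(postagged) and a valid tag from the Russian tag set first):

```r
# suppose row 17 of the tagged text holds "коронавирус" with wclass "unknown";
# assign a POS tag and the correct lemma in one call
postagged <- correct.tag(
    postagged,
    row=17,
    tag="Ncmsnn",          # hypothetical tag from the "ru" tag set
    lemma="коронавирус"
)
```

since wclass is derived from the tag, fixing the tag this way also updates wclass indirectly.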

if your issue is resolved, we should close this ticket.

olga-chemod commented 3 years ago

Thank you. Also, when I run readCorpus() I get 'Error in 1:max.value : result would be too long a vector'. What is the maximum length of a vector? The dataset that I use contains about 170,000 observations.

unDocUMeantIt commented 3 years ago

hm, hard to tell from here where the problem occurs, but this error comes from base R and is not specific to koRpus. a matrix is also represented as a vector internally, so with large tables you might hit R's limits.

each of your conversations (tweets, i assume) is being tokenized first, so you will end up with a data frame that has 170k * mean(tokens per tweet) rows, could be something around 8 or 9 million rows. multiply that by the number of columns.

you could split your input data into chunks of various sizes and test how much readCorpus() can take. if you find the limit, i'd be interested in some statistics.
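a base-R sketch of the chunking idea (the chunk size and the toy data frame are made up; substitute your real TIF data frame):

```r
# toy stand-in for the real tweet data frame
tweets_df <- data.frame(
    doc_id = as.character(seq_len(25000)),
    text   = "example tweet",
    stringsAsFactors = FALSE
)

# split into chunks of at most chunk_size rows each
chunk_size <- 10000
chunks <- split(tweets_df, ceiling(seq_len(nrow(tweets_df)) / chunk_size))

# each chunk could then be passed to readCorpus() separately
length(chunks)  # → 3
```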

unDocUMeantIt commented 3 years ago

do you consider this issue closed?