unDocUMeantIt / koRpus

An R Package for Text Analysis
GNU General Public License v3.0

incomplete import of LCC corpus #20

Closed friederikebusse closed 4 years ago

friederikebusse commented 5 years ago

Hey,

I want to do a frequency analysis with my text, using an LCC corpus [http://wortschatz.uni-leipzig.de/en/download/]. Unfortunately koRpus only reads in a fraction of the lines, and the frequency analysis then fails (the indices are mostly "NA"). I've tried different corpora; the import is always incomplete, and the number of lines that are read in varies with each corpus, so I don't see a pattern. When reading in the corpus I get the following output.

LCC.data <- read.corp.LCC("deu_news_2015_1M.tar")

Output:

Fetching needed files from LCC archive... done.
Warning messages:
1: In readLines(LCC.file.con, n = n) :
  invalid input found on input connection 'C:\Users\rieke\AppData\Local\Temp\RtmpKwiXOi\koRpus.LCC3d9049ac4d87/deu_news_2015_1M/deu_news_2015_1M-words.txt'
2: In readLines(LCC.file.con, n = n) :
  incomplete final line found on 'C:\Users\rieke\AppData\Local\Temp\RtmpKwiXOi\koRpus.LCC3d9049ac4d87/deu_news_2015_1M/deu_news_2015_1M-words.txt'
3: In matrix(unlist(strsplit(rL.words, "\t")), ncol = 4, byrow = TRUE, :
  data length [77246] is not a sub-multiple or multiple of the number of rows [19312]
4: In read.corp.LCC("deu_news_2015_1M.tar") :
  This looks like a newer LCC archive with four columns in the *-words.txt file. The two word columns did not match, but we'll only use the first one!
5: In create.corp.freq.object(matrix.freq = table.words, num.running.words = num.running.words, :
  NAs introduced by coercion

From this corpus, R always reads in the first 19312 lines. The corpus actually has about 700 thousand lines. Any word after the 19312th line can't be found.

Example:

query(LCC.data, "word", "Proton")

[1] num word lemma tag wclass lttr freq pct pmio log10 rank.avg rank.min
[13] rank.rel.avg rank.rel.min inDocs idf
<0 Zeilen> (oder row.names mit Länge 0)

I can't imagine that the corpus is too big. Michalke used a corpus of 1 million sentences in his manual as well... I am thankful for any hint!

Friedi
unDocUMeantIt commented 5 years ago

hi,

i tried to reproduce the issue and downloaded a fresh copy of the deu_news_2015_1M.tar.gz archive. but on my machine it is imported as expected:

> LCC.de <- read.corp.LCC("deu_news_2015_1M.tar.gz")
Fetching needed files from LCC archive... done.
> dim(LCC.de@words)
[1] 780213     16

it takes about 16 seconds.

i have a hunch this is just another character encoding fuckup. i tried this on GNU/linux, you seem to be using windows. the offending "word" in line 19312 of the data frame is the character , which i assume is throwing off R on windows somehow.

i find that odd because read.corp.LCC() by default uses fileEncoding = "UTF-8".
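as a first thing to experiment with (just a sketch, and "latin1" is only a guess for a windows locale), you could try overriding that default:

LCC.data <- read.corp.LCC(
  "deu_news_2015_1M.tar",
  # guess: a windows native encoding instead of the UTF-8 default
  fileEncoding = "latin1"
)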

what version of windows, R and koRpus are you using?

friederikebusse commented 5 years ago

Thanks for your quick response!

I am using windows 10, RStudio Version 1.1.463 and koRpus Version 0.11-5

unDocUMeantIt commented 5 years ago

RStudio Version 1.1.463

i'll assume that comes with a recent version of R.

i've seen issues before that were caused by something in the environment RStudio provides. so we should first rule that out. can you please start a plain R session without RStudio and try the same code again? do you run into the same encoding issue?

if not, we could blame RStudio ;) otherwise, we should first confirm that we've tracked down the core of the issue. from the error message it looks like readLines() fails to parse the file deu_news_2015_1M-words.txt which was extracted from the .tar.gz archive.

so what you could do is extract the file manually from the archive and try to replicate the failing readLines() call. i would assume that the encoding option is the one to experiment with to get the call to produce reliable results (please adjust paths accordingly):

untar(
  "deu_news_2015_1M.tar.gz",
  files="deu_news_2015_1M/deu_news_2015_1M-words.txt",
  exdir="<output directory>"
)

deu <- readLines(
  file.path(
    "<output directory>",
    "deu_news_2015_1M",
    "deu_news_2015_1M-words.txt"
  ),
  encoding="UTF-8"
)
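once that runs, a quick sanity check (just a sketch) would be to look at the number of lines and whether the umlauts survived:

# rough check: the full word list should have far more than ~19312 lines
length(deu)
# and non-ASCII characters like ä/ö/ü should be displayed correctly
head(deu, 20)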
friederikebusse commented 5 years ago

Thanks a lot for helping me out!

So I deleted the old versions of R and RStudio and installed the updated versions. First, I tried the same code again in plain R; this did not work, and I ended up with the same problem.

I then used your code to extract the necessary files manually. That worked. The variable "deu" indeed has 1175642 lines (not the 780213 lines you got, because I switched to a wikipedia corpus, sorry for the chaos).

The following line worked to read in the LCC data:

LCC.data <- read.corp.LCC(
  "deu_wikipedia_2016_1M",
  fileEncoding = "UTF-8",
  prefix = "deu_wikipedia_2016_1M-"
)

But the data is still only 45 lines long, stopping at some strange character.

i would assume that the encoding option is the one to experiment with to get the call to produce reliable results

...What did you mean by that?

concerning the readLines() call: the following worked

lines <- readLines(file('deu_wikipedia_2016_1M/deu_wikipedia_2016_1M-words.txt'))

but when looking at the words which are read in, any character which is not an ASCII character, like ä/ö/ü, is not read correctly
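Would a check like the following help to narrow it down (base R only, just a sketch)?

# sketch: are the raw bytes valid UTF-8 at all?
sum(!validUTF8(lines))
# and does simply declaring the encoding fix the display?
Encoding(lines) <- "UTF-8"
head(lines, 20)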

Do you have an idea what else to do?

Thanks again!!

unDocUMeantIt commented 5 years ago

First, I tried the same code again in plain R, this did not work, I ended up with the same problem.

ok, so we can rule that one out; this is indeed a character encoding issue.

But the data is still only 45 lines long, stopping at some strange character sign.

yeah, it stops at the first "broken" character it reads, because it doesn't know how to proceed. the files inside the LCC archive are clearly in UTF-8 format, meaning the task at hand is to get R to import it properly. we need to figure out how readLines() must be configured so the imported text is displayed correctly.

by encoding option i meant playing around with the encoding argument of the readLines() function. by default, koRpus sets it to UTF-8, which in your case obviously leads to incomplete results. so trying to change the encoding when you import the raw text with readLines() looks like a reasonable way to find a solution.

alternatively, could you try to nest your readLines() call in enc2utf8() (i.e., enc2utf8(readLines(...))) and check the umlauts again?
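a rough sketch of the variants i mean (the paths are taken from your example, the encoding values are just things to experiment with):

# variant 1: tell readLines() to mark the input as UTF-8
deu <- readLines(
  "deu_wikipedia_2016_1M/deu_wikipedia_2016_1M-words.txt",
  encoding = "UTF-8"
)

# variant 2: set the encoding on the connection instead, which
# re-encodes the text while reading
con <- file(
  "deu_wikipedia_2016_1M/deu_wikipedia_2016_1M-words.txt",
  encoding = "UTF-8"
)
deu <- readLines(con)
close(con)

# variant 3: read with default settings and convert to UTF-8 afterwards
deu <- enc2utf8(readLines(
  "deu_wikipedia_2016_1M/deu_wikipedia_2016_1M-words.txt"
))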

unDocUMeantIt commented 4 years ago

are you still working on this? if not, i'll close the issue.

friederikebusse commented 4 years ago

No, I'm not; thanks for closing the issue!
