unDocUMeantIt / koRpus

An R Package for Text Analysis
GNU General Public License v3.0

Improperly created matrix due to bad strsplit() when processing certain texts with unusual UTF-8 characters #3

Closed mekkim closed 8 years ago

mekkim commented 8 years ago

R version 3.2.3, koRpus version 0.06-4 (latest dev branch)

When processing certain texts, koRpus::treetag returns

"Error: Invalid tag(s) found: ¶¥¥, , @card@, *ÿÀ3xÿÀ3z, Mac This is probably due to a missing tag in kRp.POS.tags() and needs to be fixed. It would be nice if you could forward the above error dump as a bug report to the package maintainer!"

The specific "invalid tags" vary based on the text itself. Prior to the error, it throws a warning:

Warning in matrix(unlist(strsplit(tagged.text, "\t")), : data length [6736] is not a sub-multiple or multiple of the number of rows [2246]

which seems to suggest that strsplit is not properly splitting the results from the external tree-tagger program, leading to the matrix call building the columns incorrectly (see details below and attached file for examples of wrong columns in returned matrix).

I first checked to ensure that the external tree-tagger program was returning a valid tagged output with these texts. It was fine.

Upon further testing, I was able to reproduce it with 3 different texts, all with similar strsplit-related warnings screwing up which columns are token, tag, and lemma. Unsurprisingly, that then screws up the tag reading since it's reading a value in a column that was populated incorrectly.

The statement is on line 425 of treetag.R in the koRpus source code: tagged.mtrx <- matrix(unlist(strsplit(tagged.text, "\t")), ncol=3, byrow=TRUE, dimnames=list(c(),c("token","tag","lemma")))
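As a rough illustration of how that statement can end up with shifted columns, here is a minimal sketch that counts the tab-separated fields per line before building the matrix. The sample lines are made up (they are not actual TreeTagger output), and the names `fields`, `n.fields` and `bad.lines` are only illustrative; `tagged.text` mirrors the variable in the statement above.

```r
# Made-up stand-in for TreeTagger output; the second line has lost its tabs.
tagged.text <- c(
  "337A\tNP\t<unknown>",
  "broken-token",
  "Mac\tNP\tMac"
)

# Split each line on tabs and count the resulting fields.
fields <- strsplit(tagged.text, "\t", fixed = TRUE)
n.fields <- lengths(fields)

# Lines that do not yield exactly token, tag and lemma.
bad.lines <- which(n.fields != 3)
print(tagged.text[bad.lines])

# Build the matrix from well-formed lines only, so byrow=TRUE cannot shift columns.
tagged.mtrx <- matrix(
  unlist(fields[n.fields == 3]),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("token", "tag", "lemma"))
)
print(tagged.mtrx)
```

Dropping the malformed lines is meant as a diagnostic aid for locating them, not as a fix: the tokens on those lines are lost.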

strsplit is almost certainly the culprit, probably mis-splitting certain special cases. The texts in question all contain bug report data with special (UTF-8) characters. I confirmed that the text is being passed to koRpus::treetag with UTF-8 encoding. I also experimented with adding the useBytes=TRUE and/or perl=TRUE options to strsplit and recompiling. No change.

In searching for an alternative to strsplit, I came across the stringi package (see the stringi homepage), which appears to be specifically targeted at proper UTF-8 handling, possibly better (and almost certainly faster) than strsplit. In particular, its stri_split functions look like they may be able to provide an inline replacement for the whole matrix line, as they can return a matrix constructed by row in a single command. I haven't yet experimented with it, but I wonder if it may be a solution.

In summary, these warnings & resulting errors appear only on a very small portion of the bug reports that I'm processing (maybe 0.1% or so, of around 1M--the three provided are a tiny subset; there are actually thousands that won't process in my whole dataset), which leads me to suspect it's related to strsplit not properly handling a rare/unused UTF-8 character that only appears in a small portion of bug reports, possibly when the bug reports include hex-encoded binary dumps for troubleshooting (though not even in all such cases).

Attached you will find a file with the 3 texts that trigger the error/warning, organized as bug_id, full text content enclosed in ========= for endpoint clarity, followed by the warning + error messages. Each bug is separated by ---------------- for start/endpoint clarity. If you wish, I can also provide similar texts that process fine and don't result in the warning + error.

Please let me know if I can provide any other useful information for debugging.

Thanks once again for the excellent package! ^_^

bug descriptions triggering koRpus_treetag warning & error.txt

unDocUMeantIt commented 8 years ago

thanks for reporting this, especially for providing useful examples! i'll try to replicate the issue and see how it could be fixed.

unDocUMeantIt commented 8 years ago

i can replicate the issue on windows, but not on GNU/linux. i can't check on OS X at the moment, but i'd expect the result to be similar to GNU/linux. are you running your analyses on windows?

if this is a problem caused by the operating system not handling UTF-8 well, i'm not sure this could/should be fixed in the package. but i'll keep trying.

unDocUMeantIt commented 8 years ago

as i went deeper into this, i think i found the core of the problem. it's not strsplit(), but the return value of the shell() call. on windows, it returns malformed text like this:

[2246] "337A\tNP\t<unknown>"                                    
[2247] "¥¥"                                                 
[2248] "º¥¥"                                               
[2249] "*ÿÀ3xÿÀ3z\tNP\t<unknown>"

this is TreeTagger's output of the same passage when invoked on the command line (windows 7):

337A    NP      337A
┬Ñ┬Ñ⌂┬║┬Ñ┬Ñ⌂*   NP      ┬Ñ┬Ñ⌂┬║┬Ñ┬Ñ⌂*
À3x    NP      À3x
À3z    NP      À3z

the original text, however, looks like this:

337A  ¥¥º¥¥*ÿÀ3xÿÀ3z

you see that there's one token broken into three. on GNU/linux, this is what i get in the end:

337A  NP <unknown>
¥¥\177º¥¥\177*ÿÀ3xÿÀ3z  NP <unknown>

which to me looks pretty close to the original input -- "\177" is the octal code for the ASCII delete character, which actually seems to be in the report (i've run the text through "od -c"). unfortunately, there is no way to fix this in koRpus. judging from the TreeTagger output above, which doesn't represent the input very well in the first place, this has to be dealt with somewhere else.
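for anyone who wants to check their own texts for such characters from within R instead of "od -c", a minimal sketch follows. the sample string is made up to resemble the passage above, and the names `cp`, `is.ctrl` and `ctrl.pos` are only illustrative.

```r
# Made-up sample resembling the passage above; the pieces are pasted together
# because R does not allow Unicode and octal escapes in one string literal.
txt <- paste0("337A ", "\u00a5\u00a5", "\177", "\u00ba\u00a5\u00a5", "\177", "*")

# Unicode code points of every character in the string.
cp <- utf8ToInt(txt)

# ASCII control characters: code points below 32, plus 127 (delete, octal 177).
is.ctrl <- cp < 32L | cp == 127L

# Character positions of the control characters within the string.
ctrl.pos <- which(is.ctrl)
print(ctrl.pos)
```

a non-empty `ctrl.pos` suggests the text may hit the same TreeTagger problem on windows.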

i'll close the issue. i recommend filing this to the developer of TreeTagger.

mekkim commented 8 years ago

Ah! Glad it wasn't strsplit(). That would have been a pain! Yes, I was running it on Windows, but didn't notice the discrepancy you found (great job!).

Thanks very much for taking the time to troubleshoot this further. I'll see if I can find a solution for the Windows shell problems and report the bug to TreeTagger.

Much appreciated!

mekkim commented 8 years ago

Upon further testing, I've figured out how to fix the bug in TreeTagger. In the Perl script tokenize.pl, which is in the cmd directory of the default TreeTagger install, on lines 94-95, I replaced the lines:

  # replace newlines and tab characters with blanks
  tr/\n\t/  /;

with the lines:

  # replace newlines and tab and delete characters with blanks
  tr/\n\t\177/   /;

That solved the issue!

mekkim commented 8 years ago

Better yet, to ensure that all other control characters aren't missed besides \177, I switched to this value, which also works (and probably catches other errors that we haven't yet seen):

  # replace newlines, tabs, and any other control characters with spaces
  s/[\n\t\p{XPosixCntrl}]/ /g;
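
For anyone who cannot patch tokenize.pl, a similar cleanup could probably be done on the R side before the text reaches the tagger. This is only a sketch under that assumption: `strip.cntrl` is a hypothetical helper, not part of koRpus or TreeTagger, and I have not run it against the full dataset.

```r
# Hypothetical helper (not part of koRpus): replace control characters,
# including newline, tab and delete, with spaces.
strip.cntrl <- function(x) {
  gsub("[[:cntrl:]]", " ", x, perl = TRUE)
}

# Made-up input with a delete character (octal 177) and a tab.
raw <- "foo\177bar\tbaz"
clean <- strip.cntrl(raw)
print(clean)
```

Since newlines are replaced as well, the helper would need to be applied line by line if the line structure of the input matters.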

mekkim commented 8 years ago

FYI, Dr. Helmut Schmid received my bug report and confirmed that he has added the fix to the TreeTagger code. Yay!

unDocUMeantIt commented 8 years ago

+1