mekkim closed 8 years ago
thanks for reporting this, especially for providing useful examples! i'll try to replicate the issue and see how it could be fixed.
i can replicate the issue on windows, but not on GNU/linux. i can't check on OS X at the moment, but i'd expect the result to be similar to GNU/linux. are you running your analyses on windows?
if this is a problem caused by the operating system not handling UTF-8 well, i'm not sure this could/should be fixed in the package. but i'll keep trying.
as i went deeper into this, i think i found the core of the problem. it's not strsplit(), but the return value of the shell() call. on windows, it returns malformed text like this:
[2246] "337A\tNP\t<unknown>"
[2247] "¥¥"
[2248] "º¥¥"
[2249] "*ÿÀ3xÿÀ3z\tNP\t<unknown>"
this is TreeTagger's output of the same passage when invoked on the command line (windows 7):
337A NP 337A
┬Ñ┬Ñ⌂┬║┬Ñ┬Ñ⌂* NP ┬Ñ┬Ñ⌂┬║┬Ñ┬Ñ⌂*
À3x NP À3x
À3z NP À3z
the original text, however, looks like this:
337A ¥¥º¥¥*ÿÀ3xÿÀ3z
you see that there's one token broken into three. on GNU/linux, this is what i get in the end:
337A NP <unknown>
¥¥\177º¥¥\177*ÿÀ3xÿÀ3z NP <unknown>
which to me looks pretty close to the original input. "\177" is the octal escape for the ASCII delete character (DEL), which really does seem to be in the report: i've checked by running the text through "od -c". unfortunately, there is no way to fix this in koRpus. judging from the TreeTagger output above, which doesn't represent the input very well in the first place, this has to be dealt with somewhere else.
i'll close the issue. i recommend filing this to the developer of TreeTagger.
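For anyone who wants to check their own texts for stray characters like the "\177" above, here is a minimal sketch in Python (not the thread's R or Perl, and not part of koRpus or TreeTagger). The helper name `find_control_chars` is hypothetical, and the sample string is reconstructed from the GNU/linux output quoted above, so treat it as an approximation of the original report text.

```python
import unicodedata


def find_control_chars(text):
    """Return (index, escaped character) pairs for control characters.

    Tabs and newlines are skipped because they are expected separators;
    everything else in Unicode category "Cc" (including DEL, 0x7F) is reported.
    """
    hits = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Cc" and ch not in "\t\n":
            hits.append((i, repr(ch)))
    return hits


# Reconstructed sample: octal \177 is the same byte as hex \x7f (DEL).
sample = "337A ¥¥\x7fº¥¥\x7f*ÿÀ3xÿÀ3z"

for index, escaped in find_control_chars(sample):
    print(index, escaped)
```

This does roughly what eyeballing the "od -c" output does: it makes invisible characters visible along with their positions.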
Ah! Glad it wasn't strsplit(). That would have been a pain! Yes, I was running it on Windows, but didn't notice the discrepancy you found (great job!).
Thanks very much for taking the time to troubleshoot this further. I'll see if I can find a solution for the Windows shell problems and report the bug to TreeTagger.
Much appreciated!
Upon further testing, I've figured out how to fix the bug in TreeTagger. In the Perl script tokenize.pl, which appears in the cmd directory of the default TreeTagger install, on lines 94-95, I replaced the lines:
# replace newlines and tab characters with blanks
tr/\n\t/ /;
with the lines:
# replace newlines and tab and delete characters with blanks
tr/\n\t\177/ /;
That solved the issue!
Better yet, to make sure that control characters other than \177 aren't missed either, I switched to this version, which also works (and probably catches other errors that we haven't yet seen):
# replace newlines, tabs and any other control characters with spaces
s/[\n\t\p{XPosixCntrl}]/ /g;
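If patching tokenize.pl is not an option, the same idea can be applied to the text before it ever reaches the tagger. The following is a rough Python sketch of that pre-cleaning step, not the Perl fix itself. It assumes that \p{XPosixCntrl} corresponds to Unicode general category "Cc", which is my understanding of the Perl documentation; the function name `blank_control_chars` is made up for illustration.

```python
import unicodedata


def blank_control_chars(text):
    """Replace every control character (Unicode category "Cc") with a space.

    Newline, tab and DEL all belong to "Cc", so this covers the same cases
    as the substitution above without listing them one by one.
    """
    return "".join(
        " " if unicodedata.category(ch) == "Cc" else ch
        for ch in text
    )


cleaned = blank_control_chars("337A ¥¥\x7fº¥¥\x7f*ÿÀ3xÿÀ3z")
print(repr(cleaned))
```

Like the Perl version, this turns each control character into a blank, so a token containing DEL is split at that position instead of being passed through intact.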
FYI, Dr. Helmut Schmid received my bug report and confirmed that he has added the fix to the TreeTagger code. Yay!
+1
R version 3.2.3, koRpus version 0.06-4 (latest dev branch)
When processing certain texts, koRpus::treetag returns an error about invalid tags. The specific "invalid tags" vary based on the text itself. Prior to the error, it throws a warning which seems to suggest that strsplit is not properly splitting the results from the external tree-tagger program, leading to the matrix call building the columns incorrectly (see details below and attached file for examples of wrong columns in returned matrix).
I first checked to ensure that the external tree-tagger program was returning a valid tagged output with these texts. It was fine.
Upon further testing, I was able to reproduce it with 3 different texts, all with similar strsplit-related warnings screwing up which columns are token, tag, and lemma. Unsurprisingly, that then screws up the tag reading since it's reading a value in a column that was populated incorrectly.
The statement is on line 425 of treetag.R in the koRpus source code:
tagged.mtrx <- matrix(unlist(strsplit(tagged.text, "\t")), ncol=3, byrow=TRUE, dimnames=list(c(),c("token","tag","lemma")))
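To see why a single malformed line can shift every following column, here is a small Python sketch (not R, and not the koRpus code) that roughly mimics the flatten-then-reshape logic of the statement above. The input lines are the malformed ones quoted earlier in this thread; the helper names `rows_of_three` and `malformed_lines` are hypothetical. One known difference: as far as I know, R's matrix() would recycle values to fill the last row and emit a warning, whereas this sketch just leaves the last row short.

```python
def rows_of_three(lines):
    """Flatten all tab-separated fields, then cut them into rows of three.

    This is the same idea as unlist(strsplit(x, "\t")) followed by
    matrix(..., ncol=3, byrow=TRUE): row boundaries of the input are lost.
    """
    fields = [field for line in lines for field in line.split("\t")]
    return [fields[i:i + 3] for i in range(0, len(fields), 3)]


def malformed_lines(lines):
    """Return (line number, line) for lines without exactly three fields."""
    return [
        (n, line)
        for n, line in enumerate(lines)
        if len(line.split("\t")) != 3
    ]


tagged = [
    "337A\tNP\t<unknown>",
    "¥¥",
    "º¥¥",
    "*ÿÀ3xÿÀ3z\tNP\t<unknown>",
]

for row in rows_of_three(tagged):
    print(row)
print(malformed_lines(tagged))
```

In the second row, a token fragment ends up in the tag column, which would explain an error about invalid tags. Checking the field count per line before reshaping, as `malformed_lines` does, points directly at the offending input lines.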
strsplit is almost certainly the culprit, probably mis-splitting certain special cases. The texts in question all have bug report data in them including special characters (UTF-8). I confirmed that the text is being passed to koRpus::treetag with UTF-8 encoding. I also experimented with adding the useBytes=TRUE and/or perl=TRUE options to strsplit and recompiling. No change.
In searching for an alternative to strsplit, I came across package stringi, which appears to be specifically targeted at proper UTF-8 handling, possibly better (and almost certainly faster) than strsplit: Stringi homepage. In particular its stri_split functions look like they may be able to provide an inline replacement for the whole matrix line as they can return a matrix constructed by row in a single command. I haven't yet experimented with it, but I wonder if it may be a solution.
In summary, these warnings & resulting errors appear only on a very small portion of the bug reports that I'm processing (maybe 0.1% or so, of around 1M; the three provided are a tiny subset, and there are actually thousands that won't process in my whole dataset), which leads me to suspect it's related to strsplit not properly handling a rare/unused UTF-8 character that only appears in a small portion of bug reports, possibly when the bug reports include hex-encoded binary dumps for troubleshooting (though not even in all such cases).
Attached you will find a file with the 3 texts that trigger the error/warning, organized as bug_id, full text content enclosed in ========= for endpoint clarity, followed by the warning + error messages. Each bug is separated by ---------------- for start/endpoint clarity. If you wish, I can also provide similar texts that process fine and don't result in the warning + error.
Please let me know if I can provide any other useful information for debugging.
Thanks once again for the excellent package! ^_^
bug descriptions triggering koRpus_treetag warning & error.txt