Closed MarcinKosinski closed 7 years ago
the path
to TreeTagger is wrong. try an absolute path beginning with the drive letter.
Hello, thanks for the fast reply. I have TreeTagger both in the repository I am currently working in and in the C:/ directory:
> list.files('C:/TreeTagger')
[1] "bin" "cmd" "INSTALL.txt" "INSTALL.txt~" "lib" "README.txt"
> list.files('/TreeTagger')
character(0)
> list.files('TreeTagger')
[1] "bin" "cmd" "INSTALL.txt" "INSTALL.txt~" "lib" "README.txt"
For the absolute path the result is the same:
> tagged.results <- treetag(c("run", "ran", "running"), treetagger="manual", format="obj",
+ TT.tknz=FALSE , lang="en",
+ debug = TRUE,
+ TT.options=list(path="C:/TreeTagger", preset="en"))
split=[[:space:]]
ign.comp=-
heuristics=abbr
heur.fix=c("’", "'"), c("’", "'")
sentc.end=., !, ?, ;, :
detect=FALSE, FALSE
clean.raw=
perl=FALSE
stopwords=
stemmer=
Assuming 'UTF-8' as encoding for the input file. If the results turn out to be erroneous, check the file for invalid characters, e.g. em.dashes or fancy quotes, and/or consider setting 'encoding' manually.
TT.tokenizer: koRpus::tokenize()
tempfile: C:\Users\Marcin\AppData\Local\Temp\Rtmp2PQ5Ts\tokenizef9422ad2921.txt
file: C:\Users\Marcin\AppData\Local\Temp\Rtmp2PQ5Ts\tempTextFromObjectf947302209a.txt
TT.lookup.command:
TT.pre.tagger:
TT.tagger: C:/TreeTagger/bin/tree-tagger.exe
TT.opts: -token -lemma -sgml -pt-with-lemma -quiet
TT.params: C:/TreeTagger/lib/english-utf8.par
TT.filter.command: | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;'
sys.tt.call: type C:\Users\Marcin\AppData\Local\Temp\Rtmp2PQ5Ts\tokenizef9422ad2921.txt | C:/TreeTagger/bin/tree-tagger.exe C:/TreeTagger/lib/english-utf8.par -token -lemma -sgml -pt-with-lemma -quiet | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;'
Error in matrix(unlist(strsplit(tagged.text, "\t")), ncol = 3, byrow = TRUE, :
'data' must be of a vector type, was 'NULL'
In addition: Warning message:
running command 'C:\Windows\system32\cmd.exe /c type C:\Users\Marcin\AppData\Local\Temp\Rtmp2PQ5Ts\tokenizef9422ad2921.txt | C:\TreeTagger\bin\tree-tagger.exe C:\TreeTagger\lib\english-utf8.par -token -lemma -sgml -pt-with-lemma -quiet | perl -pe 's\\tV[BDHV]\\tVB\;s\IN\\that\\tIN\;'' had status 9
I have installed Perl and downloaded the english-utf8.par file (it is included in the lib/ directory).
i see (and i was wondering why treetag()
didn't complain about missing files, but of course it won't if they're not missing...).
what's inside the tagged.results
object? if all goes well, it's a matrix with three columns.
can you open a command line and execute the full line after sys.tt.call:
, beginning with type
? what does it return? the was "NULL"
error usually occurs if TreeTagger doesn't return what koRpus is expecting, which is a character vector with tab separation (should look like three columns in the terminal).
[the command does work on my linux machine. but apart from your actual issue, TT.tknz=FALSE seems to cut off the last character of the input vector -- i need to investigate this.]
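To make the "was NULL" error more concrete, here is a minimal sketch of the reshaping step named in the error message. The tagged lines are made-up sample data, not real TreeTagger output, and the empty case assumes that the external command returned no lines at all.

```r
# Hypothetical TreeTagger output: one "token<TAB>tag<TAB>lemma" string per token.
tagged.text <- c("run\tVB\trun", "ran\tVBD\trun", "running\tVBG\trun")

# The same reshaping call that appears in the error message above.
tagged.mtx <- matrix(unlist(strsplit(tagged.text, "\t")), ncol = 3, byrow = TRUE,
                     dimnames = list(NULL, c("token", "tag", "lemma")))
print(tagged.mtx)

# Assumption: a failed command yields an empty character vector.
# Splitting it gives an empty list, and unlist() turns that into NULL.
empty.result <- unlist(strsplit(character(0), "\t"))
print(is.null(empty.result))

# matrix() cannot be built from NULL, so the call stops with an error.
empty.attempt <- tryCatch(
  matrix(empty.result, ncol = 3, byrow = TRUE),
  error = function(e) conditionMessage(e)
)
print(empty.attempt)
```

So the matrix error is only a symptom: the actual failure happens earlier, in the shell command that should have produced the lines.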
As with any R call that ends in an error, the final object is not assigned, so I get the message Error: object 'tagged.results' not found.
I cannot run the command that follows sys.tt.call: (beginning with type), because it requires temporary files that I no longer have and that were probably generated from the source vector c("run", "ran", "running").
But it looks like the regular TreeTagger (not invoked from R) works properly (even though I didn't specify the final file to be lemmatized)
debug=TRUE
should actually keep the temp files as long as the R session is running. did you close your session in the meantime?
I didn't. Maybe it does not keep them when the error appears? I am lemmatizing from the command line anyway :)
no, tempfiles should be kept, at least i'm sure they were in the past, because that's the method we've used to debug these issues for a long time.
which brings me to the hypothesis that maybe generating the tempfile doesn't work for you in the first place? if the file can't be written, for whatever reason, then no tagging could be done.
have you successfully used koRpus earlier? just to see if this is something that was introduced with the last release.
have you checked that perl is in your path on the command line? even if TreeTagger works, the following perl filter might break the full call. this should also cause an error if you try to use TreeTagger's tokenizer or the batch scripts that TreeTagger is usually run with.
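Both hypotheses (temp files cannot be written, perl is not on the path) can be checked from inside the R session. This is only a sketch using base R; the file name pattern "tokenize" merely mimics the names seen in the debug output and is not taken from the koRpus code.

```r
# 1. Can R write and read back a temporary file in this session?
tmp.file <- tempfile(pattern = "tokenize", fileext = ".txt")
writeLines(c("run", "ran", "running"), tmp.file)
tmp.ok <- file.exists(tmp.file) && length(readLines(tmp.file)) == 3

# 2. Is perl reachable from R? Sys.which() returns "" if it is not found.
perl.path <- unname(Sys.which("perl"))
perl.ok <- nzchar(perl.path)

cat("temp file written:", tmp.ok, "\n")
cat("perl found:", perl.ok, "\n")
if (perl.ok) cat("perl location:", perl.path, "\n")

# Clean up the test file again.
unlink(tmp.file)
```

If either check fails, the full call printed after sys.tt.call: cannot succeed, no matter how TreeTagger itself is set up.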
@unDocUMeantIt it was the issue of not being able to create a temporary file
below is the example of another character string for which the treetag
works
> library(koRpus)
> tagged.results <- treetag(file = c('TreeTagger/texts/T1_to_be_lemmatized.txt'),
+ treetagger="manual", format="obj",
+ TT.tknz=FALSE , lang="en",
+ debug = TRUE,
+ TT.options=list(path="TreeTagger", preset="en"))
split=[[:space:]]
ign.comp=-
heuristics=abbr
heur.fix=c("’", "'"), c("’", "'")
sentc.end=., !, ?, ;, :
detect=FALSE, FALSE
clean.raw=
perl=FALSE
stopwords=
stemmer=
Assuming 'UTF-8' as encoding for the input file. If the results turn out to be erroneous, check the file for invalid characters, e.g. em.dashes or fancy quotes, and/or consider setting 'encoding' manually.
TT.tokenizer: koRpus::tokenize()
tempfile: C:\Users\Marcin\AppData\Local\Temp\RtmpIxHmph\tokenize20681e628eb.txt
file: C:\Users\Marcin\AppData\Local\Temp\RtmpIxHmph\tempTextFromObject20682a867b18.txt
TT.lookup.command:
TT.pre.tagger:
TT.tagger: TreeTagger/bin/tree-tagger.exe
TT.opts: -token -lemma -sgml -pt-with-lemma -quiet
TT.params: TreeTagger/lib/english-utf8.par
TT.filter.command: | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;'
sys.tt.call: type C:\Users\Marcin\AppData\Local\Temp\RtmpIxHmph\tokenize20681e628eb.txt | TreeTagger/bin/tree-tagger.exe TreeTagger/lib/english-utf8.par -token -lemma -sgml -pt-with-lemma -quiet | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;'
> tagged.results
token tag lemma
[1,] "TreeTagger" "NN" "<unknown>"
[2,] "/" "SYM" "/"
[3,] "texts" "NNS" "text"
[4,] "/" "SYM" "/"
[5,] "T1" "NP" "<unknown>"
[6,] "_" "SYM" "_"
[7,] "to" "TO" "to"
[8,] "_" "SYM" "_"
[9,] "be" "VB" "be"
[10,] "_" "SYM" "_"
[11,] "lemmatized" "JJ" "<unknown>"
[12,] "." "SENT" "."
[13,] "txt" "NN" "<unknown>"
Maybe one should add a message if the temp file couldn't be created?
Perl adds itself to the PATH during installation.
I did succeed with the tokenize() R function previously.
Thanks for answering and for your previous time!
now, that's odd -- looks like the tempfile is not created only when you use the format="obj" option, because there is successful tempfile creation in your second example. i'll leave this open until i have a clue what's (not) happening there.
i've looked at the treetag() code but so far have no clue what could cause this. it doesn't seem to happen on GNU/linux, but that doesn't explain it. it is unlikely that tempfiles are missing, because treetag() checks for their existence.
i've installed koRpus
in a windows 10 VM and can replicate the problem.
it seems to be caused by inconsistencies between file.path()
and shell()
, something which used to work for years but now appears to be broken. try
shell(paste("dir", file.path("C:","Users")))
versus the explicit
shell(paste("dir", file.path("C:","Users", fsep="\\")))
i hope i found a workaround by forcibly replacing all instances of /
by \
in paths for windows users. could you try the develop version and tell me if this fixes the issue for you? here's how you can install it directly from github:
library("devtools")
install_github("unDocUMeantIt/koRpus", ref="develop")
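For readers who want to see what the separator workaround amounts to, here is a minimal sketch. to_shell_path() is a hypothetical helper written for this illustration; it is not the code used in koRpus, and it only assumes that cmd.exe wants backslashes while other shells keep forward slashes.

```r
# Hypothetical helper, not part of koRpus: swap separators only on Windows.
to_shell_path <- function(path, windows = .Platform$OS.type == "windows") {
  if (windows) chartr("/", "\\", path) else path
}

# file.path() joins with "/" by default, also on Windows.
tagger <- file.path("C:", "TreeTagger", "bin", "tree-tagger.exe")
writeLines(tagger)

# Forcing the Windows branch shows the form that would be passed to cmd.exe.
writeLines(to_shell_path(tagger, windows = TRUE))
```

The windows argument is only there so that the conversion can be tried on any platform; by default the helper leaves paths untouched outside Windows.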
Hi @unDocUMeantIt, thanks for the great effort.
Shouldn't file.path
be enough to be system agnostic? And fsep="\\"
looks system-specific : P
the problem here is that we're using shell()
to run commands in windows' cmd.exe
which expects \
as the file separator, but R uses /
even on windows. as i said, this hasn't been a problem before, at least not one that has ever been reported, and strangely enough, it still doesn't occur in your second example! i have no idea why windows is suddenly so picky.
of course you're right, fsep="\\"
makes file.path()
create paths that are usual for windows only, but that's the point here -- on my windows 10 machine, running dir
in shell()
on just some file.path()
does not work, i must force it to separate paths with \\
.
this is pretty ugly, but i must say it's just one more ugly thing in a vast pool of ugliness that has accumulated around windows over the years. i can only guess this is because back in the day it was literally designed to make users use microsoft products exclusively, and now when we try to do things cross-platform, we all suffer from all the "we don't care about other standards"-stuff that was common in windows for many years. i don't know. it could also be a regression bug in R, or something that should have been a bug in the past and now actually is.
for me, the most important question remains: is the problem gone now?
I think that the safe path separator for Windows is /
.
I've just installed the development version of koRpus on my second machine (Windows 10) and I am having the following problem:
split=[[:space:]]
ign.comp=-
heuristics=abbr
heur.fix=c("’", "'"), c("’", "'")
sentc.end=., !, ?, ;, :
detect=FALSE, FALSE
clean.raw=
perl=FALSE
stopwords=
stemmer=
TT.tokenizer: koRpus::tokenize()
tempfile: C:\Users\kosteck\AppData\Local\Temp\RtmpaUG7yI\tokenize218c51453171.txt
file: C:\Users\kosteck\AppData\Local\Temp\RtmpaUG7yI\tempTextFromObject218c5d48213d.txt
TT.lookup.command:
TT.pre.tagger:
TT.tagger: C:/TreeTagger/bin/tree-tagger.exe
TT.opts: -token -lemma -sgml -pt-with-lemma -quiet
TT.params: C:/TreeTagger/lib/english-utf8.par
TT.filter.command: | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;'
sys.tt.call: type C:\Users\kosteck\AppData\Local\Temp\RtmpaUG7yI\tokenize218c51453171.txt | C:\TreeTagger\bin\tree-tagger.exe C:\TreeTagger\lib\english-utf8.par -token -lemma -sgml -pt-with-lemma -quiet | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;'
Error: Awww, this should not happen: TreeTagger didn't return any useful data.
This can happen if the local TreeTagger setup is incomplete or different from what presets expected.
You should re-run your command with the option 'debug=TRUE'. That will print all relevant configuration.
Look for a line starting with 'sys.tt.call:' and try to execute the full command following it in a
command line terminal. Do not close this R session in the meantime, as 'debug=TRUE' will keep temporary
files that might be needed.
If running the command after 'sys.tt.call:' does fail, you'll need to fix the TreeTagger setup.
If it does *not* fail but produce a table with proper results, please contact the author!
In addition: Warning message:
running command 'C:\WINDOWS\system32\cmd.exe /c type C:\Users\kosteck\AppData\Local\Temp\RtmpaUG7yI\tokenize218c51453171.txt | C:\TreeTagger\bin\tree-tagger.exe C:\TreeTagger\lib\english-utf8.par -token -lemma -sgml -pt-with-lemma -quiet | perl -pe 's\\tV[BDHV]\\tVB\;s\IN\\that\\tIN\;'' had status 255
Assuming 'UTF-8' as encoding for the input file. If the results turn out to be erroneous, check the file for invalid characters, e.g. em.dashes or fancy quotes, and/or consider setting 'encoding' manually.
I cannot run the command that follows sys.tt.call
because the \
character is not recognized by the system:
$ type C:\Users\kosteck\AppData\Local\Temp\RtmpaUG7yI\tokenize218c51453171.txt | C:\TreeTagger\bin\tree-tagger.exe C:\TreeTagger\lib\english-utf8.par -token -lemma -sgml -pt-with-lemma -quiet | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;'
bash: type: C:UserskosteckAppDataLocalTempRtmpaUG7yItokenize218c51453171.txt: not found
bash: C:TreeTaggerbintree-tagger.exe: command not found
but when I change the \
sign into the /
then the treeTagger works perfectly
$ TreeTagger/bin/tree-tagger.exe TreeTagger/lib/english-utf8.par C:/Users/kosteck/AppData/Local/Temp/RtmpaUG7yI/tokenize218c51453171.txt -token -lemma -sgml -pt-with-lemma -quiet | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;' | head
good JJ good
marketing NN marketing
music NN music
industry NN industry
entertainment NN entertainment
finding VBG find
new JJ new
ways NNS way
connect VBP connect
new JJ new
The session info is below
> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250 LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C LC_TIME=Polish_Poland.1250
attached base packages:
[1] grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] koRpus_0.10-3 devtools_1.12.0 NbClust_3.0 cluster_2.0.6 factoextra_1.0.4
[6] foreach_1.4.3 openxlsx_4.0.17 networkD3_0.4 VennDiagram_1.6.17 futile.logger_1.4.3
[11] Boruta_5.2.0 ranger_0.7.0 scales_0.4.1 ggmosaic_0.1.2 productplots_0.1.1
[16] corrplot_0.77 stringr_1.2.0 magrittr_1.5 dplyr_0.5.0 purrr_0.2.2
[21] readr_1.1.0 tidyr_0.6.1 tibble_1.3.0 tidyverse_1.1.1 readxl_0.1.1
[26] haven_1.0.0 plyr_1.8.4 tables_0.8 Hmisc_4.0-2 ggplot2_2.2.1
[31] Formula_1.2-1 survival_2.40-1 lattice_0.20-34 servr_0.5 LDAvis_0.3.2
[36] lda_1.4.2 pbapply_1.3-2 data.table_1.10.4 tm_0.7-1 NLP_0.1-10
[41] stringi_1.1.5
loaded via a namespace (and not attached):
[1] nlme_3.1-131 lubridate_1.6.0 RColorBrewer_1.1-2 httr_1.2.1 tools_3.3.3
[6] backports_1.0.5 R6_2.2.0 rpart_4.1-10 DBI_0.6-1 lazyeval_0.2.0
[11] colorspace_1.3-2 nnet_7.3-12 withr_1.0.2 gridExtra_2.2.1 mnormt_1.5-5
[16] curl_2.4 git2r_0.18.0 rvest_0.3.2 htmlTable_1.9 xml2_1.1.1
[21] plotly_4.5.6 slam_0.1-40 checkmate_1.8.2 psych_1.7.3.21 digest_0.6.12
[26] foreign_0.8-67 base64enc_0.1-3 htmltools_0.3.5 htmlwidgets_0.8 jsonlite_1.4
[31] acepack_1.4.1 Matrix_1.2-8 Rcpp_0.12.10 munsell_0.4.3 RJSONIO_1.3-0
[36] parallel_3.3.3 ggrepel_0.6.5 forcats_0.2.0 splines_3.3.3 hms_0.3
[41] knitr_1.15.1 igraph_1.0.1 reshape2_1.4.2 codetools_0.2-15 futile.options_1.0.0
[46] latticeExtra_0.6-28 lambda.r_1.1.9 modelr_0.1.0 httpuv_1.3.3 gtable_0.2.0
[51] assertthat_0.2.0 broom_0.4.2 viridisLite_0.2.0 iterators_1.0.8 memoise_1.0.0
thanks for testing!
why is R running in bash
when this is windows? koRpus
checks the operating system and changes its behaviour accordingly. if the OS is windows, it assumes shell commands will be executed in cmd.exe
.
Where do you see R is running in the bash
? I run R from the RStudio to provide the error message. Then I run the bash commands to make treeTagger lemmatize my text : P
i can tell from these error messages:
bash: type: C:UserskosteckAppDataLocalTempRtmpaUG7yItokenize218c51453171.txt: not found
bash: C:TreeTaggerbintree-tagger.exe: command not found
are you running R on a linux machine, but the RStudio frontend on windows? this looks really strange.
just to be clear, the command behind sys.tt.call
is not meant to run in bash
, but in cmd.exe
. the call for the one cannot work with the other and vice versa.
The fact that I have copied 2 responses from bash
doesn't mean I run R from bash
. bash
commands are the result of not being able to work from RStudio with treeTagger.
As I wrote: I run R from RStudio. From the session info you can find out that I am using windows (and I have windows issues for the paths). Any person can use git bash
(https://git-for-windows.github.io/) to run bash
commands on windows from Windows 7
.
Your package is really great. You've done a good job. I have 2 machines with Windows (on Windows 7 I have no issues; the errors I am pasting here are from Windows 10). I will use koRpus on Windows 7 and, in the future, only from Ubuntu : )
Hi. I have almost exactly the same error message while executing:
words.cc <- treetag(file = c("cat", "cats", "catch", "catty"), treetagger="manual", format="obj",
TT.tknz=FALSE, lang="en", debug = T,
TT.options=list(path="c:/TreeTagger", preset="en"))
The error is:
split=[[:space:]]
ign.comp=-
heuristics=abbr
heur.fix=c("’", "'"), c("’", "'")
sentc.end=., !, ?, ;, :
detect=FALSE, FALSE
clean.raw=
perl=FALSE
stopwords=
stemmer=
Assuming 'UTF-8' as encoding for the input file. If the results turn out to be erroneous, check the file for invalid characters, e.g. em.dashes or fancy quotes, and/or consider setting 'encoding' manually.
TT.tokenizer: koRpus::tokenize()
tempfile: C:\Users\cp\AppData\Local\Temp\RtmpsDmpNk\tokenize17001238153f.txt
file: C:\Users\cp\AppData\Local\Temp\RtmpsDmpNk\tempTextFromObject170029114bdb.txt
TT.lookup.command:
TT.pre.tagger:
TT.tagger: c:/TreeTagger/bin/tree-tagger.exe
TT.opts: -token -lemma -sgml -pt-with-lemma -quiet
TT.params: c:/TreeTagger/lib/english-utf8.par
TT.filter.command: | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;'
sys.tt.call: type C:\Users\cp\AppData\Local\Temp\RtmpsDmpNk\tokenize17001238153f.txt | c:/TreeTagger/bin/tree-tagger.exe c:/TreeTagger/lib/english-utf8.par -token -lemma -sgml -pt-with-lemma -quiet | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;'
Error in matrix(unlist(strsplit(tagged.text, "\t")), ncol = 3, byrow = TRUE, :
'data' must be of a vector type, was 'NULL'
In addition: Warning message:
running command 'C:\Windows\system32\cmd.exe /c type C:\Users\cp\AppData\Local\Temp\RtmpsDmpNk\tokenize17001238153f.txt | c:\TreeTagger\bin\tree-tagger.exe c:\TreeTagger\lib\english-utf8.par -token -lemma -sgml -pt-with-lemma -quiet | perl -pe 's\\tV[BDHV]\\tVB\;s\IN\\that\\tIN\;'' had status 255
So, as far as I can tell, this differs only at the very end: the error status is 255, not 9. Also, I can open the temp files, and they contain the words from the input vector. Not long ago (in January) I was successfully using your great library with a previous version (I don't know which one), and this issue appeared after updating the package, updating R, or some other change in my environment, for example a Java update or, as someone suggested to me, an rJava installation (?).
I'm using Win-7, 64-bit, RStudio, R 3.3.3. I'd like to make this run. Maybe I'll try to downgrade something. It'd be useful to know what to downgrade, because otherwise I may make a total mess of my environment, and I may have to install it on other machines as well.
After some tests: unlike in the case above, I can't use a parameter like file = c('TreeTagger/texts/T1_to_be_lemmatized.txt')
in the function, because it tags the words in the path string, not the file's content. Using plain file = "c://File.txt"
does the same. And the error message from executing this string is the same.
koRpus
doesn't use java, this can't be related.
please make sure you're not using format="obj"
if file
is not the text you would like to analyze, but the path to a file with that text (then the default must be used, see ?treetag
).
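A small sketch of the distinction between text objects and file paths. guess_format() is a hypothetical helper invented for this example and not part of koRpus; it only assumes that the format argument accepts "file" (the default) and "obj", as used in the calls in this thread.

```r
# Hypothetical helper, not part of koRpus: pick a value for 'format'.
guess_format <- function(x) {
  if (length(x) == 1 && file.exists(x)) "file" else "obj"
}

# A character vector of words is the text itself.
words <- c("cat", "cats", "catch", "catty")
print(guess_format(words))

# A single string naming an existing file is a path to the text.
txt.file <- tempfile(fileext = ".txt")
writeLines(words, txt.file)
print(guess_format(txt.file))

# The value would then be handed to treetag(), for example:
# treetag(txt.file, format = guess_format(txt.file), treetagger = "manual",
#         lang = "en", TT.options = list(path = "C:/TreeTagger", preset = "en"))
unlink(txt.file)
```

Passing a path together with format="obj" makes the path string itself the text to be tagged, which matches the tagged output of "TreeTagger / texts / T1 ..." shown earlier in the thread.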
can you try and install the older version 0.06-5?
library("devtools")
install_github("unDocUMeantIt/koRpus", ref="0.06-5")
does that still work? if not, can you try the development branch:
install_github("unDocUMeantIt/koRpus", ref="develop")
does that change something for you?
Yes. It worked in the version 0.06-5 as before (this was my previous working version as I've seen in reps)
I've also installed just earlier 0.10-1 - the error was short:
Error: Specified file cannot be found:
c:/TreeTagger/cmd/utf8-tokenize.pl
like the script been forced to use my Windows locale, not currently set [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Thanks for quick response. Didn't realize the downgrading will be (temporarily I hope) the happy exit.
thanks for confirming 0.06-5 is still working on your machine. i still need to understand the root of this problem.
since that older version still calls tokenize.pl
in its presets, this could point to a too old TreeTagger installation, as recent windows versions don't ship that file any longer and replaced it with utf8-tokenize.perl
. the wrong file name in 0.10-1 which gave you the error was actually a bug that supposedly was fixed with 0.10-2. but please don't touch your TreeTagger installation just yet, until we really found the root of this!
instead, could you please do the following to help me track down the issue:
1. run your treetag() call in 0.06-5 with debug=TRUE and post the output here
2. install 0.10-2, run the same treetag() call and post the output here
3. install the develop version as explained above, run the treetag() call again and post the output here
please always add info whether the call succeeded or failed. that way i can compare the generated calls to the windows command shell to hopefully get a feeling why it breaks at some point. you can safely downgrade to 0.06-5 afterwards to keep working. i just need some info off a system where the issue is present.
could you also please have a look at your C:/TreeTagger/cmd/
directory and see if you can find a file called utf8-tokenize.perl
?
thank you!
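Checking the installation for the files named in this thread can also be done from R. check_treetagger() is a hypothetical helper for illustration only; the three relative paths are the ones that appear in the debug output and comments above, and "C:/TreeTagger" is the install location used by the reporters.

```r
# Hypothetical helper: report which of the files mentioned in this thread exist.
check_treetagger <- function(path) {
  needed <- c(
    file.path(path, "bin", "tree-tagger.exe"),
    file.path(path, "lib", "english-utf8.par"),
    file.path(path, "cmd", "utf8-tokenize.perl")
  )
  found <- file.exists(needed)
  names(found) <- needed
  found
}

# Prints a named logical vector, TRUE for every file that is present.
print(check_treetagger("C:/TreeTagger"))
```

Any FALSE entry points to an incomplete or outdated TreeTagger setup rather than to a problem in the R call.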
Ok. I've found some free time at last. So, while running v. 0.06-5:
x <- treetag(c("hat", "hot", "hit", "had", "hid"), treetagger="manual", format="obj", debug = T,
+ TT.tknz=FALSE, lang="en", TT.options=list(path="c:/TreeTagger", preset="en"))
output:
TT.tokenizer: koRpus::tokenize()
tempfile: C:\Users\cp\AppData\Local\Temp\Rtmpa81MXG\tokenize133ca792b9c.txt
file: C:\Users\cp\AppData\Local\Temp\Rtmpa81MXG\tempTextFromObject133c47a450c7.txt
TT.lookup.command:
TT.tagger: c:/TreeTagger/bin/tree-tagger.exe
TT.opts: -token -lemma -sgml -pt-with-lemma
TT.params: c:/TreeTagger/lib/english.par
TT.filter.command:
sys.tt.call: type C:\Users\cp\AppData\Local\Temp\Rtmpa81MXG\tokenize133ca792b9c.txt | c:/TreeTagger/bin/tree-tagger.exe c:/TreeTagger/lib/english.par -token -lemma -sgml -pt-with-lemma
The status is OK, but x
is not an object but a simple character matrix, chr [1:5, 1:3]
, whereas without debug it always returns an object of class krp.tagged
, but you know that. There's one behavior that looks suspicious to me: the following warning message always occurs, whether or not the encoding is set.
Assuming '' as encoding for the input file. If the results turn out to be erroneous, check the file for invalid characters, e.g. em.dashes or fancy quotes, and/or consider setting 'encoding' manually.
I couldn't suppress this message the way I do for other warning messages (options(warn=-1)
)
While installing version 0.10-2 from that location I encountered an error:
Warning in install.packages :
package ‘http://cran.us.r-project.org/src/contrib/koRpus_0.10-2.tar.gz’ is not available (for R version 3.3.3)
This was by calling install.packages
because install_version
threw an error
trying URL 'http://cran.us.r-project.org/src/contrib/koRpus_0.10-2.zip'
Error in download.file(url, destfile, method, mode = "wb", ...) :
cannot open URL 'http://cran.us.r-project.org/src/contrib/koRpus_0.10-2.zip'
In addition: Warning message:
In download.file(url, destfile, method, mode = "wb", ...) :
cannot open URL 'http://cran.us.r-project.org/src/contrib/koRpus_0.10-2.zip': HTTP status was '404 Not Found'
Warning in download.packages(pkgs, destdir = tmpd, available = available, :
download of package ‘koRpus’ failed
That's because the extension requested was zip
and on the website it's tar.gz
. I don't know how to change it. It can be installed using Update package in RStudio -> the path is 'https://cran.rstudio.com/bin/windows/contrib/3.3/koRpus_0.10-2.zip'. The output is erroneous as I've posted earlier (still the same message).
Also using this command: install_github("unDocUMeantIt/koRpus") # stable release
. It worked the same way as above (the same error).
The develop version installed by install_github("unDocUMeantIt/koRpus", ref="develop")
is erroneous as well - the message was posted earlier by other user.
> x <- treetag(y, treetagger="manual", format="obj", debug = T,
+ TT.tknz=FALSE, lang="en", TT.options=list(path="c:/TreeTagger", preset="en"))
split=[[:space:]]
ign.comp=-
heuristics=abbr
heur.fix=c("’", "'"), c("’", "'")
sentc.end=., !, ?, ;, :
detect=FALSE, FALSE
clean.raw=
perl=FALSE
stopwords=
stemmer=
TT.tokenizer: koRpus::tokenize()
tempfile: C:\Users\cp\AppData\Local\Temp\Rtmp6rGSsW\tokenize14182f9b40be.txt
file: C:\Users\cp\AppData\Local\Temp\Rtmp6rGSsW\tempTextFromObject1418498667a9.txt
TT.lookup.command:
TT.pre.tagger:
TT.tagger: c:/TreeTagger/bin/tree-tagger.exe
TT.opts: -token -lemma -sgml -pt-with-lemma -quiet
TT.params: c:/TreeTagger/lib/english-utf8.par
TT.filter.command: | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;'
sys.tt.call: type C:\Users\cp\AppData\Local\Temp\Rtmp6rGSsW\tokenize14182f9b40be.txt | c:\TreeTagger\bin\tree-tagger.exe c:\TreeTagger\lib\english-utf8.par -token -lemma -sgml -pt-with-lemma -quiet | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;'
Error: Awww, this should not happen: TreeTagger didn't return any useful data.
This can happen if the local TreeTagger setup is incomplete or different from what presets expected.
You should re-run your command with the option 'debug=TRUE'. That will print all relevant configuration.
Look for a line starting with 'sys.tt.call:' and try to execute the full command following it in a
command line terminal. Do not close this R session in the meantime, as 'debug=TRUE' will keep temporary
files that might be needed.
If running the command after 'sys.tt.call:' does fail, you'll need to fix the TreeTagger setup.
If it does *not* fail but produce a table with proper results, please contact the author!
In addition: Warning message:
running command 'C:\Windows\system32\cmd.exe /c type C:\Users\cp\AppData\Local\Temp\Rtmp6rGSsW\tokenize14182f9b40be.txt | c:\TreeTagger\bin\tree-tagger.exe c:\TreeTagger\lib\english-utf8.par -token -lemma -sgml -pt-with-lemma -quiet | perl -pe 's\\tV[BDHV]\\tVB\;s\IN\\that\\tIN\;'' had status 255
I wonder why using install_version("koRpus", version = "0.06-5", repos = "http://cran.us.r-project.org")
works and the same with version 10-2
doesn't. The URL of older version is http://cran.us.r-project.org/src/contrib/Archive/koRpus/koRpus_0.06-5.tar.gz
so it has proper extension on website. But this is not an issue here probably.
Files in C://TreeTagger/cmd
c:\TreeTagger\cmd\utf8-tokenize.perl
c:\TreeTagger\cmd\tokenize.pl
c:\TreeTagger\cmd\mwl-lookup-greek.perl
c:\TreeTagger\cmd\filter-chunker-output-french.perl
c:\TreeTagger\cmd\filter-chunker-output.perl
c:\TreeTagger\cmd\filter-chunker-output-german.perl
c:\TreeTagger\cmd\mwl-lookup.perl
It's probably version tree-tagger-windows-3.2.zip
but I'm not sure. I've installed in December '16.
I'm having the same error. I tried following the earlier conversation but didn't find something that worked. I'm currently using the develop
version of the package and I'm using \\
instead of /
in my file paths.
> set.kRp.env(TT.cmd = "C:\\TreeTagger\\bin\\tree-tagger.exe", lang = "en")
> file <- "S:\\Jon Lehrfeld Files\\CLA Research\\PT Responses\\CWRA+\\Individual Response Files\\GreenerClassrooms\\GreenerClassrooms_749003.txt"
> tagged.text <- treetag(file = file)
Error: Awww, this should not happen: TreeTagger didn't return any useful data.
This can happen if the local TreeTagger setup is incomplete or different from what presets expected.
You should re-run your command with the option 'debug=TRUE'. That will print all relevant configuration.
Look for a line starting with 'sys.tt.call:' and try to execute the full command following it in a
command line terminal. Do not close this R session in the meantime, as 'debug=TRUE' will keep temporary
files that might be needed.
If running the command after 'sys.tt.call:' does fail, you'll need to fix the TreeTagger setup.
If it does *not* fail but produce a table with proper results, please contact the author!
In addition: Warning message:
running command 'C:\Windows\system32\cmd.exe /c C:\TreeTagger\bin\tree-tagger.exe S:\Jon Lehrfeld Files\CLA Research\PT Responses\CWRA+\Individual Response Files\GreenerClassrooms\GreenerClassrooms_749003.txt' had status 1
My sessionInfo()
:
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] koRpus_0.10-3 data.table_1.10.4 devtools_1.12.0
loaded via a namespace (and not attached):
[1] httr_1.2.1 compiler_3.4.0 R6_2.2.0 tools_3.4.0 withr_1.0.2 curl_2.6 memoise_1.1.0
[8] git2r_0.18.0 digest_0.6.12
ok, let's try to find what's going on here. from the error messages it seems as if TreeTagger isn't returning the expected results, which causes treetag()
to freak out.
@rafaleo : with debug=TRUE
treetag()
is actually supposed to return a simple matrix object; it's the original returned results of TreeTagger, only with tab separators turned into columns. if you can't download prebuilt binaries of koRpus
0.10-2 from CRAN, maybe the mirror is out of sync? there is a link to a package for R 3.3 on the CRAN page. however, you should always be able to force R to download the source package from CRAN and build it on your machine, by setting type="source"
in your install.packages()
call. it just defaults to binary on windows, because windows usually doesn't ship all the build environment for C and FORTRAN code out of the box, but koRpus is pure R code, so you don't need that.
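As a sketch, forcing a source install could look like this. The mirror URL is an assumption (any CRAN mirror should do), and the call is wrapped in a hypothetical function so that nothing gets installed just by pasting the snippet.

```r
# Hypothetical wrapper: build koRpus from source instead of using a Windows binary.
# The repos default is an assumed CRAN mirror; replace it with your preferred one.
install_korpus_source <- function(repos = "https://cloud.r-project.org") {
  install.packages("koRpus", type = "source", repos = repos)
}

# Run this line manually when you actually want to install:
# install_korpus_source()
```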
from the files in your TreeTagger directory it looks like you have both files of previous installations and new ones, can that be? just a hunch: is the file C:/TreeTagger/lib/english-utf8.par
also present?
@jmlehrfeld : what do your TreeTagger directories contain? is there cmd/utf8-tokenize.perl
and lib/english-utf8.par
?
the key info that i need is the actual error that TreeTagger throws when it fails on the command line (which is usually all done silently in the background, but somehow fails now). the error message beginning with "awww..." explains how you can run that actual command causing problems yourself, while the R session is still running (so the temporary files are still in place). doing that should trigger the particular error and give us a more elaborate understanding of what's not working. in other words: we need to find a way to replicate the problem that TreeTagger seems to have. trying to run that command manually is the most straight forward way to get there.
even better would be to run it step by step, because it is actually a cascade of commands, like this (run in cmd.exe, replace the placeholders in angle brackets with the paths printed by treetag() in debug mode):
type C:\Users\cp\AppData\Local\Temp\<R temp dir>\<tokenize temp file>.txt
type C:\Users\cp\AppData\Local\Temp\<R temp dir>\<tokenize temp file>.txt | c:\TreeTagger\bin\tree-tagger.exe c:\TreeTagger\lib\english-utf8.par -token -lemma -sgml -pt-with-lemma
type C:\Users\cp\AppData\Local\Temp\<R temp dir>\<tokenize temp file>.txt | c:\TreeTagger\bin\tree-tagger.exe c:\TreeTagger\lib\english-utf8.par -token -lemma -sgml -pt-with-lemma | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;'
and with the -quiet flag:
type C:\Users\cp\AppData\Local\Temp\<R temp dir>\<tokenize temp file>.txt | c:\TreeTagger\bin\tree-tagger.exe c:\TreeTagger\lib\english-utf8.par -token -lemma -sgml -pt-with-lemma -quiet | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;'
at one of those calls you should get an error message.
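The cascade can also be assembled in R, which avoids copy and paste mistakes. pipeline_stages() is a hypothetical helper for this illustration; the paths and options are copied from the debug output posted earlier in this thread, and the temp file only exists while that particular R session is running.

```r
# Hypothetical helper: build the cumulative pipeline stages as command strings.
pipeline_stages <- function(token.file, tagger, params, opts, filter) {
  read.cmd <- paste("type", token.file)
  tag.cmd  <- paste(read.cmd, "|", tagger, params, opts)
  full.cmd <- paste(tag.cmd, "|", filter)
  c(read = read.cmd, tag = tag.cmd, filter = full.cmd)
}

stages <- pipeline_stages(
  token.file = "C:\\Users\\cp\\AppData\\Local\\Temp\\Rtmp6rGSsW\\tokenize14182f9b40be.txt",
  tagger     = "c:\\TreeTagger\\bin\\tree-tagger.exe",
  params     = "c:\\TreeTagger\\lib\\english-utf8.par",
  opts       = "-token -lemma -sgml -pt-with-lemma",
  filter     = "perl -pe 's/\\tV[BDHV]/\\tVB/;s/IN\\/that/\\tIN/;'"
)
writeLines(stages)

# On Windows, each stage could then be run in turn to see where output stops:
# for (cmd in stages) print(shell(cmd, intern = TRUE))
```

The first stage that prints nothing, or prints an error, is the one to look at more closely.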
@unDocUMeantIt Thanks for responding. When I use the debug
option and follow the instructions, I get the attached screenshot:
Additionally, my TreeTagger directory does contain the files you mentioned. The full list is:
C:/TreeTagger/
    INSTALL.txt
    README.txt
C:/TreeTagger/bin/
    chunk-english.bat
    chunk-french.bat
    chunk-german.bat
    tag-dutch.bat
    tag-english.bat
    tag-french.bat
    tag-german.bat
    tag-italian.bat
    tag-spanish.bat
    train-tree-tagger.exe
    tree-tagger.exe
    tree-tagger-flush.exe
C:/TreeTagger/cmd/
    filter-chunker-output.perl
    filter-chunker-output-french.perl
    filter-chunker-output-german.perl
    mwl-lookup.perl
    mwl-lookup-greek.perl
    tokenize.pl
    utf8-tokenize.perl
C:/TreeTagger/lib/
    dutch-abbreviations
    english-abbreviations
    english-utf8.par
    french-abbreviations
    german-abbreviations
    italian-abbreviations
    spanish-abbreviations
    spanish-mwls
Finally, I'm not sure where to get the paths from tokenize in debug mode. The only info that debug mode gives me is the system call, which provides paths to tree-tagger.exe and to the text file I'm trying to tokenize.
@jmlehrfeld ah, now i see: your call is incomplete because you only defined the path to the *.exe file but nothing else. please try again with these settings instead:
set.kRp.env(TT.cmd="manual", TT.options=list(path="C:/TreeTagger", preset="en"), lang="en")
# or
set.kRp.env(TT.cmd="manual", TT.options=list(path="C:\\TreeTagger", preset="en"), lang="en")
does at least one of those work?
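Before retrying, it may help to confirm that the files the debug output refers to are really where the path setting points. A minimal sketch, assuming the two files named in the debug output of this thread (bin/tree-tagger.exe and lib/english-utf8.par); `check_tt_setup()` is a hypothetical helper, not a koRpus function.

```r
# Hypothetical helper: check whether the tagger binary and the English
# parameter file exist below a given TreeTagger root directory.
check_tt_setup <- function(path) {
  needed <- c(
    tagger = file.path(path, "bin", "tree-tagger.exe"),
    params = file.path(path, "lib", "english-utf8.par")
  )
  # Returns a named logical vector, one entry per expected file.
  vapply(needed, file.exists, logical(1))
}

check_tt_setup("C:/TreeTagger")
```

Any `FALSE` entry would suggest that the path is wrong or that the installation is incomplete.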
I think so! I set my env as you specified, called the treetag function (without the debug argument), and got no warning or error messages back. I guess I'm all set then. Thanks so much!
I have tested this using koRpus ‘0.10.2’ on a Win 7 machine running R 3.4.0 and 3.3.1 and got no error. I have Win 10 at work, I'll try tomorrow. If path normalization is the issue, the normalizePath command is nice:
normalizePath(file.path("C:","Users"))
I see I didn't read the last comments here and was late to the party :-)
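The normalizePath suggestion can be sketched a little further. This is an illustration only: `unify_slashes()` is a hypothetical helper, and the mixed-separator path is just an example value.

```r
# Hypothetical helper: replace every backslash in a path with a forward slash.
unify_slashes <- function(path) chartr("\\", "/", path)

# Example path that mixes both separators.
mixed <- "C:\\Treetagger/bin/tree-tagger.exe"
unify_slashes(mixed)

# normalizePath() can also return forward slashes on Windows;
# mustWork = FALSE avoids an error if the path does not exist.
normalizePath(mixed, winslash = "/", mustWork = FALSE)
```

Note that the `winslash` argument only has an effect on Windows.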
seems to be resolved for the moment.
I had the same problem. Nothing above worked for me. Finally, I solved the problem by updating the version of R.
I have a similar problem, tried the aforementioned methods but I wasn't able to solve it. When I try to run the following code in RStudio, I get the following error.
> system.time(
+ lemma_tagged <- treetag(lemma_unique$word_clean, treetagger="manual",
+ format="obj",debug = TRUE, TT.tknz=FALSE , encoding = "UTF-8",lang="en",
+ TT.options=list(
+ path="C:\\Treetagger", preset="en")
+ )
+ )
split=[[:space:]]
ign.comp=-
heuristics=abbr
heur.fix=c("’", "'"), c("’", "'")
sentc.end=., !, ?, ;, :
detect=FALSE, FALSE
clean.raw=
perl=FALSE
stopwords=
stemmer=
TT.tokenizer: koRpus::tokenize()
tempfile: C:\Users\EYARBA~1\AppData\Local\Temp\Rtmp2hIXBH\tokenize2c787d7c6eb1.txt
file: C:\Users\EYARBA~1\AppData\Local\Temp\Rtmp2hIXBH\tempTextFromObject2c7867b979a2.txt
TT.lookup.command:
TT.pre.tagger:
TT.tagger: C:\Treetagger/bin/tree-tagger.exe
TT.opts: -token -lemma -sgml -pt-with-lemma -quiet
TT.params: C:/Treetagger/lib/english-utf8.par
TT.filter.command: | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;'
sys.tt.call: type C:\Users\EYARBA~1\AppData\Local\Temp\Rtmp2hIXBH\tokenize2c787d7c6eb1.txt | C:\Treetagger\bin\tree-tagger.exe C:\Treetagger\lib\english-utf8.par -token -lemma -sgml -pt-with-lemma -quiet | perl -pe 's/\tV[BDHV]/\tVB/;s/IN\/that/\tIN/;'
Error: Awww, this should not happen: TreeTagger didn't return any useful data.
This can happen if the local TreeTagger setup is incomplete or different from what presets expected.
You should re-run your command with the option 'debug=TRUE'. That will print all relevant configuration.
Look for a line starting with 'sys.tt.call:' and try to execute the full command following it in a
command line terminal. Do not close this R session in the meantime, as 'debug=TRUE' will keep temporary
files that might be needed.
If running the command after 'sys.tt.call:' does fail, you'll need to fix the TreeTagger setup.
If it does *not* fail but produce a table with proper results, please contact the author!
In addition: Warning message:
In system(cmd, intern = intern, wait = wait | intern, show.output.on.console = wait, :
running command 'C:\windows\system32\cmd.exe /c type C:\Users\EYARBA~1\AppData\Local\Temp\Rtmp2hIXBH\tokenize2c787d7c6eb1.txt | C:\Treetagger\bin\tree-tagger.exe C:\Treetagger\lib\english-utf8.par -token -lemma -sgml -pt-with-lemma -quiet | perl -pe 's\\tV[BDHV]\\tVB\;s\IN\\that\\tIN\;'' had status 255
Assuming 'UTF-8' as encoding for the input file. If the results turn out to be erroneous, check the file for invalid characters, e.g. em.dashes or fancy quotes, and/or consider setting 'encoding' manually.
Timing stopped at: 0.63 0.02 0.84
However, on the command prompt, the same thing works. I just can't get it to work in RStudio. Any ideas why this might be happening? Btw, I'm relatively new to this stuff so I'm sorry if I'm missing something obvious :)
I think I have all the necessary files in the working directory since I can get some results on the cmd. To me, it seems like everything is working but just not on the platform that I want to use. Thanks a lot!
@eyyarbasi:
I have a similar problem, tried the aforementioned methods but I wasn't able to solve it. When I try to run the following code in RStudio, I get the following error.
could you please provide some more information on your system setup? which versions of R and koRpus are you using?
just a shot in the dark: can you try to start a plain R session (without RStudio) and run your R code from there? i would like to check if this issue is somehow related to the environment set up by RStudio (i don't use RStudio, it's all RKWard here ;)).
@eyyarbasi does the lemma_tagged object that you tried to create hold any data at all?
Thanks for the reply! RStudio is v1.2.1335 and R is 3.6.0.
Here's my sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] installr_0.21.3 koRpus.lang.en_0.1-3 koRpus_0.11-5 sylly_0.1-5 SnowballC_0.6.0 topicmodels_0.2-8
[7] ldatuning_1.0.0 tidytext_0.2.1 forcats_0.4.0 stringr_1.4.0 dplyr_0.8.2 purrr_0.3.2
[13] readr_1.3.1 tidyr_0.8.3 tibble_2.1.3 ggplot2_3.2.0 tidyverse_1.2.1 magrittr_1.5
[19] readxl_1.3.1
loaded via a namespace (and not attached):
[1] modeltools_0.2-22 tidyselect_0.2.5 slam_0.1-45 NLP_0.2-0 haven_2.1.0 lattice_0.20-38 vctrs_0.1.0
[8] colorspace_1.4-1 generics_0.0.2 stats4_3.6.0 utf8_1.1.4 rlang_0.4.0 pillar_1.4.2 glue_1.3.1
[15] withr_2.1.2 modelr_0.1.4 munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0 rvest_0.3.4 tm_0.7-6
[22] parallel_3.6.0 fansi_0.4.0 sylly.en_0.1-3 broom_0.5.2 tokenizers_0.2.1 Rcpp_1.0.1 scales_1.0.0
[29] backports_1.1.4 jsonlite_1.6 hms_0.4.2 stringi_1.4.3 grid_3.6.0 cli_1.1.0 tools_3.6.0
[36] lazyeval_0.2.2 janeaustenr_0.1.5 zeallot_0.1.0 crayon_1.3.4 pkgconfig_2.0.2 Matrix_1.2-17 data.table_1.12.2
[43] xml2_1.2.0 lubridate_1.7.4 assertthat_0.2.1 httr_1.4.0 rstudioapi_0.10 R6_2.4.0 nlme_3.1-139
[50] compiler_3.6.0
And your intuition was right! It's an issue with RStudio. The object lemma_tagged doesn't even get created in RStudio, but the code works as a simple R script without RStudio. Somehow treetag() freaks out in RStudio. Open for further suggestions. Thanks again!
And your intuition was right! It's an issue with RStudio. the object lemma_tagged doesn't even get created in RStudio but the code works as a simple R script without RStudio. Somehow treetag() freaks out in RStudio.
that's interesting -- and a bit puzzling...
Open for further suggestions.
during a workshop i gave recently one windows user ran into a problem with access permissions, i.e., his code would only run if he started RStudio with admin rights. IIRC, the application was unable to run the TreeTagger executable otherwise. running userland software as admin is not a solution, but if you could at least check once if this makes the problem go away, i'd get a clue where the actual issue lies.
one other hypothesis i have is RStudio's handling of system()/shell() calls. its terminal implementation seems to offer to run a windows version of bash, and i wonder if that could also be the case for shell() calls, because it would render all file paths useless. so it would probably be interesting to have a look at the return values of shell() for the command you successfully ran in cmd.exe. this call seems to fail in RStudio (but not in plain R). if it does, you should try to run it in small units to see at which point in the call chain it actually fails, like:
# note: backslashes must be doubled inside R string literals
(shell("type C:\\Users\\EYARBA~1\\AppData\\Local\\Temp\\Rtmp2hIXBH\\tokenize2c787d7c6eb1.txt", translate=TRUE, ignore.stderr=TRUE, intern=TRUE))
(shell("type C:\\Users\\EYARBA~1\\AppData\\Local\\Temp\\Rtmp2hIXBH\\tokenize2c787d7c6eb1.txt | C:\\Treetagger\\bin\\tree-tagger.exe C:\\Treetagger\\lib\\english-utf8.par -token -lemma -sgml -pt-with-lemma -quiet", translate=TRUE, ignore.stderr=TRUE, intern=TRUE))
(shell("type C:\\Users\\EYARBA~1\\AppData\\Local\\Temp\\Rtmp2hIXBH\\tokenize2c787d7c6eb1.txt | C:\\Treetagger\\bin\\tree-tagger.exe C:\\Treetagger\\lib\\english-utf8.par -token -lemma -sgml -pt-with-lemma -quiet | perl -pe 's/\\tV[BDHV]/\\tVB/;s/IN\\/that/\\tIN/;'", translate=TRUE, ignore.stderr=TRUE, intern=TRUE))
update the temporary file name, of course ;) this should tell us if it already fails accessing the text file, running TreeTagger.exe or perl.
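To see which unit fails, the exit status can be captured together with the output. The following is a minimal sketch under stated assumptions: `run_stage()` is a hypothetical helper, it relies on `system(..., intern = TRUE)` attaching a "status" attribute when a command exits with a non-zero status, and it uses `shell()` on Windows because pipes need a command interpreter there.

```r
# Hypothetical helper: run one command and return its output and exit status.
run_stage <- function(cmd) {
  out <- suppressWarnings(
    if (.Platform$OS.type == "windows") {
      shell(cmd, intern = TRUE)   # shell() only exists on Windows
    } else {
      system(cmd, intern = TRUE)
    }
  )
  status <- attr(out, "status")
  list(
    output = as.character(out),
    # No "status" attribute means the command exited with status 0.
    status = if (is.null(status)) 0L else as.integer(status)
  )
}

# Harmless placeholder command; replace it with one unit of the call chain.
res <- run_stage("echo hello")
res$status
```

Calling `run_stage()` on each unit in turn would show the first one that returns a non-zero status.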
Hi, I encountered the same error as eyyarbasi, and I'm also using windows. I tried running the code in base R gui and with administrative privileges but the error persists. I similarly could run treetag from command line. Has there been a solution now? Thank you!
Hi, I encountered the same error as eyyarbasi, and I'm also using windows. I tried running the code in base R gui and with administrative privileges but the error persists.
in that case it is probably not the same issue. since this issue is already closed, could you please open a new one including info on your system setup (installed software packages with version numbers) and example code to reproduce the error?
thank you!
Does this debug=TRUE help you to understand what is the cause of the execution error?