ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

Remove requirement for C++11 #26

Closed statspro1 closed 7 years ago

statspro1 commented 7 years ago

```r
devtools::install_github("ropensci/tokenizers")
```

```
Downloading GitHub repo ropensci/tokenizers@master
from URL https://api.github.com/repos/ropensci/tokenizers/zipball/master
Installing tokenizers
'/usr/lib64/R/bin/R' --no-site-file --no-environ --no-save --no-restore \
  --quiet CMD INSTALL \
  '/tmp/RtmpbgvQJi/devtoolsb60b681e3651/ropensci-tokenizers-4db5f6c' \
  --library='/home/R/x86_64-redhat-linux-gnu-library/3.2' \
  --install-tests
```

```r
R.Version()
```

```
$platform
[1] "x86_64-redhat-linux-gnu"

$arch
[1] "x86_64"

$os
[1] "linux-gnu"

$system
[1] "x86_64, linux-gnu"

$status
[1] ""

$major
[1] "3"

$minor
[1] "2.3"
```

```r
sessionInfo()
```

```
R version 3.2.3 (2015-12-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] devtools_1.12.0

loaded via a namespace (and not attached):
[1] httr_1.2.1    R6_2.2.0      tools_3.2.3   withr_1.0.2   curl_2.2
[6] memoise_1.0.0 knitr_1.11    git2r_0.15.0  digest_0.6.10
```

Thanks!

dselivanov commented 7 years ago

I'm pretty sure you need a newer C++ compiler, at least gcc/g++ 4.7+.

2016-11-15 21:43 GMT+04:00 statspro1 notifications@github.com:

```
devtools::install_github("ropensci/tokenizers")

Downloading GitHub repo ropensci/tokenizers@master
from URL https://api.github.com/repos/ropensci/tokenizers/zipball/master
Installing tokenizers
'/usr/lib64/R/bin/R' --no-site-file --no-environ --no-save --no-restore --quiet CMD INSTALL \
  '/tmp/RtmpbgvQJi/devtoolsb60b681e3651/ropensci-tokenizers-4db5f6c' \
  --library='/home/mtouzani/R/x86_64-redhat-linux-gnu-library/3.2' --install-tests

* installing *source* package ‘tokenizers’ ...
** libs
g++ -m64 -std=c++0x -I/usr/include/R -I/usr/local/include -I"/home/mtouzani/R/x86_64-redhat-linux-gnu-library/3.2/Rcpp/include" -fpic -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -c RcppExports.cpp -o RcppExports.o
g++ -m64 -std=c++0x -I/usr/include/R -I/usr/local/include -I"/home/mtouzani/R/x86_64-redhat-linux-gnu-library/3.2/Rcpp/include" -fpic -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -c shingle_ngrams.cpp -o shingle_ngrams.o
shingle_ngrams.cpp: In function ‘Rcpp::CharacterVector generate_ngrams_internal(Rcpp::CharacterVector, uint32_t, uint32_t, std::tr1::unordered_set<std::string>&, std::vector<std::string>&, std::string)’:
shingle_ngrams.cpp:28: error: expected initializer before ‘:’ token
shingle_ngrams.cpp:35: error: expected primary-expression before ‘ngram_out_len’
shingle_ngrams.cpp:35: error: expected ‘)’ before ‘ngram_out_len’
shingle_ngrams.cpp:35: error: ‘ngram_out_len’ was not declared in this scope
shingle_ngrams.cpp:36: error: ‘ngram_out_len’ was not declared in this scope
shingle_ngrams.cpp:44: error: ‘len’ was not declared in this scope
shingle_ngrams.cpp: In function ‘Rcpp::ListOf<Rcpp::Vector<16, Rcpp::PreserveStorage> > generate_ngrams_batch(Rcpp::ListOf<const Rcpp::Vector<16, Rcpp::PreserveStorage> >, uint32_t, uint32_t, Rcpp::CharacterVector, Rcpp::String)’:
shingle_ngrams.cpp:80: error: expected initializer before ‘:’ token
shingle_ngrams.cpp:83: error: expected primary-expression before ‘for’
shingle_ngrams.cpp:83: error: expected ‘;’ before ‘for’
shingle_ngrams.cpp:83: error: expected primary-expression before ‘for’
shingle_ngrams.cpp:83: error: expected ‘)’ before ‘for’
make: *** [shingle_ngrams.o] Error 1
ERROR: compilation failed for package ‘tokenizers’
* removing ‘/home/mtouzani/R/x86_64-redhat-linux-gnu-library/3.2/tokenizers’
Error: Command failed (1)
```



statspro1 commented 7 years ago

```
$ gcc --version
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```

Can anyone help with upgrade instructions?

Thanks,

dselivanov commented 7 years ago

I would suggest googling something like "centos 6.8 gcc upgrade". Let us know if you run into problems.


statspro1 commented 7 years ago

I did upgrade CentOS, but I am having the same issue.

```
$ cat /etc/centos-release
CentOS release 6.8 (Final)

$ gcc --version
gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)
```

Thanks,

dselivanov commented 7 years ago

@statspro1 What does `g++ --version` show? And is your `~/.R/Makevars` standard?

statspro1 commented 7 years ago

@dselivanov

```
$ g++ --version
g++ (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)
```

My `~/.R/Makevars`:

```
CXXFLAGS = -O3 -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic $(LTO) # set_by_rstan
R_XTRA_CPPFLAGS = -I$(R_INCLUDE_DIR) # set_by_rstan
```

dselivanov commented 7 years ago

Hm, looks strange. I noticed that it compiles with `-std=c++0x`, not `-std=c++11`. Can you try setting `PKG_CXXFLAGS = -std=c++11` in your Makevars?
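A minimal `~/.R/Makevars` along those lines might look like this (a sketch only; it assumes the compiler on your `PATH` actually accepts `-std=c++11`):

```make
# Force the C++11 standard flag when R compiles package C++ sources
PKG_CXXFLAGS = -std=c++11
```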

statspro1 commented 7 years ago

@dselivanov

```
cc1plus: error: unrecognized command line option "-std=c++11"
make: *** [RcppExports.o] Error 1
ERROR: compilation failed for package ‘tokenizers’
```

statspro1 commented 7 years ago

@dselivanov

When I run `system('g++ -v')` from R, I get gcc version 4.4.7 20120313. Not sure I understand.

```
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC)
```

statspro1 commented 7 years ago

Also:

```r
Sys.setenv("PKG_CPPFLAGS" = "-std=c++11")
devtools::install_github("ropensci/tokenizers")
```

```
Downloading GitHub repo ropensci/tokenizers@master
from URL https://api.github.com/repos/ropensci/tokenizers/zipball/master
Error: Could not find build tools necessary to build tokenizers
```

dselivanov commented 7 years ago

@statspro1 I have no idea why this happens. I suggest asking on Stack Overflow.

mo58 commented 7 years ago

When I run `install.packages("tokenizers")` in R, I get the following output.

`sessionInfo()` gives:

```
R version 3.3.2 (2016-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Oracle Linux Server 6.8
```

I have gcc version 4.8.2 and binutils 2.23.52.

Everything is up to date, I think. Can someone help me solve this problem?

mo58 commented 7 years ago

I have fixed this :] Just be careful with the gcc version when you install it (I did it with 4.9.1). A user can have a different version than sudo/root. To avoid this, set the new gcc version explicitly, for example with `scl enable devtoolset-3 bash`, or put that in your `.bash_profile` file. Be careful: RStudio can use another gcc version. Just install the tokenizers package from the terminal and it will sync automatically with RStudio.

That is how I fixed it; perhaps there are more formal ways of doing this. Hopefully it will solve your problem.
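In shell terms, the devtoolset route described above looks roughly like this (illustrative only; `devtoolset-3` is an example, and the available collections depend on what is installed on your system):

```
# Start a subshell in which the Software Collections gcc shadows the system one
scl enable devtoolset-3 bash

# Confirm which compilers are now first on PATH
gcc --version
g++ --version

# Install from this same shell so R picks up the newer toolchain
Rscript -e 'install.packages("tokenizers")'
```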

wooopenr commented 7 years ago

What would be the correct syntax to install tokenizers, then? I have Red Hat Linux 6.8 and gcc-c++ 4.4.7.

lmullen commented 7 years ago

@wooopenr I think you need gcc 4.7 or greater.

Ironholds commented 7 years ago

Would avoiding C++11 code help with this? (I don't see where it, strictly speaking, needs C++11, and this would widen the package's potential user base somewhat.)

If that's an acceptable outcome, I'm more than happy to make the changes myself; this clears up an issue with tidytext, which I'm doing some tweaking of.

Ironholds commented 7 years ago

(FWIW as a C++ dork with some suddenly-freed-up-time I'm more than happy to help out with the compiled component of the package as a general thing!)

lmullen commented 7 years ago

@Ironholds Yes, undoubtedly removing the C++11 requirement would help some people.

It's very generous of you to offer your expertise. The compiled portions were originally contributed by @dselivanov but I'd be fine with you removing the C++11 requirement as long as performance doesn't take a major hit. He may want to weigh in.

My other major item on my wish list, if you have time and are so inclined, is to get a proper skipgram tokenizer as pointed out in #24.

Ironholds commented 7 years ago

Ooh fun. Okay, I'll take a stab at it and do my best (I can actually see a couple of points where it's designed less-than-efficiently and I could improve speed!)

Ironholds commented 7 years ago

@wooopenr Can you try `devtools::install_github("ironholds/tokenizers")` and see if that works? I've reworked the code as C++98 rather than C++11, which should widen the range of compilers it works on. If it does, I'll push it up to the main repo.

@lmullen In terms of speed this actually makes it slightly faster; not because C++98 is faster, but because I spotted a place where a vector (made of contiguous memory) was being repeatedly pushed back to, and appending to a C++ vector is a bad idea just as appending to an R vector is. Luckily there's a data structure called a deque, which is designed non-contiguously and so does a better job. I switched them out; result:

```
# Old version
Unit: microseconds
                          expr     min      lq     mean  median       uq     max neval
 {     tokenize_ngrams(song) } 111.161 113.803 121.0693 115.237 126.7895 307.687   100

# New version
Unit: microseconds
                          expr     min       lq     mean   median       uq     max neval
 {     tokenize_ngrams(song) } 103.959 104.9385 113.7413 106.0265 112.2165 252.147   100
```

Tinkering has given me some ideas for a couple of additional niceties which I'll run by you tomorrow or on Saturday.

dselivanov commented 7 years ago

> but because I pointed a spot where a vector (comprised of continguous memory) was being repeatedly pushed back against, and appending to a C++ vector is a bad idea just as appending to an R vector is.

@Ironholds, it is far less bad than in R. C++ memory allocation is very different from R's append, especially when we reuse this vector as a buffer: `clear()` just sets the size to 0 but doesn't shrink the memory, so on subsequent calls there should be no new allocations. But yes, a deque can be a little more efficient in this case.

Ironholds commented 7 years ago

Yeah, that's fair, although presumably there will be allocations if different input vectors are of different lengths, no? That is, if I pass in a list of 3 vectors of 1, 3 and 6 elements, there will be 1, 2 and 3 allocations respectively, `clear()` aside.

lmullen commented 7 years ago

@Ironholds Thanks for taking this on. I think it will be helpful to not depend on C++11, which has been the biggest source of bug reports (even if the problem is just people's older compilers).

In testing your version on a smallish corpus, I'm seeing a slight decrease in performance. I'm running the attached script with Rscript --vanilla. Can you please try running this benchmark and let me know what you think? My results are in the attached file, but your results may vary.

ngram-benchmark.zip

Ironholds commented 7 years ago

Ooh, I didn't know you could attach files to GitHub comments. Neat!

I'm seeing a slight difference at my end, but only slight.

C++11:

```
Change in memory: 187 MB
Benchmark:
Unit: seconds
                                                             expr     min       lq     mean   median       uq      max neval
 out <- tokenize_ngrams(docs, n = 5, stopwords = stopwords("en")) 2.10606 2.190927 2.372687 2.307381 2.533542 2.888489    20
```

C++98:

```
Change in memory: 187 MB
Benchmark:
Unit: seconds
                                                             expr      min       lq     mean   median       uq      max neval
 out <- tokenize_ngrams(docs, n = 5, stopwords = stopwords("en")) 2.263016 2.296217 2.408799 2.369919 2.463877 2.835295    20
```

So actually slightly faster worst-case at my end. Mind you, it's 20 iterations on 2 different CPUs and shouldn't be trusted to be a uniform experience, so I wouldn't take the better worst-case in my test as a guarantee.

lmullen commented 7 years ago

@Ironholds I think the difference here is trivial. So I'll be glad to merge your version if you send a pull request.

Just a note: glad to see that you added yourself in the DESCRIPTION, but there is a small error in your Authors@R.

lmullen commented 7 years ago

And as for testing on an old compiler, I'll see if I can do it on one of the servers at my university, or in Docker.

Ironholds commented 7 years ago

:thumbsup: Yeah, I spotted the error this morning; it should now be fixed in my fork. I'll throw it in as a PR now, but FWIW I'd definitely advocate testing before merging, given that it's an open issue rather than an enhancement.

lmullen commented 7 years ago

Hmm. The local sysadmin moved us from CentOS to Ubuntu at my request, and apparently the university computing cluster upgraded to gcc 5.2. So I no longer have ready access to out-of-date compilers. :-) I'll test the PR on Docker, though.

lmullen commented 7 years ago

I've tested this on CentOS 6.8 with R 3.3.3 and g++ 4.4.7. Compilation of the current CRAN version (which requires C++11) failed, but compilation of the current master branch was successful. So I'm closing this issue. Thanks very much, @Ironholds.

Ironholds commented 7 years ago

yay!