ropensci / textworkshop17

Text Workshop at the London School of Economics, April 2017

Wrapping C/C++ libraries #5

Open jeroen opened 7 years ago

jeroen commented 7 years ago

If people know of any useful C/C++ libs that would be nice to wrap into an R package, I am happy to assist with that!

dselivanov commented 7 years ago

@jeroen I have several in my list.

  1. Compact Language Detector 2. Has zero dependencies. Should not be too hard to wrap.
  2. POS tagger. I believe the "lookahead" algorithm looks promising and is easily extendable to many languages. I'm aware of 2 repos: cltk/lapos and brunexgeek/nlp-tools.
  3. bigartm, a flexible non-Bayesian framework for topic modeling that generalizes LDA and PLSA.

kbenoit commented 7 years ago

Here's a parser and tagger based in C++ that could be wrapped in an R package: http://www.cs.cmu.edu/~ark/TurboParser/

benmarwick commented 7 years ago

I'd be keen to see Dynamic Topic Models (https://github.com/blei-lab/dtm) available in R. It's a major library by David Blei for analysing how topics change over time, an extension of LDA.

lmullen commented 7 years ago

👍 to @benmarwick's suggestion of Dynamic Topic Models.

dselivanov commented 7 years ago

Added bigartm, a non-Bayesian framework for topic modeling. Online, parallel, asynchronous, very flexible. Actively developed.

jeroen commented 7 years ago

For those still following this thread: I have wrapped up Compact Language Detector 2 into an R package. Give it a go and let me know if it works: https://github.com/ropensci/cld2#readme

dselivanov commented 7 years ago

Thanks @jeroen, will do.

kbenoit commented 7 years ago

Awesome! I'm running some tests now.

jeroen commented 7 years ago

OK, cld2 is on CRAN now, will do a v1.1 next week. Let's see what else we got here :)

jeroen commented 7 years ago

I had a look at dtm but unfortunately the code is too broken to wrap in R. It has all kinds of compiler warnings and doesn't build on Windows at all. It also no longer seems to be actively maintained.

jeroen commented 7 years ago

The cld3 package is now on CRAN as well. Would be fun to see someone who is into text compare cld2 and cld3 on real data.

kbenoit commented 7 years ago

How about unRTF (https://www.gnu.org/software/unrtf/), as in https://github.com/kbenoit/readtext/issues/90?

jeroen commented 7 years ago

OK here is a wrapper for unrtf: https://github.com/ropensci/unrtf