stickeritis / sticker2

Further developed as SyntaxDot: https://github.com/tensordot/syntaxdot

generic question #98

Open jwijffels opened 3 years ago

jwijffels commented 3 years ago

Hello, I've been working lately on R wrappers for udpipe (https://github.com/bnosac/udpipe), which work fine. I'd be interested in writing an R wrapper for this Rust implementation as well, since UDPipe 3 is taking longer than I expected.

  1. Would you be interested in such a thing as well, and
  2. do you think it is at all possible to write an R wrapper (I'm mostly familiar with C++) without depending on cmake?
  3. Is sentencepiece a hard requirement (I've written an R wrapper for sentencepiece as well: https://github.com/bnosac/sentencepiece), or can it easily be factored out of this Rust implementation?
  4. Is model building possible on other UD datasets? (I can't find any docs.)

danieldk commented 3 years ago

1. Would you be interested in such a thing as well, and

It would be nice to have bindings for more languages, however...

2. do you think it is at all possible to write an R wrapper (I'm mostly familiar with C++) without depending on cmake?

We do not expose a C interface yet, which would probably be the easiest way to bind sticker2 from other languages. A full C interface would require quite a bit of wrapping of native Rust data structures, which is not that great (unless C is the target language) and could easily break Rust's safety guarantees. So the route I am exploring is providing a very small C interface and using protobuf to pass data structures across the language boundary, through the ffi-support crate:

https://hacks.mozilla.org/2019/04/crossing-the-rust-ffi-frontier-with-protocol-buffers/

I have written some initial code, but nothing that is ready for consumption yet.
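
To illustrate the general shape of such a "very small C interface", here is a minimal sketch. It is not sticker2 code and it does not use the ffi-support crate or protobuf: the names `ByteBuffer`, `sketch_annotate` and `sketch_free` are hypothetical, and the uppercase transformation is only a stand-in for "decode the serialized request, run the tagger, encode the serialized response". The point is that only raw bytes and a length cross the boundary, so no Rust data structure has to be wrapped for the foreign language.

```rust
use std::slice;

/// Hypothetical byte buffer handed across the C boundary.
/// The bytes stand in for a serialized (e.g. protobuf-encoded) message.
#[repr(C)]
pub struct ByteBuffer {
    pub data: *mut u8,
    pub len: usize,
}

/// Hypothetical entry point: takes serialized input bytes and returns
/// serialized output bytes owned by the Rust side.
/// A real export would additionally carry a `no_mangle` attribute so that
/// the symbol name is stable for C callers.
pub unsafe extern "C" fn sketch_annotate(input: *const u8, len: usize) -> ByteBuffer {
    let bytes: &[u8] = if input.is_null() {
        &[]
    } else {
        unsafe { slice::from_raw_parts(input, len) }
    };

    // Placeholder for: decode request -> annotate -> encode response.
    let output: Vec<u8> = bytes.iter().map(|b| b.to_ascii_uppercase()).collect();

    // Hand ownership of the allocation to the caller.
    let mut boxed = output.into_boxed_slice();
    let buf = ByteBuffer {
        data: boxed.as_mut_ptr(),
        len: boxed.len(),
    };
    std::mem::forget(boxed);
    buf
}

/// Hypothetical destructor: the caller must return every buffer exactly once
/// so that it is freed by the same allocator that created it.
pub unsafe extern "C" fn sketch_free(buf: ByteBuffer) {
    if !buf.data.is_null() {
        let raw = std::ptr::slice_from_raw_parts_mut(buf.data, buf.len);
        drop(unsafe { Box::from_raw(raw) });
    }
}
```

A binding in R, C++ or any other language would then only need to serialize a request, call the entry point, copy the returned bytes, and call the free function.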

3. Is sentencepiece a hard requirement (I've written an R wrapper for sentencepiece as well: https://github.com/bnosac/sentencepiece), or can it easily be factored out of this Rust implementation?

Currently it is, but it would indeed be relatively easy to make it an optional feature. I am not sure I am a fan of doing that, though, since quite a few models use sentencepiece now.

4. Is model building possible on other UD datasets? (I can't find any docs.)

Definitely, as long as token spans are removed (they are not yet supported by the conllu crate), it is possible to train on any UD treebank.
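
As a rough sketch of that preprocessing step, the function below drops range lines from CoNLL-U text before training. It assumes that "token spans" refers to multiword-token lines, whose first column is an ID range such as `1-2`; that reading, and the helper names `strip_token_spans` and `is_range_id`, are assumptions for illustration and are not part of sticker2 or the conllu crate.

```rust
/// Returns the CoNLL-U text without multiword-token range lines
/// (lines whose ID column looks like `1-2`). Comments, blank lines and
/// regular token lines are kept unchanged.
pub fn strip_token_spans(conllu: &str) -> String {
    let mut out = String::with_capacity(conllu.len());
    for line in conllu.lines() {
        let is_span = !line.starts_with('#')
            && line
                .split('\t')
                .next()
                .map_or(false, |id| is_range_id(id));
        if !is_span {
            out.push_str(line);
            out.push('\n');
        }
    }
    out
}

/// True if `id` has the form `<digits>-<digits>`.
fn is_range_id(id: &str) -> bool {
    match id.split_once('-') {
        Some((start, end)) => {
            !start.is_empty()
                && !end.is_empty()
                && start.chars().all(|c| c.is_ascii_digit())
                && end.chars().all(|c| c.is_ascii_digit())
        }
        None => false,
    }
}
```

Applied to a treebank file read into a string, this keeps the individual syntactic words and removes only the range line that groups them.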

jwijffels commented 3 years ago

Thanks for the feedback.

About 2.

I'm not sure yet how your models are stored. R has extensive support for protobuf, through either the RProtoBuf package or the protolite package. I know it is possible to write an R wrapper around Rust code, as shown in https://jeroen.github.io/erum2018 and https://github.com/r-rust/hellorust. But my question was mainly whether cmake is a hard dependency, since CRAN does not support cmake. I arrived at this question while digging into the code and realising that sentencepiece requires cmake to be installed, so I was wondering whether this external dependency could be made optional, and whether other dependencies also require cmake.

About 3. My point was mainly to have sentencepiece as a soft dependency, so that I could use the sentencepiece R wrapper directly for the tokenisation part instead.

About 4. Are there docs available for training models?