What do we want to build?

rust-ml / discussion

A space to discuss the future of the ML ecosystem in Rust.

108 stars 3 forks source link

What do we want to build? #1

Open LukeMathWalker opened 5 years ago

LukeMathWalker commented 5 years ago

Welcome!

I created this repository as a discussion hub for the ML ecosystem in Rust, "following" a talk I gave at the Rust meetup in London (slides).

I do believe that Rust has great potential in this area, but to fully realize this potential we need to provide building blocks: we need to tackle those shared challenges that, once removed, will enable more and more people to just come to Rust and build what they want to build.

The three building blocks I do see as fundamental for an ML ecosystem are:

n-dimensional arrays;
dataframes;
an ML model interface.

I have spent the last year, when it comes to open-source contributions, enhancing n-dimensional arrays: direct contributions to ndarray, statistical routines on top of it (ndarray-stats) and tutorials to help people to get into the Rust scientific ecosystem from Python, Julia or R. I do believe that ndarray is in more than a good shape when it comes to fulfil NumPy's role in the Rust ecosystem.

There is now movement as well when it comes to dataframes - a discussion is taking place at https://github.com/rust-dataframe/discussion/issues/1 to explore use cases and potential designs. (The idea of opening this repository comes directly from this experiment of community-led design for dataframes).

Given that one of the two data structures that are usually consumed by ML models is ready (n-dimensional arrays) and the other one is baking (dataframes) I think it's time to start thinking about what to do with the ML-specific piece.

I don't want to steer the debate too much with the opening post (I'll chip in once the discussion starts), but the questions I'd like to see tackled are:

what use-cases could make Rust shine in the ML ecosystem?
what are the basic capabilities that have to be built to enable the usage of Rust for ML workloads?
how should we structure such a project? A core library with few traits and a set of separate crates tackling different aspects? A large battery-included scikit-learn equivalent?
why do you want to use Rust for ML?

Kibouo commented 5 years ago

I want to note that, while it works great, https://github.com/twistedfall/opencv-rust is not particularly user-friendly or 'clean' in rust terms.

Maybe we could have a look at it?

flo-dhalluin commented 5 years ago

I think the use case that coud make Rust shine, is deployment. Currently the de-facto "mainstream" stack is python based ( scikit-learn, np, pandas + your DL framework of choice, TensorFlow, Torch .. ). It shines for fast prototyping, because python, but it sucks for industrialization ( and deployment ), because ... python. I really think rust would do great. In that area, I kinda like TensorFlow serving, but it forces you to have a separate service ( that you call with their protobuf/RPC). So :

nice conventions for training/inference
standard ways of serializing, loading models, and expose them to more "entreprisey" stacks, either with some kind of FFI ( for ex. jvm <-> jni ..) or RPC.
with all the goodies required for a industrial setups ( monitoring, robustness, ease of deployment ...).

jbowles commented 5 years ago

I'm currently building a large project with rust (i mention it here: https://users.rust-lang.org/t/interest-for-nlp-in-rust/15331/9), where I am doing the data engineering in rust (lots of string metrics). [tldr; I found lots of disparate projects with 50% of what I needed for string metrics but instead rolled my own, trying to incorporate previous work and give credit] I want to feed the feature vectors to Julia to experiment with what I want to use for classification and modelling, and then I'll want to be able to use rust for inference/classification etc.... I had to pause development for business reasons, but I'm starting again: one of my biggest issues was not ML related but finding a nice pattern for parallel file download (seems like it should be simple, but maybe I'm spoiled by go's simplicity lol).

From this real-world project point of view, as well from my time spent thinking in the abstract and surveying the ML ecosystem in rust (about a year), I would think that a focus on data engineering in general and serving models is the way to go (this also seems to be a widely shared sentiment). In a practical sense I would like to see rust jobs for data engineer and machine learning engineer... that is, the bookends of a typical data science project; serving the data and serving the model.

That is, targeting software developers, infrastructure, computational math, and data people. Trying to convince research scientists to use Rust would be wasted effort; for most of these people software is a secondary skill and so they need something easy to learn, dynamically typed, with a REPL... I've watched this play out in the Python/R/Matlab versus Julia world... and while IMO Julia has a lot to offer current python/r/matlab devs and is similar enough to those languages, trying to get that group of people to use Julia is not easy, i can't imagine what it'd be like proposing Rust.

Here are some challenges I see:

Dataframes: figuring out what to do with missing data is a challenge (i watched julia community struggle with that this last year).
LinearAlgebra: ndarray, nalgebra are both active projects... is there duplicated effort? (there are others as well).
Rust types more friendly for math: I've seen the power in Julia of being able to specify AbstractArray as a type, or have a Real as a type, that allows you to build generic functions that accept a vector of float32 or float64.
Swift: google and numerous well-known people (chris lattner[LLVM, Swift], jeremy howard[fast.ai]) have put their support behind swift for tensorflow. IMO swift has really long way to go. But for Rust, tackling areas where the swift-for-tf project are not focusing on is good.
Support for Julia: integration with Python is a necessity; but if there is a competitor for the research-scientist in the Python world it is Julia and I'd imagine keeping an eye on playing well with Julia could be a benefit. Competition here is hard to forecast and julia/rust are on really different ends of the spectrum; while julia pushes solving the "two language" problem i see no problem using rust and julia in project; I doubt competition is an issue, not like with Swift.

jbowles commented 5 years ago

I do believe that ndarray is in more than a good shape when it comes to fulfil NumPy's role in the Rust ecosystem.

Really looking forward to digging into ndarray. Though I've had a slight delay, I'm writing up ndarray examples for the Grokking Deep Learning book where andrew trask introduces deep learning with only numpy. He's expressed interest and welcomed the examples... :)

soaxelbrooke commented 5 years ago

A standardized tokenization implementation!

Tokenization fills the role of "turn the text into fixed vectors" that you'd feed into standard models. As an NLP practitioner and Rust user, tokenization is an incredibly important step in the pipeline, a big barrier to new people trying to apply NLP, and a place where lots of small bugs creep in due to non-standard implementations that take forever to find. Having a standard implementation for the simpler tokenization methods (like regex matching) would make NLP problems much more approachable in Rust.

DhruvDh commented 5 years ago

One part of machine learning where Rust could shine right now is simulation for Reinforcement learning.

For instance if I training an agent to play blackjack, the biggest bottleneck here is the "playing" blackjack over and over by the agent to collect enough data for training.

Rayon and Actix could be used to create fast and performant game "environments" now, without need for an established ML ecosystem.

yngtodd commented 5 years ago

I agree with @DhruvDh, using Rust to simulate environments for RL agents would be great.

Having something akin to OpenAI's gym interface would be really nice. Many RL researchers are going to still want to use Python and all the associated deep learning libraries. So, I would love to see RL environments rendered in Rust that could be interfaced with both Python and Rust for agents.

Edit: I imagine that algorithms like Monte Carlo Tree Search would be really useful if they were written in Rust. I would not want to wait on Python to handle that bit.

masonk commented 5 years ago

if I training an agent to play blackjack, the biggest bottleneck here is the "playing" blackjack over and over by the agent to collect enough data for training.

Along these lines, I am working on a hobby project (link), which does this. It isn't quite ready for even an alpha release yet, but I am in the final stages of cleaning up the API with the intent to publish it.

masonk commented 5 years ago

Things Rust definitely needs:

const generics 16-bit floats GATs (for efficient, non-copying iterators)

Things that we might want but I'm not sure: Standard Inference + Train traits Standard data frames trait

kazimuth commented 5 years ago

I've been thinking of building a rust deep learning / GPU compute library on top of the TVM framework for a while now. I think it could address a lot of the things @flo-dhalluin is talking about. TVM's an amazing project that's currently flying a bit under the radar. It's an open source deep learning compiler - it compiles deep neural nets / large array operations to run on the GPU (or on OpenCL, or FPGA, or TPU, or WebGL...). You define an AST of computations via its API, and it spits out a small (<5mb) shared library containing just the operations you wanted, on whatever acceleration framework and target platform you want.

It currently has a working Rust runtime library, which lets you call a compiled model from Rust. It integrates with ndarray, and will let you e.g. take in an ndarray::Array, move it to a GPU, run whatever numerical operations you want on it, and get the result back as an ndarray::Array again.

That's pretty neat, and I don't think it would be too hard to build some really cool tools on top of it. My dream is something like:

lib.rs:

// a crate based on tvm
// `cargo build` will (by default) download + checksum a prebuilt TVM library
// that this links to, so that you don't have to wait for a whole compiler to compile.
// The download will only be ~50mb -- way smaller and easier than lots of other deep
// learning frameworks. It will also support running code on things besides cuda!
// The output binary won't need to link the compiler (by default) and will therefore be
// only a few megabytes.
extern crate tvmrs;

// a procedural macro that converts Rust code to Relay IR.
// Relay IR is TVM's high-level IR for defining neural networks / computation chains,
// sorta like a tensorflow Graph. It's also not too dissimilar to Rust.
// The macro will compile the IR with TVM at build-time, and link the resulting artifacts
// to this rust library.
tvmrs::accelerate! {

  // stateless operation
  fn relu_downsample(x: Tensor[c, n, h, w]) -> Tensor[c, n, h/2, w/2] {
     relu(downsample(x))
  }

  // stateful operation
  struct Block<oc> {
    conv: Conv2d<3,3,oc>,
    elu: Elu
  }
  impl Op for Block<oc> {
    fn run(self, input: Tensor[c, n, h, w]) -> Tensor[oc, n, h, w] {
       self.elu(self.conv(input))
    }
  }

  fn swap_channels(x: Tensor[2, n, h, w]) -> Tensor[2, n, h, w] {
    // a low-level tensor operation defined as a TVM Tensor expression.
    let out = compute!(x.shape, |cc, nn, hh, ww| x[(c + 1) % 2, nn, hh, ww)]);
    out
  }

  // a sequential network container.
  sequential! Network {
     #[opencl] Block<3,3,5>, // run on opencl
     #[opencl] relu_downsample,
     #[opencl] Conv2d::new(1,1,2),
     #[rust] debug,    // call a normal rust function
     #[cpu] swap_channels // run this part on CPU to maximize throughput
  }

  // Compute a derivative of the network.
  // Relay IR is designed to be differentiable.
  derivative! NetworkDerivative (Network);
}

// a normal rust function
fn debug(x: Tensor) {
  ...
}

train.rs:

fn main() {
  tvmrs::training_loop! {
    net: Network,
    dnet: NetworkDerivative,
    epochs: 37,
    training_data: dataset! {...},
    valid_data: dataset! {...},
    ...
  }
}

run.rs:

fn main() {
   let input = tvmrs::ndarray_from_stdin();
   let output = Network::load_params("params.bin").run(input);
   println!("{:?}", output);
}

(Further reading: Introduction to Relay, TVM Tensor expressions)

All of this is of course pending mountains of bikeshedding, i have no idea what the final API will look like.

One of the nifty things here is that this isn't limited to deep learning models. TVM can handle pretty much any algorithm made of large array operations. So if you wanted to run your SVM on GPU, you can do that pretty easily!

Steps to take here:

Talk to the TVM people and see what they think of all this. We could do this work under their umbrella or in a fresh project.
Write Rust bindings to the TVM compiler (instead of just the runtime). TVM is written in C++ but is designed to be easy to bind, a lot of the work has already been done here.
Design an API like my sketch above that wraps the bindings in some way that makes them easy to use for training + deployment.
Build up cargo tooling to allow e.g. prebuilt binary downloads, TVM's auto-tuner support, etc.
Beef up TVM's autodifferentiation support. TVM can differentiate Relay IR, but a lot of derivatives aren't actually implemented yet. We could also roll our own autodifferentiation system and just use TVM for compilation; I'd prefer to avoid duplicating work tho.
Start writing non-deep-learning algorithms with this system as well, to kick the tires.

If people are interested in this implementation path we could throw a repo together and start work.

I mainly want this because I'm don't want to be stuck using Python and Cuda all the time for my deep learning research :)))

koute commented 5 years ago

A few months ago I have started a crate of my own for deep learning. My goal is to have a library which:

Supports both inference and training.
Supports the most common deep neural network architectures.
Is GPU accelerated.
Doesn't use CUDA.
Supports every mainstream platform (Linux, MacOS, iOS, Android, Windows, WebAssembly) and hardware (AMD, NVIDIA, Intel GPUs) with a single codebase, and uses the same kernels for consistent results.
Is written in pure Rust so that it's trivial to cross-compile.
Has a simple to use Keras-like API.
Is small and simple enough that it can be reasonably understood and tested end-to-end. (Otherwise you risk situations like e.g. with TensorFlow where for two whole versions their dropout layer was completely broken.)

It's currently totally useless. Right now I'm in the process of adding a Vulkan backend (I have a few thousand lines of work-in-progress code on my disk which I've not pushed yet.); once I finish that in a few weeks I plan further build it up so that I can train CIFAR-10 up to at least ~90% accuracy, add some model import/export functionality (probably from/to the ONNX format) and only then it will be actually usable for something practical.

Some people would call this a waste of time and effort, and, well, I do agree that it would be probably more productive to not do this completely from scratch as I'm doing (e.g. by using TVM as kazimuth said), but I don't really care - I'm just trying to scratch my own itch.

DhruvDh commented 5 years ago

@kazimuth while I love the snippets you've shown here, a lot of my love for Rust exists because of the all the compile time checks the compiler does, and the wonderfully easy to comprehend error messages. I feel that if one is using Rust just as a way to compose, and run functionality defined in other languages then there isn't much to gain here. Might as well just use Python.

And TVM looks more like a tool for deploying neural nets rather than training them; which is very useful but I would prefer to do both in Rust.

There's also tch-rs - bindings to PyTorch's libtorch.

Something else that is also interesting is dual_num, which as I best understand it is some fancy math that might eventually let us to automatic differentiation.

DhruvDh commented 5 years ago

@koute the long term road-map is amazing but I don't get why bother putting effort into the tensorflow backend. Admittedly I don't have enough know-how to imagine what a native backend would look like and the kind of work it would need.

koute commented 5 years ago

@DhruvDh The TensorFlow backend will be most likely removed in the future. Currently it is there for a few reasons:

I wanted to quickly get something working to experiment with, and to be able to first work on the general interface of the library (e.g. defining the neural network graph, getting data in and out, etc.)
I can use it to write a comprehensive test suite and then cross-check that with my own backend. ML algorithms are very hard to write correctly, so I want the extra insurance not only that my algorithms match with what I have on paper, but also with another widely used framework. (Although from the amount of bugs I've encountered when dealing with TensorFlow it'd probably would have been better to pick a different framework...)

jbowles commented 5 years ago

Some cool stuff coming to light. Is anyone familiar with work presented at c4ML? https://www.c4ml.org/ I don't think any of the presentations were using Rust... but certainly this is a space Rust could be competitive with. With that in mind, are any of the Rust compiler team interested in ML?

Here are some references to work being done in Swift and Julia (Note, Rust, Swift, Julia were all top of the list for google's tensorflow project that eventually became swift-for-tf). (e.g., automatic differentiation, differentiable programming... https://github.com/tensorflow/swift/blob/master/docs/AutomaticDifferentiation.md, https://juliacomputing.com/blog/2019/02/19/growing-a-compiler.html). Swift MLIR (https://drive.google.com/file/d/1hUeAJXcAXwz82RXA5VtO5ZoH8cVQhrOK/view) and Julia Zygote (https://www.julialang.org/blog/2018/12/ml-language-compiler).

I don't know of any projects in Rust along these lines ^^ ... of course, they are also all funded (google, and julia computing).

DhruvDh commented 5 years ago

@koute yeah makes sense.

@jbowles There was this internals thread about Automatic Differentiation here.

jbowles commented 5 years ago

@ehsanmok may be interested in this discussion ^^

thanks @DhruvDh

kazimuth commented 5 years ago

@DhruvDh that's a fair criticism, but really that's a problem whenever you want to use a hardware accelerator. You're always going to be calling into a language with different semantics from the host. Using Rust for glue gives you type-safety, performance, and lovely tooling. e.g. it's dead-simple to write a parallel image preprocessing pipeline in Rust, whereas with python you need a load of hacks (FFI, multiprocessing) to get acceptable performance. Also, you're free to define new low-level operations in Rust; users shouldn't ever need to use another language :)

And yeah, currently TVM's publicity is oriented around deployment, because that's where there's a gap in the python ecosystem. There's no reason their compiler wouldn't work for training too, though.

@jbowles I've worked with some of those projects; see my comment, I think we can borrow some of that work.

also CC @nhynes

kazimuth commented 5 years ago

Other thought: I wonder what interactive scientific programming would look like in Rust? There's a jupyter kernel but i'm not sure how usable it is.

It might be that rust should just be used for high-performance kernels and stuff, and be easy to call from other languages like you lay out in your presentation @LukeMathWalker.

LukeMathWalker commented 5 years ago

Wow, there really is a lurking interest 😛 This is just great.

The discussion has explored several different directions, I'd like to give more details on what I do envision (and where that need comes from).

I strongly align with @flo-dhalluin: I think Rust can really shine in delivering an end-to-end production workflow. Rust has incredible potential when it comes to the beginning (data pipelines, preprocessing) and the end (performance web servers, using multiple protocols) of the ML workflow. Establishing early on a way to get the whole workflow is going to be a key prerequisite for adoption - filling a painful gap in the ML ecosystem at large, delivering a top-notch experience with great tooling.

Tackling this challenge requires the building blocks I mentioned (n-dimensional arrays, dataframes) and some others that have been brought up (e.g. running code on different types of hardware, easy interop, reading/writing to a lot of different formats).

Certain capabilities can be borrowed from other languages, others we should probably port and develop natively in Rust (a sufficiently large zoo of preprocessing techniques and standard models).

While I do understand the interest in the Deep Learning area, I don't think it's realistic to kickstart an effort to make Rust a primary language for NN development: we should definitely be able to deploy and run NN models (the TVM project is an excellent example here), but I don't think we would be adding a lot of value by chasing huge projects like TensorFlow or PyTorch. There are a lot of things in the TensorFlow ecosystem, instead, that are extremely interesting (e.g. TensorFlow serving) but they do end up locking you into TensorFlow itself: if we could replicate those conveniences in a framework-agnostic fashion, we could definitely capture a need in that space.

Summing it up, the minimum working prototype that I have in mind to show off what Rust can do goes along these lines:

Huge datasets as input;
Heavy-weight, massively parallel data preprocessing pipeline (e.g. NLP or images would be good candidates);
Very simple model to be trained on top of the pipeline output;
Configuration-based deployment of the serialized model using Rocket: you just define very basic things in a YAML file (e.g. HTTP vs gRPC, monitoring, logging, etc.) and you get a fully working web server that serves your model. This will have to rely on a sufficiently general Model trait.

If you could manage to get the experience right, I am quite sure the interest in Rust for this kind of use cases would skyrocket.

koute commented 5 years ago

While I do understand the interest in the Deep Learning area, I don't think it's realistic to kickstart an effort to make Rust a primary language for NN development: we should definitely be able to deploy and run NN models (the TVM project is an excellent example here), but I don't think we would be adding a lot of value by chasing huge projects like TensorFlow or PyTorch.

I agree, however, you're looking at it from a perspective of a data scientist who wants to fill in the gaps of their existing workflow and augment their ML pipeline with Rust. I'm looking at it from a perspective of a Rust developer who just wants to augment their existing application with a little ML without going through the hoops of exporting their data, processing it through a mainstream ML framework, and serializing it back so that it can be used by the application again.

In other words - my personal interest lies in not filling a gap in the existing ML ecosystem (although that's also most certainly worthwhile!), but in filling a gap in the Rust ecosystem by creating value for existing Rust users (and perhaps the users of other languages) so that they could take advantage of ML in a plug-and-play fashion with minimal amount of fuss. (Which is why things like wide hardware and platform support, simplicity, lack of non-Rust dependencies so it's easy to build and cross-compile, etc. is important.)

jbowles commented 5 years ago

I can volunteer work to rust-ml for tokenizers, string distance metrics, and/or onehot encoding package. I've already been working on the first two as I have real-world projects that need these so I can double up. As far as a onehot package I'm interested to learn more how efficient onehot encoding is done under the hood and have a use for the package as well.

string distance metrics (jaro, jaro-winkler, ngram, qgram, ratcliff-obershelp)
tokenizers: for one, rust is awesome for writing tokenizers. But IME it's kinda hard to write general tokenizers since their use is often highly dependent on per-project needs (for example I wrote this [https://github.com/jbowles/nlpt-tkz] and used it for a project and its not found much use since). Or if there were consensus on using something ntlk tokenizers as a guide I don't mind working on those either. If there is a need for things like the examples below then I can cherry pick these out of my current project (a hotel and product matching thing) for a rust-ml package... these were written specifically for string comparison and not typical tokenization found in nlp pipelines but it would not be to hard adapt them to accept and return a specific data type...

#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn on_word_splitter() {
        fn word_split(c: char) -> bool {
            match c {
                '\n' | '|' | '-' => true,
                _ => false,
            }
        }
        let res = TokenizerNaive::word_splitter("HelLo|tHere", &word_split);
        assert_eq!(res, vec!["HelLo", "tHere"])
    }
    #[test]
    fn on_tokens_lower_filter() {
        fn tokens_filter(c: char) -> bool {
            match c {
                '-' | '|' | '*' | ')' | '(' | '&' => true,
                _ => false,
            }
        }
        let res = TokenizerNaive::tokens_lower_with_filter("|HelLo tHere", &tokens_filter);
        assert_eq!(res, " hello there");

        let res1 = TokenizerNaive::tokens_lower_with_filter("HelLo|tHere", &tokens_filter);
        assert_eq!(res1, "hello there");

        let res2 = TokenizerNaive::tokens_lower_with_filter("HelLo tHere", &tokens_filter);
        assert_eq!(res2, "hello there");

        let res6 =
            TokenizerNaive::tokens_lower_with_filter("****HelLo *() $& )(tH*ere", &tokens_filter);
        assert_eq!(res6, "    hello     $    th ere");
    }

    #[test]
    fn on_pre_process() {
        let res = TokenizerNaive::pre_process("Hotel & Ristorante Bellora");
        assert_eq!(res, "hotel ristorante bellora");

        let res1 = TokenizerNaive::pre_process("Auténtico Hotel");
        assert_eq!(res1, "auténtico hotel");

        let res2 = TokenizerNaive::pre_process("Residence Chalet de l'Adonis");
        assert_eq!(res2, "residence chalet de l adonis");

        let res6 = TokenizerNaive::pre_process("HOTEL EXCELSIOR");
        assert_eq!(res6, "hotel excelsior");

        let res6 = TokenizerNaive::pre_process("Kotedzai Trys pusys,Pylimo ");
        assert_eq!(res6, "kotedzai trys pusys pylimo");

        let res6 = TokenizerNaive::pre_process("Inbursa Cancún Las Américas");
        assert_eq!(res6, "inbursa cancún las américas");
    }

    #[test]
    fn on_tokens_alphanumeric() {
        let res3 = TokenizerNaive::tokens_alphanumeric("|HelLo tHere");
        assert_eq!(res3, " HelLo tHere");

        let res4 = TokenizerNaive::tokens_alphanumeric("HelLo|tHere");
        assert_eq!(res4, "HelLo tHere");

        let res5 = TokenizerNaive::tokens_alphanumeric("HelLo * & )(tHere");
        assert_eq!(res5, "HelLo       tHere");
    }

    #[test]
    fn on_tokens_lower() {
        let res = TokenizerNaive::tokens_lower_str("HelLo tHerE");
        assert_eq!(res, "hello there")
    }

    #[test]
    fn on_tokens_simple() {
        assert_eq!(
            TokenizerNaive::chars("hello there"),
            ["h", "e", "l", "l", "o", " ", "t", "h", "e", "r", "e"]
        );
        assert_eq!(
            TokenizerNaive::chars("hello there").concat(),
            String::from("hello there")
        )
    }

    #[test]
    fn on_similarity_identity() {
        assert_eq!(TokenCmp::new_from_str("hello", "hello").similarity(), 100);
    }

    #[test]
    fn on_similarity_high() {
        assert_eq!(TokenCmp::new_from_str("hello b", "hello").similarity(), 83);
        assert_eq!(
            TokenCmp::new_from_str("this is a test", "this is a test!").similarity(),
            97
        );
        assert_eq!(
            TokenCmp::new_from_str("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear").similarity(),
            91
        );
    }
    #[test]
    fn on_token_sequencer() {
        let an = AlphaNumericTokenizer;
        let one = an.sequencer("Marriot &Beaches Resort|").join(" ");
        let two = an.sequencer("Marriot& Beaches^ Resort").join(" ");
        assert_eq!(one, two);
    }
    #[test]
    fn on_token_sort() {
        let s1 = "Marriot Beaches Resort foo";
        let s2 = "Beaches Resort Marriot bar";
        assert_eq!(TokenCmp::new_from_str(s1, s2).similarity(), 62);
        let sim = token_sort(s1, s2, &TokenCmp::new_sort, &TokenCmp::similarity);
        assert_eq!(sim, 87);
    }
    #[test]
    fn on_token_sort_again() {
        let s1 = "great is scala";
        let s2 = "java is great";
        assert_eq!(TokenCmp::new_from_str(s1, s2).similarity(), 37);
        let sim = token_sort(s1, s2, &TokenCmp::new_sort_join, &TokenCmp::similarity);
        assert_eq!(sim, 81);
    }
    #[test]
    fn on_amstel_match_for_nate() {
        let sabre = "INTERCONTINENTAL AMSTEL AMS";
        let ean = "InterContinental Amstel Amsterdam";
        assert_eq!(TokenCmp::new_from_str(sabre, ean).similarity(), 20);
        assert_eq!(TokenCmp::new_from_str(sabre, ean).partial_similarity(), 14);
        assert_eq!(
            token_sort(sabre, ean, &TokenCmp::new_sort, &TokenCmp::similarity),
            79
        );

        assert_eq!(
            token_sort(
                sabre,
                ean,
                &TokenCmp::new_sort,
                &TokenCmp::partial_similarity
            ),
            78
        );
    }

    #[test]
    fn on_partial_similarity_identity() {
        let t = TokenCmp::new_from_str("hello", "hello");
        assert_eq!(t.partial_similarity(), 100);
    }

    #[test]
    fn on_partial_similarity_high() {
        let t = TokenCmp::new_from_str("hello b", "hello");
        assert_eq!(t.partial_similarity(), 100);
    }

    #[test]
    fn on_similarity_and_whitespace_difference() {
        let t1 = TokenCmp::new_from_str("hello bar", "hello");
        let t2 = TokenCmp::new_from_str("hellobar", "hello");
        let sim1 = t1.similarity();
        let sim2 = t2.similarity();
        assert_ne!(sim1, sim2);
        assert!(sim1 < sim2);
        assert_eq!(sim1, 71);
        assert_eq!(sim2, 77);
    }

kazimuth commented 5 years ago

Summing it up, the minimum working prototype that I have in mind to show off what Rust can do goes along these lines:

This is a very cool idea :)

Question: what would a general Model trait look like? I think the challenge is striking a balance between generality and specificity; you don't want to tie people down too much, but you need some sort of understanding of what you're doing to be able to use it in a general context.

We might want to brainstorm a list of goals / requirements for the design, before we start writing code. Maybe in another issue?

@jbowles

But IME it's kinda hard to write general tokenizers since their use is often highly dependent on per-project needs

Do you think it would be possible to do something with a trait-based approach here? Like, the rust pattern of building up a stack of combinators, you get Parallel<Lower<UnicodeSplitter<...>>> and it ends up near-handwritten performance? I don't know much about NLP, so forgive me if I'm missing stuff here.

jbowles commented 5 years ago

@kazimuth yes i think that would be the way; allow the user to compose a tokenizer.

The TokenizerNaive i showed above is naive specifically because it is not trait based; it does some text normalization for the user, allowing the user to build and pass in a function for char matching/filtering.

I do have a trait-based approach (ideas i got from this Text-Analysis-in-Rust-Tokenization) in my current project but those are in service of tokenizing for comparing token similarity.

With full-blown tokenization an API should support allowing a user to compose the various things they need (e.g., a char filter, normalizing text, etc...) like your example. The hard part I'm really referring to is the output of the tokenization. For example,

I have functionsequencer that returns Vec of tokens

Vec<std::borrow::Cow<'a, str>>;

First, I'm new enough to rust to still not totally understand all the consequences of using Cow :) ... and also instead of a Vec<> it likely needs to return a different kind of vector that plays well with onehot or word embeddings, etc... If you are familiar with python scikit-learn think of the "Vectorizers" it has for turning arrays of strings into arrays of numbers [IMO this is always the hardest part of NLP]

texts = ["foo bar", "bar foo zaz", "did bar", "zaz bar jazz", "good jazz zaxx"]

tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
features = tfidf.fit_transform(texts)
pd.DataFrame(features.todense(), columns=tfidf.get_feature_names())

d_vtz = CountVectorizer()
print(d_vtz.fit_transform(texts))

h_vtz = HashingVectorizer()
print(h_vtz.fit_transform(texts)

It seems what one would want in rust is a tokenizer that returns vectors of tokens that can just be "plugged in" to lots of different ways to turn text into numbers.

davechallis commented 5 years ago

I'd like to see more individual specialised components that form part of an ML pipeline, rather than anything monolithic attempting to implement too much at once.

This gives Rust a chance to build up its ML strengths, slowly replacing individual parts of a mature ML pipeline. Having e.g. python bindings to those components would also allow them to start getting used and proving benefit, without needing a 100% switch to Rust.

Modules/components I'd love to see:

text vectorisation (e.g. fast/parallel versions of count/tfidf vectorisers)
dimensionality reduction (e.g. PCA, tSNE)
scaling/normalisation
hyperparameter optimisation
data stucture interop (e.g. to/from pandas/arrow/parquet etc.)

LukeMathWalker commented 5 years ago

Question: what would a general Model trait look like? I think the challenge is striking a balance between generality and specificity; you don't want to tie people down too much, but you need some sort of understanding of what you're doing to be able to use it in a general context.

We might want to brainstorm a list of goals / requirements for the design, before we start writing code. Maybe in another issue?

An article I found very interesting, from 2 years ago, is this one: http://athemathmo.github.io/2016/09/07/typesystem-machine-learning.html It's from the author of rusty-machine if I am not mistaken. We should definitely brainstorm a list of goals and requirements here before starting to write code out. It would also be worthwhile to see what features in the lang team pipeline could be useful for us.

I agree, however, you're looking at it from a perspective of a data scientist who wants to fill in the gaps of their existing workflow and augment their ML pipeline with Rust. I'm looking at it from a perspective of a Rust developer who just wants to augment their existing application with a little ML without going through the hoops of exporting their data, processing it through a mainstream ML framework, and serializing it back so that it can be used by the application again.

In other words - my personal interest lies in not filling a gap in the existing ML ecosystem (although that's also most certainly worthwhile!), but in filling a gap in the Rust ecosystem by creating value for existing Rust users (and perhaps the users of other languages) so that they could take advantage of ML in a plug-and-play fashion with minimal amount of fuss. (Which is why things like wide hardware and platform support, simplicity, lack of non-Rust dependencies so it's easy to build and cross-compile, etc. is important.)

My loyalty is divided, to say the least: I'd love to be able to host 100% of my workflow in Rust because I strongly believe in the language potential and in the potential of the tooling around it. I wouldn't say though that our goals are at odds @koute : it's just a matter of deciding in which order we should be tackling challenges. A good set of crates for preprocessing and deployment is going to be just as necessary for a purely Rust-based workflow as they are for a mixed-language workflow. Once they are established, we can then shift focus on porting more and more models and algorithm to Rust. I wholeheartedly agree with @davechallis:

I'd like to see more individual specialised components that form part of an ML pipeline, rather than anything monolithic attempting to implement too much at once. This gives Rust a chance to build up its ML strengths, slowly replacing individual parts of a mature ML pipeline. Having e.g. python bindings to those components would also allow them to start getting used and proving benefit, without needing a 100% switch to Rust.

Thanks to the strong packaging and distribution story provided by Rust, the effort of flashing out algorithms and preprocessing tools can be extremely distributed: once there is a set of agreed-upon traits as interfaces, we can leverage the influx of people who are fascinated and allow them to be productive and develop new crates without having to worry about the fundamentals. That's why I think it's strategical to have a pure Rust implementation of DataFrames and n-dimensional arrays, for instance. We don't need a huge monolith like SciPy or Scikit-learn.

swfsql commented 5 years ago

@kazimuth that Jupyter kernel is usable; I'm starting to learn ai with it in here: https://github.com/swfsql/deep-learning-coursera (by oxidizing python code) (Currently, only the first assignment is in Rust)

jbowles commented 5 years ago

This gives Rust a chance to build up its ML strengths, slowly replacing individual parts of a mature ML pipeline. Having e.g. python bindings to those components would also allow them to start getting used and proving benefit, without needing a 100% switch to Rust. 💯

Seems to me one of the more difficult problems doing this in rust is getting common types and traits and types defined for different packages to interface with. If I'm not mistaken @LukeMathWalker you seem to point towards using Ndarray as basically numpy. I'm all on board with that.

What if there were something like a core package that defined some of the core traits and structs and types? I can see lots of pros/cons for doing that.

kazimuth commented 5 years ago

@jbowles RE: tokenizer API Hm, I see the challenge there. Well, for one thing you should probably use Iterators in between operations instead of Vecs, or design a similar trait to Iterator; that should reduce the problem of having to have big buffers between each transformation. Then I think the path would be to pick-and-choose input requirements for each operation, and then operations output whatever they want. E.g. HashVectorizer takes impl Iterator<impl Deref<str>>, and then users can pass in Iterator<&str>, Iterator<String>, Iterator<Cow>, whatever.

This gets at a broader problem with a simple function-y Model(Input) -> Output trait; it works for in-memory datasets, but once your dataset is large enough that you want to start streaming / distributing work over multiple machines, the abstraction sorta breaks down. We could instead do something graphy, where you just have nodes that ingest and spit out streams of data... but then we'll have to work with something graphy, with nodes that ingest and spit out streams of data :P

It might make sense to just start implementing without a core crate of traits, and once we've smacked into enough walls in the design space, we can figure out what the interfaces to our systems tend to look like, and retrofit a core design around that.

nhynes commented 5 years ago

Although I'm not sure that Rust is going to usurp Python and C++ as the de-facto ML programming model, it's definitely a worthy goal. Along those lines, I think that flashlight (and the underlying arrayfire library) has an interface that we might want to emulate.

In any case, the real key feature of PyTorch and JAX is the expressivity of Python backed by a high-performance JIT tensor compiler. I'm pretty sure it's possible to do something similar in Rust by writing a compiler plugin that tracks the types+ops of ndarrays and provides the data to a JIT compiler.

Maybe something like

#[jit]
fn mlp(
    data: &Array<2, f32>,
    weights: Vec<&Array<2, f32>,
    labels: &Array<1, u8>
) -> f32 {
    let fc1 = data.dot(weights[0]); // fn dot -> Array<D, T, Op=gemm>
    Array::pointwise_max(0, fc1) // Array<D, T, Op=Max<0, fc1>> 
}

This is just a sketch and depends on how cost generics actually pan out, but the idea is that a compiler plugin can find the #[jit] functions and either pre-compile them or add them to a runtime cache and replace the original definition with a call into the cache. This is not quite dissimilar to TVM's hybrid mode. We probably don't want to write a tensor compiler, so we could offload that to TVM and link in the static library.

LukeMathWalker commented 5 years ago

This gets at a broader problem with a simple function-y Model(Input) -> Output trait; it works for in-memory datasets, but once your dataset is large enough that you want to start streaming / distributing work over multiple machines, the abstraction sorta breaks down. We could instead do something graphy, where you just have nodes that ingest and spit out streams of data... but then we'll have to work with something graphy, with nodes that ingest and spit out streams of data :P

It might make sense to just start implementing without a core crate of traits, and once we've smacked into enough walls in the design space, we can figure out what the interfaces to our systems tend to look like, and retrofit a core design around that.

The best abstractions are always the result of some concrete experiments in my opinion. I do foresee a constant iteration cycle between trait design and algorithm implementation, to progressively tweak it and make it fit for purpose. On the other side though, it is not necessary to have a god-like trait - it's perfectly feasible to have a nuanced collections of traits mapping to different capabilities that do not necessarily have to be implemented by all models.

rth commented 5 years ago

I can volunteer work to rust-ml for tokenizers, string distance metrics, and/or onehot encoding package.

@jbowles I have implemented the first two in https://github.com/rth/vtext, if we can find some way to collaborate to avoid duplicate effort, that would be great. There are likely a number of things that could be improved.

Modules/components I'd love to see: text vectorisation (e.g. fast/parallel versions of count/tfidf vectorisers)

A preliminary version for vectorization is implemented there as well, the parallel vectorization is WIP in https://github.com/rth/vtext/pull/20

for one thing you should probably use Iterators in between operations instead of Vecs, or design a similar trait to Iterator

Yes, I went with Iterators as well. Maybe we can open a separate discussion about this, there are some related thoughts in https://github.com/rth/vtext/issues/21

Having e.g. python bindings to those components would also allow them to start getting used and proving benefit, without needing a 100% switch to Rust.

I also see this as the most likely path to adoption, if only because of the number of existing ML users.

danieldk commented 5 years ago

I'd like to see more individual specialised components that form part of an ML pipeline, rather than anything monolithic attempting to implement too much at once.

I strongly agree:

Large monolithic frameworks tend to get outdated pretty quickly. NLTK, which has been mentioned here is a good example. While it looks nice on the surface, most components are nowhere near the state-of-the-art in NLP. It mostly encodes what NLP was like in the mid-2000s. I don't mean this in a harsh way, NLTK is a great accomplishment, a great resource for education, and still usable for more lightweight NLP.
A lot of machine learning, at least from the perspective of NLP, requires a lot of language and task-specific tuning. If a model is deployed, it needs to be tailored to the environment where is used. Large frameworks typically make this kind of tailoring hard.

Small, orthogonal, libraries are much easier to combine for specific scenarios. Also, they will enable people to build larger (competing) components on top of them. The Python ecosystem is a good example, there are many things that have run its course (e.g. Theano, NLTK, scikit-learn to some extend), but the basic building blocks (numpy, matplotlib) are reused over and over (e.g. in PyTorch, Tensorflow, spaCy, etc).

More practically, it would be strange to work on big frameworks when the basic building blocks are still very incomplete. To give some examples that I have run into when using Rust for NLP:

I absolutely love ndarray. It is very close to numpy in terms of ergonomics. We virtually use it in all our projects. However, there are no strong guarantees that your code gets properly autovectorized and unrolled (due to strict FP), especially outside matrix multiplication. Maybe things have improved, but when we made finalfrontier, we had to hand-roll functions with SIMD intrinsics to make it competitive to C code.
I also like ndarray-linalg, since it makes LAPACK available. But we do not have most LAPACK primitives in native Rust. This has the downside that one always has to link an external BLAS library. Moreover, the default most performant open source option (OpenBLAS) is not compatible with multi-threaded applications. So, it's multithreaded OpenBLAS or a multi-threaded application, pick one.
There is no full-fledged Rust alternative to matplotlib.
Basic data formats are not supported. There seems to be movement in the hdf5-rust repo, but last time I tried, you could not actually read data from HDF5 containers using the hdf5 crate and the repository was dormant.

I think it would be useful to identify the most basic, albeit important, building blocks that are missing and gather a group of people around them to implement and maintain them under one umbrella. I think gonum is a really nice example of a project that implemented most of the basic plumbing in Go as a coherent whole.

rth commented 5 years ago

We don't need a huge monolith like SciPy or Scikit-learn

Scipy maybe not, but for a common ML library, I'm less sure. Beyond the estimators, it implements, the value of scikit-learn is in defining a common API, comprehensive documentation comparing models and high selection criteria for included models. When I use a model from scikit-learn, I know it's relatively well established, and that the implementation was thoroughly reviewed (though I might be biased there ^^). I can also quickly see what other similar models are out there in the documentation and what are the differences between them. A number of packages claim compatibility with scikit-learn API, which means that they should pass the check_estimator test, but in practice not that many do.

A package with common traits may take care of the API, but as to the other point they are still somewhat open questions. Maybe the code doesn't need to part of the same crate but under a common organization with some acceptance process similar to scikit-learn-contrib. But then having some way of listing/searching all the models and comparing them might still be beneficial.

Also isn't rusty-machine and possibly rustlearn already a good candidate for the (classical) ML crate in Rust.

rth commented 5 years ago

Large monolithic frameworks tend to get outdated pretty quickly. NLTK, which has been mentioned here is a good example.

Doesn't that mostly apply to the fast-evolving DL eco-system though? If you take some more basic things like a linear model, PCA, tokenization it hasn't really changed that much in the last 10 years. Also your perspective is that of an NLP researcher well aware of the field. If you take a beginner Python developer, say for string distance, he is still more likely to try NTK over any one of specialized packages IMO (and then look for a better solution if performance matters). Sometimes one doesn't even know the right keywords to search for.

I think gonum is a really nice example of a project that implemented most of the basic plumbing in Go as a coherent whole.

It is a nice example, however, is there an example of that for ML? Do you think Tensorflow, PyToch or Spacy would have the visibility they have now if they started as 10 separate projects under the same umbrella ? Also, there need to be some way of ensuring maintenance of all these projects, if/when their developer(s) moves on. Currently, a fairly new Rust ecosystem has already a number of half-abandoned projects, which makes choosing the right project more difficult.

I don't disagree about avoiding monolithic packages for Rust, just pointing out that there are challenges that need to be addressed in that case.

From https://users.rust-lang.org/t/interest-for-nlp-in-rust/15331/11

I think it would be worthwhile to set up a Rust NLP working group that aims to work towards creating something like spaCy in Rust.

@danieldk I would be interested as well. How do see this working, a repository with issues in some github org could be a start?

danieldk commented 5 years ago

It is a nice example, however, is there example of that for ML?

I am not sure, I have not really used Go for a while. Maybe I should have stated my view more succinctly: I largely agree with you, also that a more traditional ML library would be useful to a lot people. But I think the current problem is that the basics linear algebra, visualization, and data format crates are not in a shape yet to have a thriving pure Rust ML ecosystem.

I think it is definitely possible to do production-level ML in Rust by leveraging existing C++ libraries such as Tensorflow or PyTorch.

I think it would be worthwhile to set up a Rust NLP working group that aims to work towards creating something like spaCy in Rust.

@danieldk I would be interested as well. How do see this working, a repository with issues in some github org could be a start?

Sounds good! I would really be interested in such an effort and I think if we'd leverage Tensorflow or PyTorch, we could get something off the ground pretty quickly. I already have some of the components, e.g. sticker does part-of-speech tagging & dependency parsing (and other sequence labeling tasks, such as named entity recognition) [1]. And you and @jbowles have already worked on tokenization.

[1] Admittedly, it needs some clean-ups to be ready for general use.

sebpuetz commented 5 years ago

I would also be interested in joining the effort. @danieldk and I have been collaborating on some NLP tools in Rust, e.g. finalfrontier.

Recently, I have been working on a crate to handle constituency trees lumberjack. The crate is still pretty much work-in-progress, at this point it's possible to read various flavours of the bracketed (PTB) format and NEGRA (also non-projective) trees. There's some support to process those trees through e.g. projectivization, filtering+insertion of non-terminal nodes and (un-)collapsing of unary chains but it's still quite far away from being generally useful.

The sentiment in this thread seems to be that large, general purpose libraries are not desirable, hard to maintain and more specific projects are preferable. I agree with this sentiment, but too much fragmentation is also undesirable. The different projects should be easy to find (maintain some registry of active NLP projects?) and perhaps allow some interoperability by, for instance, using common traits that could be defined in some core crate. The issue raised by @rth about maintenance in a fragmented ecosystem should also be considered. Many independent crates come with many independent developers who can independently become inactive and drop development of the respective crate.

jbowles commented 5 years ago

@rth awesome, I'm all about collaboration; I'm going to checkout the package after work today! I also found an implementation of punkt tokenizer rust-punkt!

I agree with @danieldk about

Small, orthogonal, libraries are much easier to combine for specific scenarios. Also, they will enable people to build larger (competing) components on top of them. The Python ecosystem is a good example, there are many things that have run its course (e.g. Theano, NLTK, scikit-learn to some extend), but the basic building blocks (numpy, matplotlib) are reused over and over (e.g. in PyTorch, Tensorflow, spaCy, etc).

Everything I've ever done for any company that was NLP-related required custom tokenizers and special text preprocessing... so much so that it was easier to write our own than bring in big dependencies for the small bits we did need. So I would think composability and extendability are pretty important for parts of a ML/NLP API in general.

value of scikit-learn is in defining a common AP

Implementation aside, mono-repo or bunch of packages, I think it's not too controversial to say a standard and easy to interface with API [where standard is more important than ease of use] is the general goal.

Bringing this back into focus so it doesn't get lost is @LukeMathWalker's original list of

- n-dimensional arrays;
- dataframes;
- an ML model interface.

I think there's some broad agreement that these are good to focus on (i would tuck file formats under Dataframes... though I don't know if the dataframes package has that as a priority ??). Also, IMO, a rust-ml meta-project has room for the kinds of projects @yngtodd, @masonk, and @koute are involved with since a generally standardize-able set of API's and interfaces and/or traits benefits everyone. I also agree with @LukeMathWalker that we are not going to find this standardization through design by committee and it'll take some iterations.

maybe some small and focused experimental packages could be started as a means to simply iterate??

I think it would be cool too to spread the word a bit about some of the discussion here. I happen to know that Target (the store) is writing a new data analytics platform in rust... there is dataflow by frank mcsherry -- which i believe he got funding and started a company around, as well the recent acceptance of andy grove's datafusion into the apache arrow project... I don't know how many of these people talk or know each other but I imagine it'd be good.

stokhos commented 5 years ago

I'm a newbie in ML, I think in order to make rust shine in ML, we should think of include more basic optimization algorithms. there are many nonlinear algorithms that have faster convergence rate and maybe better local optimal than ADAMS and Gradient Descent, there are also cases when people are looking for integer solutions. I'm not sure if these are inlcuded in tensorflow or other DL frameworks, but it would be cool if people can find them here.

tqchen commented 5 years ago

Hi guys, just want to jump in and say that the TVM community would definitely love to see more things to happen in the tvm&rust.

We have folks in this thread (@nhynes @kazimuth) who are actively developing the tvmrs stack. @jroesch who hacked rust compiler before is now leading the Relay IR effort and we are having a serious conversation to enable more native cross-language native experience(support rust as a native backend language). And yes, training(besides inference) is on the scope of Relay IR.

I think the best benefit also comes from general interpolations. For example, it would be really awesome if the rust-ndarray can be built with DLPack in mind.

DLPack itself is a C based structure for exchanging in-memory tensors, and it has been used by tvm, pytorch, chainer, mxnet, and a few other projects to do in-memory tensor data exchange. As a matter of fact, the DLPack structure itself might be a good candidate to build in-memory structure for NDArray. TVM directly build its NDArray around DLTensor.

One potential problem that many ndarray projects eventually has to address in the future is to define support for GPUs/accelerators, and DLPack structure provides a relatively minimum agreed guideline to do so.

ehsanmok commented 5 years ago

Thanks @jbowles for mentioning me :pray:

First, I can divide the strategy for "What do we want to build?" as informed move vs. ignorant move. Rust itself as a language shines over any other existing languages because of how open it is to offered inputs and no-ego rules in the language development. This is something I value the most and I'd love to see happening in future RustML WG!?

For RustML I envision two overlapping categories;

Traditional ML
Deep learning oriented

over short term vs. long term (aka building from scratch) plans. I can see some values for long term development in either categories, but that's not something that interests me personally now. However, insisting only on building things from scratch, in my opinion, comes from ignorant. Solutions developed over many years with equivalent of millions of dollars of development time and effort cannot be ignored. Also Rust itself is not mature for that; const-generic, GAT, GPU and std::simd support are important requirements.

My interests lie around deep learning and since there are already much mature projects available, binding Rust to those existing ones is very appealing to me as the short term plan (right now, tensorflow/rust and tch-rs exist but they need a lot of polishing). After binding has happened then one can assess whether it'd worth it rebuilding things from scratch or not. Companies want to first see the value of introducing a new tool and based on my experience it's a sane way as a short term plan.

Since I have some experience with TVM and it has some good Rust support for inference, I'd like to focus more on that, especially because DL system frontier is happening in the compiler area and I think will go deeper. I'd love to see Relay enabled training and more integration of Rust with TVM in that direction.

LukeMathWalker commented 5 years ago

@danieldk

Small, orthogonal, libraries are much easier to combine for specific scenarios. Also, they will enable people to build larger (competing) components on top of them. The Python ecosystem is a good example, there are many things that have run its course (e.g. Theano, NLTK, scikit-learn to some extend), but the basic building blocks (numpy, matplotlib) are reused over and over (e.g. in PyTorch, Tensorflow, spaCy, etc).

@rth:

We don't need a huge monolith like SciPy or Scikit-learn

Scipy maybe not, but for a common ML library, I'm less sure. Beyond the estimators, it implements, the value of scikit-learn is in defining a common API, comprehensive documentation comparing models and high selection criteria for included models. When I use a model from scikit-learn, I know it's relatively well established, and that the implementation was thoroughly reviewed (though I might be biased there ^^). I can also quickly see what other similar models are out there in the documentation and what are the differences between them. A number of packages claim compatibility with scikit-learn API, which means that they should pass the check_estimator test, but in practice not that many do.

A package with common traits may take care of the API, but as to the other point they are still somewhat open questions. Maybe the code doesn't need to part of the same crate but under a common organization with some acceptance process similar to scikit-learn-contrib. But then having some way of listing/searching all the models and comparing them might still be beneficial.

Also isn't rusty-machine and possibly rustlearn already a good candidate for the (classical) ML crate in Rust.

@sebpuetz:

The sentiment in this thread seems to be that large, general purpose libraries are not desirable, hard to maintain and more specific projects are preferable. I agree with this sentiment, but too much fragmentation is also undesirable. The different projects should be easy to find (maintain some registry of active NLP projects?) and perhaps allow some interoperability by, for instance, using common traits that could be defined in some core crate. The issue raised by @rth about maintenance in a fragmented ecosystem should also be considered. Many independent crates come with many independent developers who can independently become inactive and drop development of the respective crate.

[...] It is a nice example, however, is there an example of that for ML? Do you think Tensorflow, PyToch or Spacy would have the visibility they have now if they started as 10 separate projects under the same umbrella ? Also, there need to be some way of ensuring maintenance of all these projects, if/when their developer(s) moves on. Currently, a fairly new Rust ecosystem has already a number of half-abandoned projects, which makes choosing the right project more difficult.

I don't disagree about avoiding monolithic packages for Rust, just pointing out that there are challenges that need to be addressed in that case.

I think both positions are valid: on one side, having few, well-maintained, comprehensive libraries gives more visibility to the projects and makes it more likely that they are going to be maintained in the long run. On the other side though, I can't help but feel that projects like Scikit-learn or SciPy bring their own challenges:

dependency cost: having to bundle the whole SciPy project when I just need sparse matrices, for instance, has caused me a lot of problems when trying to deploy ML models on AWS Lambda. Looking at the Rust ecosystem, I can clearly see a future where both the WASM and the Embedded stories could use what we are planning to do here. Bringing in 1GB of dependency code in a static binary when you probably need 1/1000th of the offered functionality is definitely going to be a problem;
iteration speed: once a model lands in Scikit-learn, it's quite difficult to iterate on it, if any of those improvements has to break an interface. You can't bump a major version (or minor, given that they are using 0.x.x) every two months. Given that Scikit-learn is the sum of a lot of loosely-coupled packages, this can lead to rot and sub-optimal implementations being pushed as "reference", similarly to what happens with some submodules of Python's standard library;
maintenance burden: given that both Scikit-learn and SciPy are perceived as "packages", they are bound to obey the semver rules for their releases. This often means that once something gets accepted, it's there to stay. This leads to a high amount of pressure on maintainers and probably an excessively conservative attitude when it comes to accept new functionality inside those libraries. Yes, what gets accepted is usually extremely good (at the time it was evaluated), but how many useful projects are never found because they were deemed to be "too risky" to be accepted?

I think we should learn from these experiences and from the overall trends of the Rust community. My personal proposal would be something along these lines:

One or more small orthogonal core crates containing all the relevant traits (for models, pipelines, input, outputs, etc.). They will follow semver for their releases;
Several, purpose-focused crates, each providing a specific set of functionalities (e.g. tree-based algorithms, SVMs, Gaussian Processes, etc.). They will follow semver for their releases;
a large, monolithic, batteries-included crate that re-exports all the core crates as well as high-quality crates for each category of functionality (same structure of Scikit-learn). No new code or development would take place here: there is no functionality that is not available in one of the re-exported crates.

This huge crate would provide a single entrypoint to showcase what is available in the ML Rust ecosystem, ensuring high visibility to all the relevant projects out there. I envision you could start out your project using this large crate to play around: you don't know what you need, you just want to quickly plug and play a bunch of different things until you get something working. Once you are satisfied with what you have, you can just drop it as a dependency and bring in the specific subsets you need. Given that this batteries-included crate makes no attempt to be a single-package, it should not follow semver conventions. We could just do periodic releases (19.04, 19.06, etc.), a little bit like OSes do, with no guarantee on the provided interfaces - you are not supposed to depend on this for your production code after all. This would make it easy to drop crates that are no longer relevant, thus enabling us to be more audacious when it comes to bet on a new crate.

I think it would also make sense to have an organization owning both the huge crate, the core crates and all the crates that wish to be re-exported in the huge crate: this would make sure that if the maintainers of any of those sub-projects cannot keep going, there is the option for the community to take the reins and ensure the sub-project does not die (if it's deemed to be relevant, of course).

What are you thoughts on this @danieldk @rth @sebpuetz?

LukeMathWalker commented 5 years ago

Re-reading the whole thread, I'd say that we have identified the problem spaces where there is community interest to get started with some work. We have also surfaced, for each of these problem spaces, what are the pros and cons of "build vs buy": port to Rust or start out by providing good bindings to existing projects.

I'd like to try to synthesize the different positions and work streams, so that we can move from there and start planning some concrete action (as well as spin-off separate threads to discuss each of them in-depth). I'll try to find some time between today and tomorrow to do it :+1:

LukeMathWalker commented 5 years ago

So far, these seem to be the different areas of focus:

General-purpose pre-processing A sufficiently complete toolkit to mold raw data before getting into a machine learning model (dimensionality reduction, scaling/normalisation, discretization, missing value imputation, etc.).
NLP Rust seems to have plenty of small crates, more or less experimental, for dealing with natural language (see also this discussion on Rust internals). It's probably time to coordinate: identify functionality gaps, generalize some of the existing crates and create a coherent story.
Classical ML There are several unmaintained crates (e.g. rustlearn, rusty-machine). The language and the ecosystem have moved considerably forward since those experiments and now it's a good time to isolate core components and provide a set of community-backed interfaces for ML models and pipelines, making sure they are fit for purpose by trying them out while implementing or porting classical ML algorithms (trees, SVMs, Gaussian Processes, clustering, etc.). Some of these algorithms already exists, scattered in different crates, at different stages of maturity and maintenance - coordination and re-use should be promoted whenever possible.
Deep Learning There are bindings for both PyTorch (tch-rs) and Tensorflow (tensorflow), but they could use some love to fill in the gaps and polish the API. There is ongoing work in the TVM project: there are Rust bindings for the runtime, but no bindings to the TVM compiler. TVM could be used as backend for a high-level DL library in Rust, thus providing access to a wide range of different hardwares. At the same time, there is interest in exploring the feasibility and ergonomics of a pure-Rust crate for DL.
Deployment Bridging the gap from a working prototype to production: Rust has a chance to shine here, putting to use well known crates such as rocket. Getting a model exposed behind a performant API should be a matter of configuration: what protocol do I want to use (HTTP/gRPC/etc.), where is the model, do I want to expose Prometheus metrics, where do I send the logs. Done, shipped.
DataFrames There is an ongoing discussion here on what challenges a DataFrame library in Rust should tackle and what design constraints it should obey.
Interoperability No ecosystem is an island: we need to be able to serialize and deserialize the most common file formats in the ML/data space - hdf5, Parquet, NumPy's npy, etc. Each of these format has seen some work, but the maturity of the corresponding crate might vary.
Reinforcement Learning It has been mentioned that Rust could be used to develop training environments/gyms for RL models. Is there latitude to interact with the work of the Rust gamedev community?
Computer Vision opencv-rust was mentioned, and I know @xd009642 is working on ndarray-vision. The topic was not extensively discussed in this thread, but there might be a wider community interested that we haven't been able to intercept here.

To move forward, I assume we could divide ourselves into smaller groups to tackle some of the mentioned topics. For each of these smaller groups, it would be nice to know who is interested in contributing and what is a rough idea of the roadmap moving forward. We could then circulate them in the wider Rust community: as long as we provide easy entrypoints and mentorship, I think we can expect more people to join the effort and get involved.

We could use this repository/organization as the home of an informal Rust ML working group, in order to keep a centralized discussion hub to make sure that efforts do not remain hidden and isolated in this early phase of the ecosystem.

jbowles commented 5 years ago

I am willing to devote time to NLP/pre-processing and Computer Vision;

NLP

gather existing/old resources and contact developers about contributing
contribute to existing projects

Computer Vision

i have a need for rust bindings to OpenCV

jblondin commented 5 years ago

I'm interested in contributing to DataFrames and General-purpose pre-processing.

I'm already involved in the dataframes discussions and have a WIP I've been working on.

For pre-processing, I think it'd be great to start discussing needs / goals and initial design ideas.

I think we should be careful not to silo these groups too much -- there's a lot of interconnectedness among them, and design decisions in one will definitely affect others. For instance, any dataframe implementation should definitely keep Interoperability in mind!

How would online / streaming / non-in-memory machine learning fit into this breakdown, or would it be an additional focus area?

jblondin commented 5 years ago

We could use this repository/organization as the home of an informal Rust ML working group, in order to keep a centralized discussion hub to make sure that efforts do not remain hidden and isolated in this early phase of the ecosystem.

I think I missed this in my first read-through; this would help avoid some of the silo-ing issues I was concerned about.

LukeMathWalker commented 5 years ago

Personally, I'd be happy to devote the bulk of my time to the Classical ML work stream, extending to General Preprocessing, Deployment and DataFrames if needed/when the bulk of the work is done.

How would online / streaming / non-in-memory machine learning fit into this breakdown, or would it be an additional focus area?

I think of it as a cross-cutting concern, in the sense that each area has to do "its homeworks" to enable streaming ML.

rth commented 5 years ago

Thanks for this summary @LukeMathWalker ! For each of these topics, I think a first step could be to list exiting crates (https://github.com/anowell/are-we-learning-yet does some of that already), get some input from their maintainers, and also review what existing solutions exist in the C++/Python/etc space.

For NLP, would you mind creating say rust-ml/nlp-discussion repo in this org ? I agree that some of the problems solved by these different groups are related and it might be preferable not to isolate them too much, at least in the beginning.

LukeMathWalker commented 5 years ago

For NLP, would you mind creating say rust-ml/nlp-discussion repo in this org ? I agree that some of the problems solved by these different groups are related and it might be preferable not to isolate them too much, at least in the beginning.

For discussion purposes, I'd say that it would enough just to create a new issue here (in the spirit of keeping things close together and visible). What do you think?