
mlpack: a fast, header-only C++ machine learning library
https://www.mlpack.org/

Java bindings #2055

Closed Vasniktel closed 3 years ago

Vasniktel commented 4 years ago

Hello everyone. I have recently become interested in this library and thought about implementing Java bindings for it. My question is: what should I take a look at first, what issues should be addressed, and where can I learn more?

rcurtin commented 4 years ago

Hey @Vasniktel, Java bindings would be really great. It would be a good amount of work though. :) You could start to understand our automatic bindings system by taking a look at its documentation: http://mlpack.org/doc/mlpack-3.2.1/doxygen/bindings.html

A couple years ago we had a Google Summer of Code student express interest in Java bindings, but at the end of the day it was Go bindings that were implemented. Still, the discussion thread about the Java bindings could be useful: http://lists.mlpack.org/pipermail/mlpack/2018-March/003687.html

Hope this helps! Let me know if I can clarify anything.

Vasniktel commented 4 years ago

Thanks for the reply! I wanted to briefly summarize here everything that bindings should handle to make sure I'm not missing anything:

  1. provide functions for language-specific formatting of documentation comments
  2. provide mappings from Armadillo matrices to some NumPy alternative for Java
  3. provide a way to allow users to efficiently pass models back and forth

As for the implementation, I thought to do the following:

  1. documentation formatting is not a problem
  2. I thought to use ND4J as a NumPy alternative here. I'm not sure, though, whether it is possible to get rid of memory copying entirely.
  3. It should not be hard to represent the model as a wrapper around a pointer in Java using the JavaCPP library.

Feel free to add to this or correct me if something is wrong here.

rcurtin commented 4 years ago

Hey @Vasniktel, right, this is correct. I think ND4J is a reasonable approach. It would be really great if there were some way to avoid memory copying, but I agree that it could be difficult. Unfortunately, needing to copy the data can cause a slowdown that erases the advantage of using mlpack (vs. native Java libraries). I can see that ND4J matrices are laid out in memory in the right way:

https://deeplearning4j.org/docs/latest/nd4j-overview#inmemory

So internally they must basically just be holding a pointer to allocated memory (even if it will just be an implicit reference in Java or something like this). I'm not entirely opposed to ugly hacks if needed. :) But the following constructor seems like it might be promising:

https://deeplearning4j.org/api/latest/org/nd4j/linalg/factory/Nd4j.html#create-double:A-int-int-int:A-long-char-

I don't know if the implementation does a copy there, though. We would, I guess, just need to get a double[] that points to the Armadillo memory via... JNI or something? I'm not sure. There are a lot of ideas; perhaps even JavaCPP could help here (and I agree that that could be useful for holding models as pointers too).

Anyway, hope this helps. The mlpack automatic bindings system generates bindings for another language, so, to make things a little easier to test, it's often best to first handwrite a binding (like pca or perceptron), manually build and link it against libmlpack.so and the other bits, and make all of that work as expected. Once a handwritten binding is working, it's pretty straightforward to write code that can automatically generate that binding. (For instance, this is all that the file print_pyx.cpp does in the Python bindings.)

Let me know if there's anything else I can clarify! I know that the system is fairly complex, so I can maybe provide some more details if the documentation falls short (and then I guess I can update the documentation too :)).

Vasniktel commented 4 years ago

Thanks for the reply, @rcurtin. I was curious whether there is any documentation on Armadillo's memory ownership rules (e.g. how to figure out whether the memory of a matrix is owned by it or not and how to set these parameters).

rcurtin commented 4 years ago

Yes, absolutely, check this out:

http://arma.sourceforge.net/docs.html#Mat

The "advanced constructors" allow you to create a matrix wrapped around pre-allocated auxiliary memory (it won't copy the memory if copy_aux_mem is false).

You can also get the underlying memory of a matrix with Mat::memptr().

To find out the memory state of an Armadillo matrix, the Mat::mem_state member can be used; you can take a look at armadillo_bits/Mat_bones.hpp for more information on that. However, that is the "internal API", so it isn't externally documented. :)

Vasniktel commented 4 years ago

Hey @rcurtin, I've been investigating the Julia bindings and I've noticed that you used the points_as_rows parameter everywhere. I'm currently writing bindings for categorical matrices, and I'm curious whether I should use something similar as well (i.e., what was the motivation for points_as_rows?).

Vasniktel commented 4 years ago

Ok, I've made a PR #2100

rcurtin commented 4 years ago

@Vasniktel wow, awesome! Usually these are pretty intense to review so it might be a little while until I can really give it a good look-over. I'm hoping to take a first look today though; we'll see how far I get. :+1:

mlpack-bot[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! :+1: