karllessard / tensorflow-java-sandbox

Experimenting on TensorFlow Java client modularization

Java Repository Layout and Artifacts #1

Open karllessard opened 5 years ago

karllessard commented 5 years ago

Let's start this issue to collect suggestions/comments in regard to the proposed layout and artifacts for the upcoming TF Java repositories, which will be based on this one.

karllessard commented 5 years ago

The current layout proposes the following hierarchy:

karllessard commented 5 years ago

Renaming the artifacts: the current names of the main core artifacts are libtensorflow for the Java part and libtensorflow_jni for the native one.

We can take this opportunity to give them better names that are more Java-like than C-like.

First, do we keep the groupId as org.tensorflow? or should we switch to something new for the SIG, like org.tensorflow.jvm?

Also, I don't know about the artifacts that would come out of using JavaCPP, but for the current implementation, I suggest these new names as a starting point:

libtensorflow -> tensorflow-java
libtensorflow_jni -> tensorflow-native

saudet commented 5 years ago

In my opinion, these 2 artifacts should be merged, because neither can be used independently anyway. Moreover, I don't see anything in the build files for Bazel in there that we can't already do with JavaCPP, which can do more, so we only lose functionality by keeping Bazel to manage the JNI-enabled modules there. tensorflow-native, I think that would be a good name for this module, yes. We'd get multiple artifacts out of it with different classifiers, making it easy to do CI with it as well, and I would also add a tensorflow-native-platform module that depends on all these artifacts with classifiers, but which would be transitively configurable like this via profiles: https://github.com/bytedeco/javacpp-presets/wiki/Reducing-the-Number-of-Dependencies
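
For illustration, a minimal sketch of what that could look like for a user, assuming hypothetical org.tensorflow coordinates and a placeholder version (nothing is published under these names yet):

        <dependency>
            <groupId>org.tensorflow</groupId>
            <artifactId>tensorflow-native-platform</artifactId>
            <version>2.0.0-SNAPSHOT</version> <!-- placeholder coordinates/version -->
        </dependency>

Building with mvn -Djavacpp.platform=linux-x86_64 ... would then activate the matching profile and pull in only the Linux binaries, as described on the wiki page above.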

As for the group name, I say we should keep the org.tensorflow group as is. Adding redundant suffixes like "java" or "jvm" just makes names longer. It's pretty obvious that Maven modules are for Java.

karllessard commented 5 years ago

In my opinion, these 2 artifacts should be merged, because neither can be used independently anyway.

I like the idea of keeping those 2 artifacts separated, because the Java core just needs to be built and deployed once, while the native core needs to be rebuilt/redeployed for each configuration and OS we want to support. The user project can depend on many versions (classifiers) of it without any issue, as there won't be any clash of duplicated Java classes.

Having distinct artifacts allows you to include/exclude manually which platform and features you want from the native layer.

For example, let's say you just want support for Linux (Ubuntu); you can add these individual dependencies:

org.tensorflow:tensorflow-java
org.tensorflow:tensorflow-native:linux-x86_64
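
In a pom, that selection could look like the following sketch (artifact names per this proposal; the 2.0.0-SNAPSHOT version is a placeholder, since nothing is published under these names yet):

        <dependency>
            <groupId>org.tensorflow</groupId>
            <artifactId>tensorflow-java</artifactId>
            <version>2.0.0-SNAPSHOT</version> <!-- placeholder version -->
        </dependency>
        <dependency>
            <groupId>org.tensorflow</groupId>
            <artifactId>tensorflow-native</artifactId>
            <version>2.0.0-SNAPSHOT</version> <!-- placeholder version -->
            <classifier>linux-x86_64</classifier>
        </dependency>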

On the other hand, if size does not matter to you, you can simply depend on one of the starter artifacts that automatically include the "common" versions of the library, as is done now: org.tensorflow:tensorflow

This will automatically add to your classpath:

org.tensorflow:tensorflow-java
org.tensorflow:tensorflow-native:linux-x86_64
org.tensorflow:tensorflow-native:darwin-x86_64
org.tensorflow:tensorflow-native:windows-x86_64

The current library loader logic can take care of selecting the right library at runtime.
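
As a rough illustration of the technique (a hypothetical, simplified loader, not the actual TensorFlow code; the resource path and file names are made up):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class LoaderSketch {
    public static void main(String[] args) throws Exception {
        // Detect the current platform to pick the matching classifier jar.
        String os = System.getProperty("os.name").toLowerCase();
        String platform = os.contains("win") ? "windows-x86_64"
                        : os.contains("mac") ? "darwin-x86_64"
                        : "linux-x86_64";
        String lib = platform.startsWith("windows") ? "tensorflow_jni.dll"
                   : platform.startsWith("darwin") ? "libtensorflow_jni.dylib"
                   : "libtensorflow_jni.so";
        // Extract the bundled library from the classpath to a temp file, then load it.
        String resource = "/org/tensorflow/native/" + platform + "/" + lib;
        try (InputStream in = LoaderSketch.class.getResourceAsStream(resource)) {
            Path tmp = Files.createTempFile("tensorflow_jni", lib);
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            System.load(tmp.toAbsolutePath().toString());
        }
    }
}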

We could have another starter called org.tensorflow:tensorflow-gpu to include the same libraries but with GPU support enabled.

I also don't have a clear list of which other OSes we need to support, but we can build the tensorflow-native artifact for any of them, or for any distribution (e.g. CentOS).

I don't see anything in the build files for Bazel in there that we can't already do with JavaCPP

I don't have a clear idea of what JavaCPP can or cannot do, but of course, if we don't need to keep Bazel, we'll simply drop it. Maybe you can create a PR with the changes you would like to apply to replace JNI/Bazel with JavaCPP, so I/we can better understand what it implies?

I would also add a tensorflow-native-platform module that depends on all these artifacts with classifiers, but which would be transitively configurable like this via profiles

IMHO, simple Maven inclusion/exclusion is pretty basic and should cover most of our cases, without relying on additional build variables. Maybe you can explain a bit more, with some examples, why we would need those?

As for the group name, I say we should keep the org.tensorflow group as is. Adding redundant suffixes like "java" or "jvm" just makes names longer. It's pretty obvious that Maven modules are for Java.

Hehe, the "jvm" stands for "SIG JVM group" and not "Warning! This is a Java library!". It was mainly to avoid any clash with existing artifacts that came out of the core repository (i.e. org.tensorflow:tensorflow) and that might still come out after we migrate our part (need to check, but the last news was that Android was still using the Bazel Java build for some of their apps, and I don't know how long they'll continue to do so).

But yeah, we can stay positive and assume that it is safe to preserve the org.tensorflow group ID, which I agree is better.

saudet commented 5 years ago

For example, let's say you just want support for Linux (Ubuntu); you can add these individual dependencies:

org.tensorflow:tensorflow-java
org.tensorflow:tensorflow-native:linux-x86_64

What I do is exactly the same. The only difference is that it would look like this instead:

org.tensorflow:tensorflow-native
org.tensorflow:tensorflow-native:linux-x86_64

The first contains only Java classes that are the same for all platforms; the second, only the native libraries for the linux-x86_64 platform: exactly what you describe. The difference is that they need to be built together, so there is less chance of them getting out of sync. Also, for users, it makes it clear that they are related, while just adding a "java" suffix is pretty redundant, as I pointed out previously.

On the other hand, if size does not matter to you, you can simply depend on one of the starter artifacts that automatically include the "common" versions of the library, as is done now: org.tensorflow:tensorflow

Right, that's the tensorflow-native-platform I was referring to, but the name doesn't really matter, although having "native" in there makes it clear that it's related to the tensorflow-native ones only.

IMHO, simple Maven inclusion/exclusion is pretty basic and should cover most of our cases, without relying on additional build variables. Maybe you can explain a bit more, with some examples, why we would need those?

Well, there are many TF users that also use OpenCV, FFmpeg, Tesseract, etc., but say we don't care about those and just think about TF. TF itself also depends on libraries like CUDA and MKL, among others. They are not mandatory, but they are still useful for performance, so it makes sense to provide builds that use CUDA, MKL, etc. Now, how do users go about getting those dependencies? We could force them to install them manually, but since the redistributables are available as Maven artifacts, it's also possible to use them like this:

        <dependency>
            <groupId>org.bytedeco</groupId>
            <artifactId>mkl-platform-redist</artifactId>
            <version>2019.4-1.5.2-SNAPSHOT</version>
        </dependency>
        <dependency>
            <groupId>org.bytedeco</groupId>
            <artifactId>cuda-platform-redist</artifactId>
            <version>10.1-7.6-1.5.2-SNAPSHOT</version>
        </dependency>

And users can select at runtime the platforms they are interested in, to limit the binaries that need to be downloaded. But you're saying they should be forced to manually add/remove, at build time, dependencies for all their platforms and for all the transitive dependencies? You can't mean that. This is basic functionality that the Java platform needs if it wants to remain relevant in the enterprise over the long term! I'll eventually create a PR here, but it's basically going to look like what I've done before for the JavaCPP Presets and Deeplearning4j. And we've talked about this before too, on the mailing list, etc. So I'm getting the impression that you're concerned about something else that you're not mentioning here. Even if I make a PR and get everything working, I feel it's not going to be acceptable for some reason, or am I wrong?

karllessard commented 5 years ago

For example, let's say you just want support for Linux (Ubuntu); you can add these individual dependencies:

org.tensorflow:tensorflow-java
org.tensorflow:tensorflow-native:linux-x86_64

What I do is exactly the same. The only difference is that it would look like this instead:

org.tensorflow:tensorflow-native
org.tensorflow:tensorflow-native:linux-x86_64

The first contains only Java classes that are the same for all platforms; the second, only the native libraries for the linux-x86_64 platform: exactly what you describe. The difference is that they need to be built together, so there is less chance of them getting out of sync.

I like that. Then if the artifact is not only used to hold the native code, maybe we should rename it to tensorflow-core instead, wdyt?

On the other hand, if size does not matter to you, you can simply depend on one of the starter artifacts that automatically include the "common" versions of the library, as is done now: org.tensorflow:tensorflow

Right, that's the tensorflow-native-platform I was referring to, but the name doesn't really matter, although having "native" in there makes it clear that it's related to the tensorflow-native ones only.

Got it. And that's the thing: they might contain more than tensorflow-native (or tensorflow-core) later; I guess we could decide at some point that the Keras API should be part of the common starters as well. So in this case, tensorflow or tensorflow-gpu sounds appropriate.

IMHO, simple Maven inclusion/exclusion is pretty basic and should cover most of our cases, without relying on additional build variables. Maybe you can explain a bit more, with some examples, why we would need those?

Well, there are many TF users that also use OpenCV, FFmpeg, Tesseract, etc., but say we don't care about those and just think about TF. TF itself also depends on libraries like CUDA and MKL, among others. They are not mandatory, but they are still useful for performance, so it makes sense to provide builds that use CUDA, MKL, etc. Now, how do users go about getting those dependencies? We could force them to install them manually, but since the redistributables are available as Maven artifacts, it's also possible to use them like this:

        <dependency>
            <groupId>org.bytedeco</groupId>
            <artifactId>mkl-platform-redist</artifactId>
            <version>2019.4-1.5.2-SNAPSHOT</version>
        </dependency>
        <dependency>
            <groupId>org.bytedeco</groupId>
            <artifactId>cuda-platform-redist</artifactId>
            <version>10.1-7.6-1.5.2-SNAPSHOT</version>
        </dependency>

And users can select at runtime the platforms they are interested in, to limit the binaries that need to be downloaded. But you're saying they should be forced to manually add/remove, at build time, dependencies for all their platforms and for all the transitive dependencies? You can't mean that.

Probably not, because I don't understand what you mean. I was not referring to external libraries like CUDA or MKL; I think I was just pointing out that the examples I gave for handling the different combinations of tensorflow-core, using the standard Maven dependency system, should be sufficient for most cases. I thought you wanted to introduce some more advanced/custom configuration variables at build time to achieve this.

This is basic functionality that the Java platform needs if it wants to remain relevant in the enterprise over the long term! I'll eventually create a PR here, but it's basically going to look like what I've done before for the JavaCPP Presets and Deeplearning4j. And we've talked about this before too, on the mailing list, etc. So I'm getting the impression that you're concerned about something else that you're not mentioning here. Even if I make a PR and get everything working, I feel it's not going to be acceptable for some reason, or am I wrong?

You are very wrong, indeed :) For sure, I want to understand what you intend to do and so far, it sounds promising.

Craigacp commented 5 years ago

One thing I'd note is that it would be useful to have TensorFlow builds which include both MKL and CUDA support, as then the CPU fallback from GPU code is less painful. I understand that DL4J has completely separate bindings, which can cause a bit of a perf cliff if the GPU doesn't have an implementation of an operation. This might be less of an issue in TensorFlow, as it's got a better buildout, but it's worth keeping in mind.

jxtps commented 5 years ago

Layout looks great. I'd consider renaming /support to /util or /helpers - "support" makes my mind go in the wrong direction.

Artifacts: it sounds like @saudet has a clear picture in mind after all his work on JavaCPP - I've only used it via DL4J and some other projects, and while I appreciate the benefits as a user, I'm not deeply familiar with the internal layout. Maybe there are some primer docs (on bytedeco.org?) that the rest of us should read to better understand the more nuanced mechanics? At present I'm not in a position to provide meaningful feedback on this aspect.

karllessard commented 5 years ago

@jxtps : sure, I thought of utils before as well, so I guess it's a better pick

saudet commented 5 years ago

I like that. Then if the artifact is not only used to hold the native code, maybe we should rename it to tensorflow-core instead, wdyt?

That's true, but all the Java code in there is to support the native code, so I still think "native" makes sense. Right now "core" is used for the parent module in the hierarchy, so having one of its submodules also named "core" might be confusing... (I know naming is not an easy business...)

Got it. And that's the thing: they might contain more than tensorflow-native (or tensorflow-core) later; I guess we could decide at some point that the Keras API should be part of the common starters as well. So in this case, tensorflow or tensorflow-gpu sounds appropriate.

tensorflow could also be some other module that depends on tensorflow-native-platform and others. We don't need to cram everything into the fewest possible modules either, if it makes sense. tensorflow-gpu, obviously yes. Right now, for example, I've got tensorflow-platform, tensorflow-platform-gpu, tensorflow-platform-python, and tensorflow-platform-python-gpu in the presets, but again it's just a question of naming. We're talking about the same thing. "native" is redundant in the case of the presets, because that's all there is, but not in the case of a downstream project, where I think it makes things clearer to have names with "native" (I like that one), "jni" (but not so much this one, some people might not get it), etc.

Probably not, because I don't understand what you mean. I was not referring to external libraries like CUDA or MKL; I think I was just pointing out that the examples I gave for handling the different combinations of tensorflow-core, using the standard Maven dependency system, should be sufficient for most cases. I thought you wanted to introduce some more advanced/custom configuration variables at build time to achieve this.

It's sort of "advanced", I suppose, but it doesn't prevent us from doing things the "traditional" way. In other words, there are no drawbacks, which is confirmed by this scheme having been used in production in some extremely large corporations (that I'm not sure I'm allowed to mention) for many years already.

One thing I'd note is that it would be useful to have TensorFlow builds which include both MKL and CUDA support, as then the CPU fallback from GPU code is less painful. I understand that DL4J has completely separate bindings, which can cause a bit of a perf cliff if the GPU doesn't have an implementation of an operation. This might be less of an issue in TensorFlow, as it's got a better buildout, but it's worth keeping in mind.

As part of the CUDA build, DL4J now also supports MKL and MKL-DNN. I'm not sure why the official builds for TF still don't have MKL and MKL-DNN enabled, but then again, the official builds don't bundle CUDA or cuDNN either. We have to use Anaconda for that. It might have something to do with getting people onto Google's cloud with TPUs and stuff...

Artifacts: it sounds like @saudet has a clear picture in mind after all his work on JavaCPP - I've only used it via DL4J and some other projects, and while I appreciate the benefits as a user, I'm not deeply familiar with the internal layout. Maybe there are some primer docs (on bytedeco.org?) that the rest of us should read to better understand the more nuanced mechanics? At present I'm not in a position to provide meaningful feedback on this aspect.

There's some documentation about all this on these wiki pages:

It's obviously not very extensive though, so I would be happy to clarify anything.

karllessard commented 5 years ago

Thanks @saudet , we seem to be pretty much on the same page. I'm still not sure about core vs native though, as "the Java code in there is to support the native code" is not entirely true: there are higher-level APIs as well, like the operation wrappers. Or maybe we could move those into a separate core artifact... Anyway, we can address this later.

But now I would like to know how you visualize the migration to JavaCPP and what the next steps would be. Are you planning to create a PR? I guess there is also a lot of work involved in replacing the JNI layer, which unfortunately contains more logic than it should. Maybe we can start to request help from the SIG members?

saudet commented 5 years ago

Thanks @saudet , we seem to be pretty much on the same page. I'm still not sure about core vs native though, as "the Java code in there is to support the native code" is not entirely true: there are higher-level APIs as well, like the operation wrappers. Or maybe we could move those into a separate core artifact... Anyway, we can address this later.

I still consider the ops basically as wrappers around the native API, but obviously it's not clear cut. Anyway, it's always best to nail down the terminology we're going to use as early as possible. I don't mind "core" as a name, I just mind that the parent module is also named the same...

But now I would like to know how you visualize the migration to JavaCPP and what the next steps would be. Are you planning to create a PR? I guess there is also a lot of work involved in replacing the JNI layer, which unfortunately contains more logic than it should. Maybe we can start to request help from the SIG members?

I think I can manage to refactor as necessary within a day or two of work. For starters, it's pretty much just doing what I've done previously elsewhere, so it doesn't make sense for someone else to figure out again everything I had to go through. :) I just need to find a bit of time, which hasn't been easy this last week...

karllessard commented 5 years ago

@saudet : We'll probably have to stay in sync then, because I'm currently adding the NdArray interface to the Tensors, which has quite an impact on the memory allocation and data initialization done in the JNI layer, so I suggest that you start with anything unrelated to this.

I intend to propose soon to the SIG major changes on how we handle tensors and datatypes in general.

jxtps commented 5 years ago

When updating the tensors, please deprecate all known slow tensor creation operations so new users don't accidentally shoot themselves in the foot, thanks! (totally ok to have that deprecation be part of a different PR so to speak)

karllessard commented 5 years ago

@jxtps : yes, that was actually part of the plan 👍

Craigacp commented 5 years ago

I think it's a good idea to leave in the Tensor constructor that accepts an n-dimensional Java array. We should extensively document that it's slow to use, and possibly change its name to reflect that, but there should be an entry point that at least looks like regular Java code.

Some people don't care how slow it is while they are getting the rest of their system up and running, and it's easy to figure out what you feed to a constructor that wants a multidimensional array.

karllessard commented 5 years ago

I'm wondering though if those constructors are of any use in "real life". Allow me to make some presumptions here:

The two main reasons you might need to allocate a tensor are either to create a constant for an op or to feed data to a network. In the first case, my feeling is that most constants accepted by TensorFlow operations are of rank 0 or 1. In the second case, I think that if the data you feed is of rank 2 or higher, chances are that your dataset is too big anyway to be initialized declaratively in an n-dimensional Java array (e.g. new int[][][]{{{x, y, ...}, {y, z, ......}}, {.........}, {..........}}} ). You'll probably end up initializing your data row by row or something like that, which will be easy to do with the upcoming NdArray API.

So are those constructors mainly useful for writing quick samples for testing? If so, and if we decide to keep them, then yes, I guess we should keep them separate from the official Tensor API.

Craigacp commented 5 years ago

I think the feeding-data use case also covers test-time inference, and that's what I currently use the Java array entry points for. We expose TensorFlow via a generic model that accepts a sparse vector class, which we transform into a dense array (or multidimensional array) before converting it into a predictor. I could have written a JNI wrapper that directly does the conversion from my library's internal representation into a Tensor, but that's much more likely to go wrong than the Java code.

At test time I don't think that the inputs are too big to represent in a Java array, as you might be doing a single inference per call (or a small batch), depending on the larger system.

karllessard commented 5 years ago

@craigacp : OK, just to clarify, I was thinking of keeping the single-array Java entry points, but are you using the n-dimensional entry points as well (i.e. int[][], int[][][], ...)?

The only advantage I see in using them, compared to the upcoming NdArray entry point, is how easy it is to initialize the values as you declare the array (i.e. new int[][]{{0, 1}, {2, 3}}). But that only makes sense if your sample is small (even for test inference, it might not be the case). If you end up initializing the data of your n-dimensional array outside its declaration, isn't it better to use an NdArray?

But again, I intend to propose something better for handling n-dimensional tensors; I came up with a few ideas during my convalescence, so I think we'll have more to discuss on that topic in general.

Craigacp commented 5 years ago

Maybe we're talking about different entry points. I was advocating that the current Tensor.create(Object) should still accept multidimensional arrays. I agree the NdArray ones are preferable, but we shouldn't remove the current ones till the NdArray API has been in a release.
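
For concreteness, here is a minimal sketch of the contrast under the TF Java 1.x API (both Tensor.create overloads exist there; this is just an illustration, not a recommendation for the new API):

import java.nio.IntBuffer;
import org.tensorflow.Tensor;

public class TensorCreationSketch {
    public static void main(String[] args) {
        // Convenient but slow: Tensor.create(Object) inspects the nested
        // arrays reflectively and copies the data element by element.
        try (Tensor<?> slow = Tensor.create(new int[][]{{0, 1}, {2, 3}})) {
            System.out.println("rank = " + slow.numDimensions());
        }
        // Faster for the same 2x2 tensor: a flat buffer plus an explicit
        // shape, copied into native memory in bulk.
        IntBuffer data = IntBuffer.wrap(new int[]{0, 1, 2, 3});
        try (Tensor<Integer> fast = Tensor.create(new long[]{2, 2}, data)) {
            System.out.println("rank = " + fast.numDimensions());
        }
    }
}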

Anyway, I think I've wandered a bit off topic for a discussion of the repo layout, so I'll stop talking.

karllessard commented 5 years ago

but we shouldn't remove the current ones till the NdArray API has been in a release.

Yes, that I agree

Anyway, I think I've wandered a bit off topic for a discussion of the repo layout, so I'll stop talking.

Well, that brings up an interesting topic related to the repos: how much do we want the first release from those repos (i.e. 2.0.0) to be backward compatible with previous versions?

As discussed with @saudet, we'll pick different artifact names from the current ones, and I'm tempted to think that this opportunity to break backward compatibility is unique.

If we decide to do so, then we need to address those questions as much as possible now.

Craigacp commented 5 years ago

Are we tracking TensorFlow version numbers? The next release from this SIG will be quite different from TF 2 in Python - or it'll come quite a while later, after we've integrated something Keras-like.

karllessard commented 5 years ago

I think that, starting from our new repos, our artifact versions should be independent from the Python releases (and even from the C++ core ones).

So picking "2.0.0" is not about matching their features but about indicating that this version is a major update that is not backward compatible with previous versions of the Java client (if that's what we decide to do).

Now that we'll have our own release cycle, we can decide whatever versioning fits best. Do you have any suggestions?

Craigacp commented 5 years ago

I'd expect, given the marketing push around TF 2, that people would expect the Java TF 2 release to be similar to the Python one (e.g. eager by default, etc.). We'll need to extensively document and market the fact that we've detached the version numbers.

Is there going to be a final 1.x release? We could put out a 1.x+1 Java release as the first release, then work on what TF 2 looks like in Java.

karllessard commented 5 years ago

1.15 should be the last official release of TF with 1.x digits.

So we could go to 1.16 as our first release, but again, if we break backward compatibility, I think a jump to 2.x is justified, on our own terms.

karllessard commented 5 years ago

But thinking about it... 1.14 was not backward compatible with 1.13 as far as Java goes, so I guess it does not make a huge difference :)

Craigacp commented 5 years ago

Yeah, one day it would be nice to remove that unstable warning from the Java documentation. I don't think we'll be ready to do that in the first release though.

saudet commented 5 years ago

@karllessard Can you add me as a contributor to this repository so I can create branches directly and what not?

karllessard commented 5 years ago

@karllessard Can you add me as a contributor to this repository so I can create branches directly and what not?

Done

saudet commented 5 years ago

@karllessard I created the javacpp/core-api branch with new core-api and core-api-platform modules as a potential replacement for the JNI modules. After putting the C API in the org.tensorflow.c_api package, I figured core-api was a better name than native, also because native is a Java keyword, so we can't use it inside a package name or a JPMS module name, if we decide to support that at some point.

In any case, I didn't test it heavily, but at least this works on Linux:

<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.tensorflow</groupId>
    <artifactId>helloworld</artifactId>
    <version>2.0.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.tensorflow</groupId>
            <artifactId>core-api-platform</artifactId>
            <version>2.0.0-SNAPSHOT</version>
        </dependency>
    </dependencies>
</project>

import org.bytedeco.javacpp.*;
import org.tensorflow.c_api.*;
import static org.tensorflow.c_api.global.tensorflow.*;

public class HelloWorld {
    public static void main(String[] args) throws Exception {
        System.out.println("Running with TensorFlow " + TF_Version().getString());
    }
}

$ mvn clean compile exec:java -Djavacpp.platform.host
Running with TensorFlow 2.0.0-rc0

Barring some potential bugs, it works the same on Mac or Windows. Right now it builds for both the C and Java APIs, but the latter only to get the source code files. It bundles the library built for the C API, and compiles all the JNI and Java files separately, outside Bazel.

It's also really stripped down, but it's easy to add more platforms like Android or even iOS (/cc @johanvos), more extensions like -gpu or -python (including the required dependencies), map more native APIs like the C++ one, etc. I think it may very well evolve this way; that is, if/when people start showing interest in supporting this or that, we can add more features at that point.

Anyway, what do you think about how it looks at the moment @karllessard?

karllessard commented 5 years ago

Cool, thanks @saudet, I'll take a look! Do you want to create a PR, so I can start reviewing and adding my comments?

karllessard commented 5 years ago

Here: just letting you know that this issue has been moved to this new location