karllessard / tensorflow-java-sandbox

Experimenting on TensorFlow Java client modularization

Resource and Memory Management #3

Open karllessard opened 5 years ago

karllessard commented 5 years ago

This issue collects our ideas and proposals for improving the memory management scheme in the current TF Java client.

Currently, TF Java mostly relies on the try-with-resources approach to release native resources, particularly in graph execution mode.

Eager execution mode releases all resources allocated within an EagerSession when it is closed, and, while the session is alive, it also listens for objects discarded by the garbage collector at runtime to free their native counterparts.
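For context, here is a minimal sketch of what that try-with-resources idiom looks like today in graph mode (class names are from the existing API; graph construction is elided):

    try (Graph g = new Graph();
         Session s = new Session(g);
         Tensor<Float> input = Tensors.create(2.0f)) {
      // ... build the graph, then run it ...
      try (Tensor<?> output = s.runner().feed("x", input).fetch("y").run().get(0)) {
        // use `output`; every native resource above is freed when its block exits
      }
    }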

Can we do better?

saudet commented 5 years ago

Can we do better?

Yes :) In any case, one thing I consider important is interoperability with other libraries. A lot of users use, for example, OpenCV alongside TF in ways similar to what is exemplified in this blog post:

JavaCPP already supports PointerScope that way with the C API of TensorFlow as well, and if we implemented other schemes there that users might like, such as reference counting, they would automatically become available to all other libraries using JavaCPP.
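For readers who haven't used it, the basic pattern looks roughly like this (PointerScope and BytePointer are existing JavaCPP classes; the comments describe the general idea):

    // import org.bytedeco.javacpp.BytePointer;
    // import org.bytedeco.javacpp.PointerScope;
    try (PointerScope scope = new PointerScope()) {
        BytePointer data = new BytePointer(1024);  // allocated while the scope is open, so attached to it
        // ... pass `data` to any JavaCPP-based API (TF, OpenCV, ...) ...
    }  // closing the scope deallocates everything attached to it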

saudet commented 5 years ago

FWIW, MXNet decided to basically reimplement PointerScope, but in Scala as they are not using the Java language: https://cwiki.apache.org/confluence/display/MXNET/JVM+Memory+Management https://github.com/apache/incubator-mxnet/blob/master/scala-package/memory-management.md Obviously, they've missed the point about interoperability...

karllessard commented 5 years ago

I agree with you @saudet, it could make sense for JavaCPP to abstract most of the memory management details for the layers on top.

Also, for comparison, PointerScope seems to accomplish pretty much the same job as EagerSession, except that the latter also releases resources as they are discarded by the GC while still in "scope". Is that something we could also add to JavaCPP?

This allows recycling the memory used by unreferenced objects before the scope (or session) is closed.

Craigacp commented 5 years ago

I think binding the API to JavaCPP by surfacing its types in the public API (by exposing a PointerScope) is going to make it difficult to switch to whatever the next library is which makes native interop simpler.

With eager mode, tensors that are returned are likely to be flagged by IDEs as objects which should be closed, so what happens if they are closed? If they are cleaned up by the EagerSession when it goes out of scope then that's fine, but it will lead to a lot of noise while writing programs using it.

karllessard commented 5 years ago

I think binding the API to JavaCPP by surfacing its types in the public API (by exposing a PointerScope) is going to make it difficult to switch to whatever the next library is which makes native interop simpler.

Maybe EagerSession, Graph or whatever TF class we expose in the public API can extend PointerScope to inherit that functionality?

With eager mode, tensors that are returned are likely to be flagged by IDEs as objects which should be closed, so what happens if they are closed?

I think graph and eager mode should both use the same memory management scheme. Right now they are distinct because I didn't want to touch existing mechanisms for graphs when I added eager environment support. That being said, in eager mode if you explicitly close a Tensor, then the tensor will be released immediately as expected and the session won't attempt to discard it later.

If they are cleaned up by the EagerSession when it goes out of scope then that's fine, but it will lead to a lot of noise while writing programs using it.

I'm fairly neutral on this, but here is one idea: if we decide to apply the eager memory management scheme to graphs, then we won't use try-with-resources anymore, so Tensor won't need to be (Auto)Closeable. Still, we should let users release the memory explicitly if they really want to, maybe via another method (e.g. release()) to keep IDEs happy?
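Just to illustrate the shape of that idea (everything here is hypothetical, including the release() name and the placeholder calls):

    // Hypothetical: Tensor is no longer (Auto)Closeable, so IDEs stay quiet,
    // but an explicit release() remains for users who want eager cleanup.
    Tensor<Float> t = someEagerOperation.asOutput().tensor();  // tracked by the enclosing EagerSession
    doSomethingWith(t);
    t.release();  // optional: free the native buffer now instead of waiting for the GC or session close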

Craigacp commented 5 years ago

Yeah, if we remove the AutoCloseable interface from it then it will stop being flagged in the IDE (we can leave the close method to allow the user to clean it up manually). The issue is that if we're still supporting graph mode, those tensors will need a containing scope (maybe the standard Session object?) which can release them when it goes out of scope. I've been working on an ONNX Runtime Java wrapper which only creates tensors from an ONNX session object, so the session can track them, but we don't create TF Tensors like that at the moment.

We should probably do some testing though, as relying upon the GC to free resources when they go out of scope might cause some issues. Any ML code will probably have very little allocation in the main loop, and the input and output Tensors at inference time are very small objects (because they are basically a pointer), so they might not create enough GC pressure to trigger a GC. Without a GC run, the TF Tensors on the native heap won't be cleaned up, so we could allocate way too much native memory. Reference counting would be significantly more deterministic, but I'm not sure how that would be implemented without the user of the TF API having to understand that the memory was tracked with reference counting.

One further question, @saudet: how does the JavaCPP wrapper interact with the ByteBuffer representation of the Tensors? Can JavaCPP convert its Pointers into a ByteBuffer (or an appropriately subtyped Buffer)?

karllessard commented 5 years ago

We should probably do some testing though, as relying upon the GC to free resources when they go out of scope might cause some issues.

That's what I think too: relying on the GC can be unpredictable, but we don't know yet if it's a real issue in our case. When I added EagerSession, I did a few tests that seemed to give positive results, but more extensive testing would be required if we want to apply that scheme across the whole library (and to see if object scoping is required at all).

For reference counting, @EronWright, can you explain your solution a bit? I know it's not ready and I might be fast-forwarding here, but looking at the link you provided me, it looks like allocated objects must be unreferenced manually, via unref(), and that forgetting to do it will lead to a memory leak?

IMHO, the Java API should hide as much as possible the fact that users are dealing with a native library, and we should stick to Java idioms, where users don't have to worry much about memory deallocation. Unfortunately, I don't know how we could mimic C++ smart pointers in Java in that respect.

karllessard commented 5 years ago

One further question, @saudet: how does the JavaCPP wrapper interact with the ByteBuffer representation of the Tensors? Can JavaCPP convert its Pointers into a ByteBuffer (or an appropriately subtyped Buffer)?

@Craigacp, I know that the question is not addressed to me, but I don't think we have to map JavaCPP pointers to ByteBuffer, as the latter only supports 32-bit indexing. My plan was to map those pointers directly to the DataBuffer interface, which is used by the upcoming NdArray API.

Craigacp commented 5 years ago

@karllessard sure, that would work, but it will trigger a JNI call for every put or get, as it's a wrapper around a pointer to the native heap. That leads to the complexity that ND4J has, where it strongly prefers that you never pull data into Java for operations and only work on it with the provided native methods (i.e. you can't map a lambda element-wise over a tensor). From what I understand of ByteBuffer, that doesn't happen because the JVM understands the ByteBuffer, so it doesn't trigger a JNI call.

EronWright commented 5 years ago

@karllessard when we talk about reference counting and the need to manually manage the reference count, that is true, but there are a few points to make:

  1. An object based on AutoCloseable must also be manually dereferenced (closed), but is limited to a reference count of one.
  2. A reference-counted object should have a finalizer to avoid a leak.
  3. Netty-style reference counting may actually simplify the user code in certain access patterns, e.g. feeding tensors to a call to run(). In this example, observe that the feeds are simpler to work with, and the fetches are properly cleaned up.
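For illustration, here is a rough sketch of what such a reference-counted handle could look like (all names are made up for this sketch; it is not an existing TF Java or Netty type). A run() call could then unref() the feeds it consumed, as in point 3:

    // import java.util.concurrent.atomic.AtomicInteger;
    final class RefCountedHandle implements AutoCloseable {
        private final AtomicInteger refCount = new AtomicInteger(1);  // starts owned by its creator
        private final Runnable deallocator;                           // frees the native resource

        RefCountedHandle(Runnable deallocator) { this.deallocator = deallocator; }

        RefCountedHandle retain() { refCount.incrementAndGet(); return this; }

        void unref() {
            if (refCount.decrementAndGet() == 0) {
                deallocator.run();  // freed exactly once, when the last owner lets go
            }
        }

        // per point 1: AutoCloseable then becomes just the special case of a count of one
        @Override public void close() { unref(); }
    }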

karllessard commented 5 years ago

@karllessard sure, that would work, but it will trigger a JNI call for every put or get, as it's a wrapper around a pointer to the native heap.

Oh, I'm not aware of the internal details of JavaCPP yet, but if it does trigger a JNI call for every I/O operation then yes, I agree, we should map it to a ByteBuffer to avoid that.

EronWright commented 5 years ago

@karllessard regarding the use of direct buffers in the NdArray API, I'm concerned about relying only on GC. I doubt that the GC does a good job of accounting for direct buffers (e.g. whether they're considered a 'large' object). One often resorts to sun.misc.Unsafe to clean them up explicitly.
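For reference, that explicit cleanup typically looks something like this (JDK 9+ exposes Unsafe.invokeCleaner; getting hold of the Unsafe instance itself still requires reflection, and exception handling is omitted in this sketch):

    // import java.lang.reflect.Field;
    // import java.nio.ByteBuffer;
    // import sun.misc.Unsafe;
    Field f = Unsafe.class.getDeclaredField("theUnsafe");
    f.setAccessible(true);
    Unsafe unsafe = (Unsafe) f.get(null);

    ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20);
    // ... use the buffer ...
    unsafe.invokeCleaner(buffer);  // frees the direct buffer's native memory immediately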

Check out the Netty project's efforts along those lines, the ByteBuf class which is a reference-counted ~alternative to~ wrapper for ByteBuffer. See UnpooledDirectByteBuf::freeDirect. More info on the use of Netty's buffer library (in a standalone manner), here. Unfortunately it uses 32-bit indexes.

karllessard commented 5 years ago

In this example, observe that the feeds are simpler to work with, and the fetches are properly cleaned up.

@EronWright that is what worries me a bit: the user needs a deep knowledge of how the memory is managed to understand what has been freed and what must still be freed manually.

In your example, the tensor x is automatically freed after running the session, but the only way to understand that this happened is to read the documentation. And I must be careful to unreference y explicitly to avoid a memory leak. BTW, relying on the finalize() method is not well advised either; it is recommended to use a ReferenceQueue of phantom references instead.
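To make that last suggestion concrete, the pattern is roughly the following (a generic sketch; nativeFree() stands in for whatever actually releases the native handle):

    // import java.lang.ref.PhantomReference;
    // import java.lang.ref.Reference;
    // import java.lang.ref.ReferenceQueue;
    // import java.util.Map;
    // import java.util.concurrent.ConcurrentHashMap;
    final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    final Map<Reference<?>, Long> nativeHandles = new ConcurrentHashMap<>();

    void track(Object owner, long nativeHandle) {
        // the phantom reference is enqueued once `owner` becomes unreachable;
        // keeping it as a map key keeps the reference object itself alive
        nativeHandles.put(new PhantomReference<>(owner, queue), nativeHandle);
    }

    void drain() {  // called periodically, or from a background thread
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            Long handle = nativeHandles.remove(ref);
            if (handle != null) {
                nativeFree(handle);  // release the native counterpart of the collected object
            }
        }
    }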

regarding the use of direct buffers in the NdArray API, I'm concerned about relying only on GC.

I'm not 100% convinced either; that is why I was thinking that some more extensive testing might give us answers. But we can still continue to support instance scoping as well (like EagerSession does) as a failsafe, so instances are GC'd only while the session is alive but are freed automatically when the session scope is auto-closed.

Check out the Netty project's efforts along those lines

Very interesting link indeed; it looks like we can learn from Netty's experience as well. I'll dig in further a bit later, but at a quick glance, this is interesting:

"To be reclaimed, java.nio.ByteBuffer relies on the JVM garbage collector. It works OK for heap buffers, but not direct buffers. By design, direct buffers are expected to live a long time. Thus, allocating many short-lived direct NIO buffers often causes an OutOfMemoryError. Also, deallocating a direct buffer explicitly using the (hidden, proprietary) API isn't very fast."

karllessard commented 5 years ago

Throwing out ideas here for ref counting: maybe the lifetime of a tensor could be made more explicit to the user with some metadata, like different class names, factory names or attributes, depending on whether the tensor is allocated for a single use or lives until the user frees it manually?

EronWright commented 5 years ago

Regarding how to simplify the cleanup of tensors, I like the notion of an auto-closeable scope which acts as a tensor factory. If we also implement refcounting, such a scope would simply decrement the refcount of any allocated tensor for easy cleanup. Meanwhile, if one wished to retain a tensor beyond the scope's life, one would simply increment the refcount before exiting the scope. Related: JavaCPP's PointerScope.

I'll also point out a weakness of relying on AutoCloseable for Tensor: the try-with-resources idiom is not well suited to async-style programming (e.g. CompletableFuture).
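A tiny example of the mismatch (predict() is a placeholder for any asynchronous work that consumes the tensor):

    // The tensor is closed when the try block exits, which typically happens
    // before the asynchronous task that captured it gets a chance to run.
    try (Tensor<Float> input = Tensors.create(1.0f)) {
        return CompletableFuture.supplyAsync(() -> predict(input));
    }  // `input` is already released here; the async task may see a dead tensor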

Craigacp commented 5 years ago

@EronWright You should have a look at the ByteBuffer replacement that will hopefully arrive in JDK 14 as the first part of the OpenJDK Panama project - https://www.youtube.com/watch?v=r4dNRVWYaZI&list=PLX8CzqL3ArzXFRVYmbvZfZQ0CQMfIIFCH&index=4&t=3s. It's also scoped, and you can extend memory beyond the life of the scope (though I think the API and semantics for that are still being ironed out).

I'd like to try and avoid having any of these types (ByteBuffer, PointerScope etc) appear in the public API if possible, as then when the native memory access work lands we can have a multi-release jar (https://openjdk.java.net/jeps/238) automatically load an API implementation which avoids the JNI overhead or ByteBuffer for native accesses. That jar will still work on Java 8, just using whatever path we decide here, and on 14+ it will use the new JVM features which should make things faster and safer.

saudet commented 5 years ago

Sorry for the delay guys! A lot happening here. Let's see, many things to reply to. :)

First of all, let's agree to not rely on GC for anything. The reason is pretty simple: it is not aware of anything other than on-heap memory, and the only way to trigger GC on the JVM is with System.gc(), which is not guaranteed to do anything at all, but when it does run it typically forces a "full GC", which can potentially "stop the world" for several seconds on large heaps (anything more than a few GB). You do not want to do that. The only thing the GC should do is mop up (using PhantomReference and ReferenceQueue, not finalize()) anything that wasn't deallocated some other way, and that's how PointerScope does it. Each instance holds strong references to everything, so nothing happens with GC, unless we forget to call close() that is. However, whether scoped or not, what we can do is call Pointer.close()/deallocate(), and it will get deallocated at that point (and not again later, so it doesn't cause "double free" errors). How does that sound, @karllessard?
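A quick sketch of that combination (FloatPointer is an existing JavaCPP class; the behavior in the comments is the one described above):

    try (PointerScope scope = new PointerScope()) {
        FloatPointer p = new FloatPointer(1000);  // attached to `scope` when allocated
        // ... use p ...
        p.close();  // same as deallocate(): freed right now, and marked so the
                    // scope won't try to free it again when it closes
    }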

To support reference counting, I'm thinking we could implement it by allowing objects to belong to multiple scopes and have them deallocated only when all their scopes are closed. PointerScope already allows users to attach() and detach() objects, but they get deallocated right away when any one of the scopes closes. :-/ I think that can be enhanced though and would be a user-friendly way of doing reference counting. Do you see downsides to that approach @EronWright?

For what it's worth, because of all the limitations mentioned above, I now consider the Buffer API as legacy. We can convert Pointer back and forth to Buffer with the new BytePointer(ByteBuffer buffer), etc. constructors and asBuffer() methods, though. To support fast access with 64-bit indexes, I created the indexer package, see http://bytedeco.org/news/2014/12/23/third-release/ for an example. It supports types like half-floats and bfloat16 too, and ND4J relies on it as well. (BTW @Craigacp, the reason why ND4J doesn't support lambda ops written in Java is that there wasn't a nice way to implement a callback from CUDA, but that's being worked on.) Indexers basically use sun.misc.Unsafe when available, but fall back on java.nio.Buffer when it's not available (on Android, etc.). That's how Aeron does it too. It's a pretty standard approach actually, and I'm just trying to standardize the API somewhat since OpenJDK has not...
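A small example of the indexer package (FloatPointer and FloatIndexer are existing JavaCPP classes; the details of this particular snippet are from memory):

    // import org.bytedeco.javacpp.FloatPointer;
    // import org.bytedeco.javacpp.indexer.FloatIndexer;
    FloatPointer data = new FloatPointer(1_000_000L);  // native allocation, sized with a 64-bit length
    FloatIndexer idx = FloatIndexer.create(data);      // uses sun.misc.Unsafe when available
    idx.put(999_999L, 42.0f);                          // long (64-bit) indexes
    float v = idx.get(999_999L);
    data.deallocate();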

@Craigacp I also think it's a good idea if we can find a way to not expose the API of JavaCPP, yes. Let's think of how we could have EagerSession, etc. use PointerScope or whatever else without exposing users to the Pointer objects and such. Would you have something to propose? BTW, if you're referring to MemorySegment from Panama, that only supports off-heap memory. It doesn't support other types of resources like GPU memory, so I don't think it's something that is going to be useful for TF. I'd really like Panama to work on APIs that we can use with libraries like TF, but after 5 years now, I've given up on trying to make them understand the importance of this. As I mentioned previously, if you do know of a way to convince them though, that would be great.

karllessard commented 5 years ago

@saudet, I'm comfortable with what you are suggesting, but let me add some clarifications to my initial proposal.

First of all, let's agree to not rely on GC for anything.

OK, I understand that the GC does not keep track of native memory and therefore we cannot rely on it alone to release our resources, that is clear. But the way I'm suggesting to use it (and how EagerSession does it) is more as a fallback that lets us free large numbers of objects allocated on the heap and backed by a native resource before the scope is closed.

I ran some tests here to help understanding:

    try (EagerSession session = EagerSession.create()) {
      Ops tf = Ops.create(session);

      Operand<Float> value = tf.constant(0.0f);
      Constant<Float> inc = tf.constant(1.0f);

      for (int i = 0; i < 100000; ++i) {
        Add<Float> sum = tf.math.add(value, inc);
        assertThat(sum.asOutput().tensor().floatValue() == value.asOutput().tensor().floatValue() + 1.0f);
        value = sum;
      }
    }

Every time we fetch the result of an eager operation, like sum.asOutput().tensor().floatValue() above, a Tensor object gets allocated. Also, the operation itself (sum) is backed by a native resource that must be released (even though it does not hold a buffer the way the tensor does).

By adding some debug statements, running this code gave me this output:

Resources cleared by garbage collection: 136396
Resources cleared by session close: 63607

Now if I only rely on the scope, which is actually feasible by allocating the EagerSession like this:

    try (EagerSession session = EagerSession.options()
          .resourceCleanupStrategy(ResourceCleanupStrategy.ON_SESSION_CLOSE)
          .build()) {
      Ops tf = Ops.create(session);

      Operand<Float> value = tf.constant(0.0f);
      Constant<Float> inc = tf.constant(1.0f);

      for (int i = 0; i < 100000; ++i) {
        Add<Float> sum = tf.math.add(value, inc);
        assertThat(sum.asOutput().tensor().floatValue() == value.asOutput().tensor().floatValue() + 1.0f);
        value = sum;
      }
    }

I obtain (as expected) the following output:

Resources cleared by garbage collection: 0
Resources cleared by session close: 200003

So by listening to the garbage collector, we can automatically release a huge number of obsolete resources even before the session is closed. But again, it only works if there are a lot of objects allocated on the heap, otherwise the GC won't kick in.

Also note that in this example, we couldn't start a new scope for each iteration of the loop (which might not have been efficient anyway) ~nor release the tensors manually because the same tensor is reused in the next iteration as value.asOutput().tensor()~ see addendum below.

That being said, if everybody thinks we should rely exclusively on reference scoping, I'll go with the group, but please keep eager execution in mind when you think about a solution, because it brings many more challenges than graph mode does.

Addendum: actually, I could have closed the resources manually with the current API, like this, at the cost of some additional complexity:

    for (int i = 0; i < 100000; ++i) {
      Add<Float> sum = tf.math.add(value, inc);
      Tensor<Float> previousResult = value.asOutput().tensor();
      assertThat(sum.asOutput().tensor().floatValue() == previousResult.floatValue() + 1.0f);
      previousResult.close();
      // this is not supported but could be, to release the previous `sum` operation and maybe implicitly its tensor, using reference counts...
      // value.close();
      value = sum;
    }

saudet commented 5 years ago

@karllessard Ah, I see what you're worried about now. The idea with "scopes" is to replicate as much of what we can do in C++ with what's being called "RAII" these days, because that's what native libraries rely on. There are even libraries like HDF5 that are picky about the order in which their objects are deallocated, where if x is allocated before y, then x must be deallocated after y, or it crashes the whole process! That's not something we can reproduce with GC. To be clear, in C++ we could have:

{
    SomeClass x,y,z;
    // ...
    {
        SomeClass a,b,c;
        // ...
        // a,b,c get deallocated here
    }
    // x,y,z get deallocated here
}

We can do something very similar in Java with try-with-resources and PointerScope, since they get pushed to a thread-local stack, similar to how C++ does it:

try (var s = new PointerScope()) {
    SomeClass x = ..., y = ..., z = ...;
    try (var s2 = new PointerScope()) {
        SomeClass a = ..., b = ..., c = ...;
    }
}

And thinking more about how we could abstract these details away from users, as well as make this more usable across libraries, we could create more specific scopes too, for example, a TensorScope, and have something like PointerScope just be the "catch-all" scope when JavaCPP is used as "back end". It could look like this:

try (var ps = new PointerScope()) {
    try (var ts = new TensorScope()) {
        Mat m = ... // OpenCV Mat -> gets attached to the "global" PointerScope
        Tensor t = ... // TF Tensor -> gets attached to the "local" TensorScope
        // ...
        // t gets deallocated here
    }
    Tensor t2 = ... // gets attached to the "global" PointerScope
    // t2 gets deallocated here
    // m gets deallocated here
}

@Craigacp Would you be OK with that?

For that, we're going to need a class other than EagerSession, which could still be a scope in itself though, but if the idea is to let users create standalone Tensor objects, then we're going to need something more anyway.

Craigacp commented 5 years ago

@saudet I think that a TensorScope seems like a good idea, to mirror the EagerSession? I would prefer not to have PointerScope be part of the API, the same way that Panama's MemorySegment shouldn't be in there. I'm a little confused by PointerScope supporting GPU memory, I wasn't aware that CUDA always mapped GPU memory into the CPU's address space such that it was visible from a CPU instruction like those emitted by JavaCPP. Does JavaCPP emit CUDA deallocation instructions to free the pointers?

@karllessard One thing I'd keep in mind is that your numbers are a function of the memory size that you gave the JVM; if you set -Xms and -Xmx appropriately, then you probably wouldn't observe any GC runs during your test loop. Provided it's not the only way of cleaning up tensors (i.e. they have an optional close method), then I've no issue with having an extra codepath to clean things up when GC happens. As the others said, relying upon GC does induce variability (e.g. OpenJDK has 4 or 5 GC algorithms, and changed the default between 8 and 9; OpenJ9 has yet another set of GCs, and Azul also has a different algorithm), but I think in many cases what matters is not leaking memory rather than having completely deterministic behaviour on when it's freed (provided it's freed by point X, which is true of the EagerSession).

I think my preference would be to have things be AutoCloseable in graph mode and use the EagerSession to clean up tensors in eager mode, but that causes issues as many static analysers and IDEs complain if you don't close AutoCloseable resources, and I'd prefer that code that uses TF not generate a bunch of warnings. Having a closable scope of some kind (either a TensorScope in graph mode or EagerSession) which cleans up the tensors seems like a good compromise, along with an optional close method if the user wants memory freed immediately, but I'd prefer the EagerSession and TensorScope to produce the Tensors where possible, so it's obvious which scope a Tensor binds to.

saudet commented 5 years ago

@saudet I think that a TensorScope seems like a good idea, to mirror the EagerSession? I would prefer not to have PointerScope be part of the API, the same way that Panama's MemorySegment shouldn't be in there. I'm a little confused by PointerScope supporting GPU memory, I wasn't aware that CUDA always mapped GPU memory into the CPU's address space such that it was visible from a CPU instruction like those emitted by JavaCPP. Does JavaCPP emit CUDA deallocation instructions to free the pointers?

In this case, something like PointerScope wouldn't be part of the API, it would be used by TensorScope to interoperate with it. We can plug in any kind of deallocators we want. It's basically what they've added in JDK 9 with the Cleaner API. See, for example, this guy here calling TF_DeleteBuffer(): https://github.com/karllessard/tensorflow-java/pull/2/files/d26fee923ed57cc482dfa3b365e0e4218e1ce10e#diff-de98a3a669dee952de626f83db0858f0 FWIW, this kind of API really needs to be in the JDK, but OpenJDK hasn't been too nimble since JDK 6 or so, and I'm just doing the next best thing, trying to show that standardizing things like that is actually useful.
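For comparison, the JDK 9 Cleaner pattern mentioned here looks roughly like this (nativeFree() and the handle are placeholders for the real deallocation call):

    // import java.lang.ref.Cleaner;
    final class NativeBuffer implements AutoCloseable {
        private static final Cleaner CLEANER = Cleaner.create();

        private final Cleaner.Cleanable cleanable;

        NativeBuffer(long nativeHandle) {
            // the cleanup action must not capture `this`, or the object can never become unreachable
            this.cleanable = CLEANER.register(this, () -> nativeFree(nativeHandle));
        }

        @Override
        public void close() {
            cleanable.clean();  // explicit release, runs the action at most once;
                                // the Cleaner also runs it if the object is GC'd without close()
        }

        private static native void nativeFree(long handle);
    }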

Pointer is just a fancy container for a 64-bit integer, so it can point to GPU memory. Panama's Pointer is the same for that matter, but the issue with Panama's Pointer is that all the surrounding framework was designed only for C (not C++) and CPU memory only (like the Scope class), so it's only barely usable for the simplest libraries and APIs. I suppose we should be able to set the fields of MemorySegment at will with JNI or something as with Buffer, but I'm not finding anything about this. I think Panama is assuming people will eventually create MemorySegment out of Pointer, but until the latter becomes part of the JDK, it looks like we're going to need to allocate all memory with MemorySegment itself, which means that we won't even be able to use it to access host memory that was allocated and pinned by CUDA! That sounds bad, real bad, but it's the kind of mediocrity that I've come to expect from Panama... I have much more hope in the GraalVM team: https://blogs.nvidia.com/blog/2019/09/12/oracle-openworld/

I think my preference would be to have things be AutoCloseable in graph mode and use the EagerSession to clean up tensors in eager mode, but that causes issues as many static analysers and IDEs complain if you don't close AutoCloseable resources, and I'd prefer that code that uses TF not generate a bunch of warnings. Having a closable scope of some kind (either a TensorScope in graph mode or EagerSession) which cleans up the tensors seems like a good compromise, along with an optional close method if the user wants memory freed immediately, but I'd prefer the EagerSession and TensorScope to produce the Tensors where possible, so it's obvious which scope a Tensor binds to.

Keep in mind that there are other resources besides Tensor that need to be managed that way, so we'll probably need more than just that. In any case, the thing with something like TensorScope is that it covers objects returned from methods too. Here's how Karl's example above could look:

for (int i = 0; i < 100000; ++i) {
    Add<Float> sum = tf.math.add(value, inc);
    try (var ts = new TensorScope()) {
        assertThat(sum.asOutput().tensor().floatValue() == value.asOutput().tensor().floatValue() + 1.0f);
    }
    sessionScope.detach(value); // I guess?
    value = sum;
}

Consequently, it's not something that can always be used as a factory pattern. Yes, your IDE will complain about it too, but I think Java should really allow constructs like the following:

    try (new TensorScope()) {
        // ...
    }

When people start realizing the advantages of this, I'm sure that will come to be...

karllessard commented 5 years ago

In this case, something like PointerScope wouldn't be part of the API, it would be used by TensorScope to interoperate with it.

@saudet, here is what I suggest for the class hierarchy. Right now, there are two classes that act as resource managers, and those are good candidates to remain auto-closeable: EagerSession and Graph. Both implement the ExecutionEnvironment interface. Maybe we should change that interface into a base class that interacts with PointerScope? Then we won't need an additional TensorScope.

Note: I tend to think that Graph should be renamed to GraphSession and that the Graph instance should be issued from the session instead of the opposite, as it is now, but that's another topic.

So instead of creating new, smaller TensorScopes like in your previous snippet, do you see the following as a potential workaround?

try (EagerSession session = EagerSession.create()) {
    for (int i = 0; i < 100000; ++i) {
        Add<Float> sum = tf.math.add(value, inc);
        assertThat(sum.asOutput().tensor().floatValue() == value.asOutput().tensor().floatValue() + 1.0f);       
        session.clear();  // optional, acts kind of like a `System.gc()` but for native resources, and one we can control
        value = sum;
    }
}

Now, if it is not possible to move the GC "best-effort" logic to JavaCPP, I would like to keep it at least in the EagerSession, because I find it useful for automatically cleaning up small resources, such as the orphaned operations in my example (sum). We will need to find a way to interact somehow with the underlying PointerScope so it can continue to work properly.

UPDATE: I'm not sure about my previous snippet either, as session.clear() could also release tensors from outside the loop, which might still be in use. So I guess explicitly closing those tensors with the close() method, or having the ability to define a smaller scope using a TensorScope within an EagerSession, could do it then.

Craigacp commented 5 years ago

@saudet I think Java is unlikely to grow the kind of scope that you mention at the end as it's hard to check statically. You'd need to issue a compile time error if there was no enclosing scope, but that requires some magic in the class that's a child of that scope to indicate that it only works inside a scope. Java's mechanism for doing that is by making a child object out of the scope so the scope needs a name.

MemorySegment should be able to wrap other memory that was allocated on the native heap, it just uses a different creation path than the standard method. Panama's Pointer will be reimplemented on top of the memory segment API. In JavaCPP if you can wrap GPU pointers how do you prevent segfaults generated by Java CPU code dereferencing a GPU pointer?

Java as a platform moves carefully as they have to bring along millions of developers. GraalVM has been moving a little quicker as they are new, and also they aren't modifying the Java platform (i.e. language and class libraries). Both teams collaborate extensively, and GraalVM itself is based on Hotspot.

saudet commented 5 years ago

@saudet, here is what I suggest for the class hierarchy. Right now, there are two classes that act as resource managers, and those are good candidates to remain auto-closeable: EagerSession and Graph. Both implement the ExecutionEnvironment interface. Maybe we should change that interface into a base class that interacts with PointerScope? Then we won't need an additional TensorScope.

We can certainly do that. I'll leave you guys to debate the high-level details. :) However, if we do that, we won't be able to do the kind of thing that @EronWright wants to do, that is, for example, sharing the same Tensor instance between sessions. TensorFlow itself allows that. If one day I do need to do that kind of thing that the Java API doesn't allow, I'll knock on your door, but it's not something on my laundry list at the moment.

Now, if it is not possible to move the GC "best-effort" logic to JavaCPP, I would like to keep it at least in the EagerSession, because I find it useful for automatically cleaning up small resources, such as the orphaned operations in my example (sum). We will need to find a way to interact somehow with the underlying PointerScope so it can continue to work properly.

UPDATE: I'm not sure about my previous snippet either, as session.clear() could also release tensors from outside the loop, which might still be in use. So I guess explicitly closing those tensors with the close() method, or having the ability to define a smaller scope using a TensorScope within an EagerSession, could do it then.

That could also be where a PointerScope that counts references with attach() and detach() comes into play. On a call like session.detach(tensor), if that tensor is attached to no other scope/session, it gets deallocated by default, but since we might not always want to deallocate it, we could also have an overload that doesn't.

@saudet I think Java is unlikely to grow the kind of scope that you mention at the end as it's hard to check statically. You'd need to issue a compile time error if there was no enclosing scope, but that requires some magic in the class that's a child of that scope to indicate that it only works inside a scope. Java's mechanism for doing that is by making a child object out of the scope so the scope needs a name.

We don't need a scope. If we call, say, new Tensor() outside of any scope, then it stands alone and there's absolutely nothing wrong with that. Sure, it's not exactly the kind of thing we need for 100% pure Java code, but we're not working with 100% pure Java code! Native libraries do need a mechanism like this, one that Python provides, that Swift provides, but that Java does not provide. That reminds me of people who used to be scared of exceptions in Ada or Smalltalk because languages like Pascal and C had no such beast. I'm sure you'll get over your discomfort eventually.

MemorySegment should be able to wrap other memory that was allocated on the native heap, it just uses a different creation path than the standard method. Panama's Pointer will be

Where is this documented? Why do you keep that information secret?

reimplemented on top of the memory segment API. In JavaCPP if you can wrap GPU pointers how do you prevent segfaults generated by Java CPU code dereferencing a GPU pointer?

It doesn't. How does MemorySegment handle this? Please provide information! It must be my imagination, but people from Oracle have a tendency to keep everything secret. If you want others to use stuff like MemorySegment, you're going to need to make at least some amount of information public instead of just keeping everything secret!

Craigacp commented 5 years ago

@saudet it's not secret, Maurizio talked about it a little in his JVMLS talk (https://youtu.be/r4dNRVWYaZI) and the questions afterwards, and you can make a segment out of a ByteBuffer as documented in the Javadoc and other writeups on the public OpenJDK mailing lists. You can check out the branch and see what it's capable of, it's all developed in the open.

MemorySegment doesn't do sensible things with GPU pointers; I was just pointing out that JavaCPP doesn't either, when you said earlier that its Pointer was capable of working with them. I can store GPU pointers in a long, but that doesn't mean they can be used from Java.

I'm also not saying that Java shouldn't gain any particular functionality, just that this doesn't mesh with Java's current abstractions. I'd like scopes that are visible to the nested code, as that makes it easier to implement autodiff with a gradient tape, but it's not straightforward to add to Java. The language evolves over time, but deliberately, as it has millions of users of varying skills, and the people who work on the language design are extremely good at their jobs (and have most of those discussions on the public mailing lists; check one of the OpenJDK "-spec-experts" lists to see the kind of thing I mean). I've asked for features before and there are well-reasoned answers for why they wouldn't fit with Java, or for the steps necessary before that feature can be tackled.

Anyway, I think this argument is distracting from the main purpose of this thread, so I'm going to stop talking about this now. We seem to be in agreement about the shape of the resource management API, so I'll only comment on that from now on.

saudet commented 5 years ago

@saudet it's not secret, Maurizio talked about it a little in his JVMLS talk (https://youtu.be/r4dNRVWYaZI) and the questions afterwards, and you can make a segment out of a ByteBuffer as documented in the Javadoc and other writeups on the public OpenJDK mailing lists. You can check out the branch and see what it's capable of, it's all developed in the open.

So it relies on the Buffer API for that? I thought you were talking about something to replace that API. Are you saying we can't use MemorySegment without Buffer?

That's the kind of thing I'm talking about. I'm not finding any useful information about it! If you do not have any more information, just say so, and let's stop arguing about something that's probably not going to be usable anytime soon. My discussions on the OpenJDK mailing lists always end up that way too with them just ignoring my questions. Please, just say "I do not know", and let's not mention MemorySegment again unless we have more information! Or anything else for which we do not have enough actionable information for that matter.

Craigacp commented 5 years ago

@saudet the javadoc is here (http://cr.openjdk.java.net/~mcimadamore/panama/memaccess_javadoc/jdk/incubator/foreign/package-summary.html) and the source is here (http://hg.openjdk.java.net/panama/dev/shortlog/c3d683123981). I gave you a link to the most recent talk about the current state, where wrapping off-heap pointers was discussed; I was in the audience for that talk and asked questions about it. There may also be talks about it at Code One, which is happening this week; I haven't checked.

saudet commented 5 years ago

@Craigacp Ok, good to know you've already asked them about it. I didn't take the time to listen to the whole video, so that's cool. I mean, there is a public constructor here, but the name of the class doesn't give me the impression that this is supposed to be part of the public API: https://github.com/openjdk/panama/blob/foreign-memaccess/src/jdk.incubator.foreign/share/classes/jdk/internal/foreign/MemorySegmentImpl.java#L55 It looks like something useful, if they get it right, but I feel like their priorities are elsewhere and I'm always left scratching my head about what they're trying to do...

BTW, about the PointerScope vs TensorScope debate, I've just realized that we don't actually need to implement anything else than what the graph/session/ExecutionEnvironment objects are already doing. For "power users" like myself and @EronWright, if the C API of TF is accessed with JavaCPP in the background, we'll be able to use PointerScope anyway for anything we want, and that's perfect for me!

karllessard commented 5 years ago

BTW, about the PointerScope vs TensorScope debate, I've just realized that we don't actually need to implement anything else than what the graph/session/ExecutionEnvironment objects are already doing. For "power users" like myself and @EronWright, if the C API of TF is accessed with JavaCPP in the background, we'll be able to use PointerScope anyway for anything we want, and that's perfect for me!

Yep, that's what I think too. And I think that for the "power users" who want to control the deallocation of their tensors themselves, @EronWright's reference counting could provide a very flexible way of doing it. I would still have a TensorScope that wraps (or inherits from) a PointerScope, just to make it more specific to the tensor domain (unless we want to apply it to all kinds of resources).

saudet commented 4 years ago

When thinking about how to implement high-level classes like TensorScope, I came up with the idea of letting users specify the classes that a PointerScope should be allowed to be used with, so we can create an instance like new PointerScope(TF_Tensor.class, ...) that would accept only those classes. Even if it's not something that we need as part of the high-level API, it wouldn't be too hard to use anyway.

karllessard commented 4 years ago

Just letting you know that this issue has been moved to this new location.

saudet commented 4 years ago

For reference as well, I've continued that discussion on the mailing list: https://groups.google.com/a/tensorflow.org/forum/#!topic/jvm/FyXjlFmmTJI TL;DR version: I added Netty-like reference counting to JavaCPP's Pointer/PointerScope so we can base a high-level API on that.
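Roughly, it surfaces in user code along these lines (a sketch; the retainReference()/releaseReference() names and exact semantics shown here are an assumption, not confirmed in this thread):

    try (PointerScope scope = new PointerScope()) {
        FloatPointer p = new FloatPointer(100);  // attached to the scope, which holds one reference
        p.retainReference();                     // extra reference: `p` will survive the scope
        // ...
    }                                            // the scope releases its reference; `p` stays alive
    // ... later, when done with it:
    // p.releaseReference();                     // count drops to zero -> native memory deallocated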

saudet commented 4 years ago

@saudet I think Java is unlikely to grow the kind of scope that you mention at the end as it's hard to check statically. You'd need to issue a compile time error if there was no enclosing scope, but that requires some magic in the class that's a child of that scope to indicate that it only works inside a scope. Java's mechanism for doing that is by making a child object out of the scope so the scope needs a name.

@Craigacp Look at that! It's actually "growing" one! :) http://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part2.html#scope-variables I think we can say that pretty strongly validates the idea behind PointerScope.