imagej / pyimagej

Use ImageJ from Python
https://pyimagej.readthedocs.io/

Passing file-like object to pyimagej #156

Open oeway opened 2 years ago

oeway commented 2 years ago

Hi, I am wondering if it's possible to support passing a file-like object from Python to ImageJ. The goal is to support lazy loading of large files that are potentially backed by a remote storage backend, for example using fsspec to create a virtual file object from an S3 server. I haven't tried it, but would Java operate on the Python file correctly? Say, using Bio-Formats.

ctrueden commented 1 year ago

@oeway I do not think JPype supports automagically doing anything with file-like objects on the Java side. However, I do think we could support this fairly easily. The SciJava Common library has an org.scijava.io subsystem with Location and DataHandle abstractions that let you code up random access however you want. So it should be doable to implement those two interfaces on the Python side using JPype's @JImplements, and then feed instances of such locations to APIs that consume them, such as SCIFIO's DatasetIOService for opening images as net.imagej.Dataset.

Is this still something you are interested in doing? If so, we can leave this issue open, and I could try to prototype something to help you. Or if this idea is no longer relevant to your project directions, we could close it for now. What do you think?

Thrameos commented 1 year ago

It has been mentioned before that one can implement an InputStream, OutputStream, or InputBuffer in Java backed by Python file handles and add converters to JPype. I just don't think anyone had enough interest to write it and the required test bench to make it a feature.

ctrueden commented 1 year ago

@Thrameos Thanks. This is something my team could try contributing to JPype, if you know of a good built-in random access interface in the Java standard library? We didn't know of one when we invented org.scijava.io.handle.DataHandle, which extends both DataInput and DataOutput for streaming, as well as adding random access style methods. The latter were largely copied from java.io.RandomAccessFile, which unfortunately is not an interface. There is also java.nio.ByteBuffer, but it's also not an interface. Neither are java.io.InputStream or java.io.OutputStream, although the DataInput and DataOutput interfaces are implemented by several of their concrete subclasses respectively.

What do you think would make the most sense, in terms of class hierarchy, for JPype to offer here?

oeway commented 1 year ago

Hi, thanks for looking into this. It would be really useful for sending an N5 or NGFF image URL to the server and loading it as a virtual file object in Python (a Python object which translates file access to HTTP requests, potentially with a Range header); we would then pass this file object to pyimagej. I think this would be very useful in general for pyimagej.
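For illustration only, the kind of virtual file object described above might be sketched in pure Python as follows. The `RangeBackedFile` class and the `fetch_range(start, length)` callback are hypothetical names, not part of pyimagej, JPype, or fsspec; against a real server the callback would issue an HTTP request with a Range header, whereas here it is replaced by an in-memory stand-in.

```python
import io


class RangeBackedFile(io.RawIOBase):
    """Read-only raw stream that fetches bytes on demand.

    fetch_range(start, length) is a hypothetical callback; against a real
    server it would issue an HTTP request with a Range header.
    """

    def __init__(self, fetch_range, size):
        super().__init__()
        self._fetch = fetch_range
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            target = offset
        elif whence == io.SEEK_CUR:
            target = self._pos + offset
        elif whence == io.SEEK_END:
            target = self._size + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        if target < 0:
            raise ValueError("negative seek position")
        self._pos = target
        return self._pos

    def readinto(self, buffer):
        # Only request bytes that exist and that the caller has room for.
        length = min(len(buffer), max(self._size - self._pos, 0))
        if length == 0:
            return 0  # end of stream
        chunk = self._fetch(self._pos, length)
        buffer[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


# Stand-in for remote storage: an in-memory blob plus a log of requests.
blob = bytes(range(256)) * 64
requests = []


def fake_fetch(start, length):
    requests.append((start, length))
    return blob[start:start + length]


remote = io.BufferedReader(RangeBackedFile(fake_fetch, len(blob)), buffer_size=4096)
remote.seek(1000)
header = remote.read(16)  # triggers a ranged fetch instead of a full download
```

Wrapping the raw stream in io.BufferedReader means each fetch covers a buffer-sized range rather than the few bytes the caller asked for, which is probably what one would want when every fetch is a network round trip.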

Thrameos commented 1 year ago

My guess would be that SeekableByteChannel from the nio library would be best for random access data.

https://docs.oracle.com/javase/7/docs/api/java/nio/channels/SeekableByteChannel.html

It is an interface that has random access with seek, read, write, and size, which is the minimum for any seekable object. It should be possible to make this available in JPype.

The older Java IO library components are not particularly flexible and often have large requirements in terms of methods that could make it prohibitive in terms of implementation. After all if Python can't satisfy one requirement well, then the implementation gets relegated to application layer rather than general use, which is largely why this one tends to get pushed down the road.

Looking over your interface, it looks like most of the behaviors could come from a SeekableByteChannel, though there may be some issues. NIO libraries don't have a lot of methods for individual pulls of data, as they split the data interpretation from the IO concept, so you will be getting much lower performance than if you do a mass pull using a direct byte buffer (which can be mapped into both Python and Java). Also, some concepts, like checksum, are above the level of the Java concept. So you would need to double wrap it, using the JPype PythonSeekableByteChannel to get a Java implementation and then building your DataHandle over it.

With regards to requiring an interface, that is only a requirement if the implementation is to be a full Python implementation without C support which will use @JImplements. For certain support classes it is better (or required if it isn't an interface), to implement a stub in org.jpype and then attach the hooks to the Python C API directly. For something like an IO interface where you could hit many thousands of times, the JNI interface is likely going to be a much better performer. Thus so long as the interface or implementation is part of the Java standard I have no issue including it in JPype if it is full featured.

If SeekableByteChannel is something you would be able to support, I can throw together sample code pretty easily. I can also take a shot at RandomAccessFile, but that will be very performance limited, as individual reads proxying through Python without buffering are likely to bottleneck quickly. Buffering would make it faster, but it would likely lose generality, as only the Java side would have the buffering and you could get Python and Java out of sync.

Thrameos commented 1 year ago

I looked into this further with regards to giving a high performance (mostly native) backend between SeekableByteChannel and Python IOBase.

On the JPype side there are minor infrastructure changes to expose JPProxyType so that it is easier to implement the native methods. Nothing drastic there.

To connect Python IOBase to SeekableByteChannel is okay in parts and sticky in others. Seek, tell, truncate, close are all easy to hook up. Read and write are possible but they lack the concept of a current pointer so it won't be a direct plumbing to Python readinto and write. I think that I can slice a ByteBuffer memoryview into the section with the correct start pointer to make it obey Java conventions but it won't be a one night wiring because I have to trial and error until I find a workable pattern. More dreadful is wiring up all the different exceptions that can be produced. I may need help getting through a workable test bench.
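The memoryview slicing idea can be sketched without a JVM. In the sketch below a plain bytearray stands in for the ByteBuffer's backing storage, and `position` and `limit` are passed explicitly; the helper names `read_into_window` and `write_from_window` are hypothetical and only model the Java convention of transferring bytes between position and limit and then advancing the position.

```python
import io


def read_into_window(stream, backing, position, limit):
    """Fill backing[position:limit] from stream, like channel.read(ByteBuffer).

    Returns (count, new_position); count is -1 at end of stream, as in Java.
    """
    window = memoryview(backing)[position:limit]  # no copy, shares storage
    if len(window) == 0:
        return 0, position
    count = stream.readinto(window)
    if count is None:  # non-blocking stream with no data ready
        return 0, position
    if count == 0:
        return -1, position
    return count, position + count


def write_from_window(stream, backing, position, limit):
    """Write backing[position:limit] to stream, like channel.write(ByteBuffer)."""
    count = stream.write(memoryview(backing)[position:limit])
    return count, position + count


source = io.BytesIO(b"0123456789abcdef")
backing = bytearray(100)
count, position = read_into_window(source, backing, 20, 35)

sink = io.BytesIO()
written, _ = write_from_window(sink, backing, 20, position)
```

Because the slice shares memory with the backing buffer, readinto writes straight into the right offset, so the only bookkeeping left is advancing the position by the returned count.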

Thrameos commented 1 year ago

Unfortunately it appears there is a bug in JPype memoryview with direct byte buffers that prevents it from working properly with Python io library. I will have to fix it and push a new release before we can make progress on this feature.

I was able to get the slower non-direct path working. Will post when I get more testing.

Thrameos commented 1 year ago

This is the slow implementation. Unfortunately it copies memory 3 times (Python, Python->byte[], byte[]->ByteBuffer). Hopefully, I can make a faster version for direct ByteBuffer once I have the JPype buffer bug fixed.

import jpype
import jpype.imports
jpype.startJVM()

from java.nio.channels import SeekableByteChannel, ClosedChannelException
from java.nio import ByteBuffer
from java.nio.file import Files, Paths, StandardOpenOption

@jpype.JImplements(SeekableByteChannel)
class PySeekableByteChannel(object):
    def __init__(self, fd):
        self.fd_ = fd

    def __enter__(self):
        return self

    def __exit__(self, a, b, c):
        self.fd_.close()

    @jpype.JOverride
    def truncate(self, pos):
        if self.fd_.closed:
            raise ClosedChannelException()
        self.fd_.truncate(pos)
        return self  # the Java contract returns this channel

    @jpype.JOverride
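    def size(self):
        if self.fd_.closed:
            raise ClosedChannelException()
        # Python file objects have no size(); seek to the end and restore.
        pos = self.fd_.tell()
        end = self.fd_.seek(0, 2)  # 2 == os.SEEK_END
        self.fd_.seek(pos)
        return end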

    @jpype.JOverride
    def position(self, *args):
        if self.fd_.closed:
            raise ClosedChannelException()
        if len(args) == 0:
            return self.fd_.tell()
        self.fd_.seek(args[0])
        return self  # position(long) returns this channel in Java

    @jpype.JOverride
    def write(self, buf):
        # slow copy write all bytes
        # FIXME JPype has a bug that prevents a direct copy
        if self.fd_.closed:
            raise ClosedChannelException()
        start = buf.position()
        u = bytearray(buf.array()[start:buf.limit()])
        n = self.fd_.write(u)
        buf.position(start + n)  # Java expects the buffer position to advance
        return n
        # FIXME all other exceptions go to IOException

    @jpype.JOverride
    def read(self, buf):
        # slow read to bytes then copy to Java
        # FIXME JPype has a bug that prevents a direct copy
        if self.fd_.closed:
            raise ClosedChannelException()
        pos = buf.position()
        lim = buf.limit()
        b = bytearray(lim-pos)
        m = memoryview(b)
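        f = self.fd_.readinto(m)
        if f == 0 and lim > pos:
            return -1  # end of stream per the Java channel contract
        buf.put(b, 0, f)  # put() advances the buffer position by f
        return f
        # FIXME all other exceptions go to IOException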

    @jpype.JOverride
    def close(self):
        self.fd_.close()

    @jpype.JOverride
    def isOpen(self):
        return not self.fd_.closed

with PySeekableByteChannel(open("test.txt","rb")) as sbc:
    bb = ByteBuffer.allocate(100)
    bb.position(20)
    bb.limit(35)
    sbc.read(bb)

with PySeekableByteChannel(open("test2.txt","wb")) as sbc:
#with Files.newByteChannel(Paths.get("test2.txt"), StandardOpenOption.CREATE, StandardOpenOption.WRITE) as sbc:
    bb = ByteBuffer.allocate(100)
    bb.position(20)
    bb.putChar("a")
    bb.rewind()
    bb.position(20)
    bb.limit(60)
    print(sbc.write(bb))

Thrameos commented 1 year ago

@ctrueden is SeekableByteChannel usable for you? It is indexed by a long so it should be able to go over the 2 GByte limit.

In fact, why not present Python foreign memory as a SeekableByteChannel? Nothing says that this view has to be a file, and it has been supported in every version of Java since 1.7, so we don't need to depend on new features.

ctrueden commented 1 year ago

@Thrameos I wasn't familiar with it before you brought it up on this thread, but the interface looks like a good match for Python file handles. On the SciJava side it would certainly be doable to make e.g. a DataHandle implementation backed by any SeekableByteChannel. My main concern would be with possible performance issues when doing things like reading data one word at a time. But we can always do something similar to BufferedInputStream/BufferedReader if there isn't a more elegant way already with Java's NIO channels mechanism (I don't know it well).

An even more natural mapping from Python file handles to Java might be FileChannel, which is an abstract class implementing SeekableByteChannel and other things. But I don't know if it's worth the headache to sort through that quite larger API determining whether it will all work with Python and ideally unit testing all of it.

Thrameos commented 1 year ago

At least for the implementation I posted, if it is backed by an io.BufferedIOBase then you should have full buffering at whatever the page size is. So local access should be quick even if you pull small amounts, but pure random access will still be slow. Thus you shouldn't have to buffer yourself if the Python backer is already buffered.
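A rough way to see this effect in plain Python is to count how often the underlying raw stream is actually hit. The `CountingRaw` class below is a hypothetical test helper, and the buffer size and access pattern are arbitrary choices for the sketch rather than anything measured against JPype.

```python
import io


class CountingRaw(io.RawIOBase):
    """Raw stream over bytes that counts how often it is actually read."""

    def __init__(self, data):
        super().__init__()
        self._inner = io.BytesIO(data)
        self.calls = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        return self._inner.seek(offset, whence)

    def tell(self):
        return self._inner.tell()

    def readinto(self, buffer):
        self.calls += 1
        return self._inner.readinto(buffer)


data = bytes(64 * 1024)

# Sequential single-byte reads: the buffer absorbs almost all of them.
raw_seq = CountingRaw(data)
buffered = io.BufferedReader(raw_seq, buffer_size=8192)
for _ in range(4096):
    buffered.read(1)

# Scattered single-byte reads: most seeks land outside the buffer.
raw_rand = CountingRaw(data)
scattered = io.BufferedReader(raw_rand, buffer_size=8192)
for i in range(64):
    scattered.seek((i * 37 % 64) * 1024)
    scattered.read(1)

print("raw reads, sequential:", raw_seq.calls)
print("raw reads, scattered: ", raw_rand.calls)
```

The sequential loop issues far more read calls than the scattered one, yet it should reach the raw stream far less often, which is the local-versus-random distinction described above.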

SeekableByteChannel is of course best if you use it to take out a decent section of a ByteBuffer at a time. The wrapper could do local paging. That was the problem with the old Java IO as it put a lot of burden on the input class so often it did too much.

I could try to wrap to FileChannel but it is as you can see a much bigger API. As an interface for your library, I would suggest using SeekableByteChannel as much as you can and FileChannel only if needed. After all I can complete the full contract for SeekableByteChannel when backed by a file, memory, or even other abstractions like a network backed URI, but I can only complete a portion of the File contract and then only for files. Choosing the lowest common interface would give you the greatest generality.

Thrameos commented 1 year ago

So here is the proposal.

JPype adds an implicit conversion for io.BufferedWriter and io.BufferedReader to either FileChannel or SeekableByteChannel.

JPype adds an explicit conversion for bytearray, memoryview, numpy.array, or other Python buffer objects to SeekableByteChannel.

I could make the latter set implicit but that seems like unexpected behavior.
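To make the second half of the proposal concrete, here is a pure-Python sketch of what a channel-style view over a Python buffer could look like. `BufferChannel` is a hypothetical stand-in that only mirrors the method names of SeekableByteChannel; it is not JPype API, it takes Python buffers where Java would take a ByteBuffer, and it reports errors with ValueError rather than Java exceptions.

```python
class BufferChannel:
    """Hypothetical pure-Python stand-in for a channel view over a buffer."""

    def __init__(self, obj):
        # cast("B") gives a flat byte view that shares memory with obj.
        self._view = memoryview(obj).cast("B")
        self._pos = 0
        self._open = True

    def _check_open(self):
        if not self._open:
            raise ValueError("channel is closed")

    def size(self):
        self._check_open()
        return len(self._view)

    def position(self, new_position=None):
        """With no argument return the position; otherwise set it."""
        self._check_open()
        if new_position is None:
            return self._pos
        if new_position < 0:
            raise ValueError("negative position")
        self._pos = new_position
        return self

    def read(self, dst):
        """Copy bytes at the current position into dst; -1 at end of data."""
        self._check_open()
        dst = memoryview(dst).cast("B")
        remaining = len(self._view) - self._pos
        if remaining <= 0:
            return -1
        n = min(len(dst), remaining)
        dst[:n] = self._view[self._pos:self._pos + n]
        self._pos += n
        return n

    def write(self, src):
        """Copy src into the buffer at the current position; never grows it."""
        self._check_open()
        src = memoryview(src).cast("B")
        n = min(len(src), max(len(self._view) - self._pos, 0))
        self._view[self._pos:self._pos + n] = src[:n]
        self._pos += n
        return n

    def isOpen(self):
        return self._open

    def close(self):
        if self._open:
            self._open = False
            self._view.release()


big = bytearray(1024)
channel = BufferChannel(big)
channel.position(100).write(b"hello")  # lands directly in big, no copy of big
channel.position(100)
out = bytearray(5)
count = channel.read(out)
```

Since the view shares storage with the original object, constructing it explicitly, as the proposal suggests, keeps it visible in the calling code that Java would be working on the caller's own memory.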

ctrueden commented 1 year ago

@Thrameos Sounds excellent. For the explicit conversion: are you thinking it would be a new method of jpype.nio?

Thrameos commented 1 year ago

No, I am planning to use augmentation of the Java API rather than adding new methods. So just like we added new methods to java.lang.Thread with attach and attachAsDaemon, I can add new methods to existing Java classes to make them Python friendly.

I would guess that either the cast operator or an explicit ctor will work.

from java.nio.channels import SeekableByteChannel
import numpy as np
big = bytearray(3*(2**31))  # 3 x 2 GiB of memory (Yes, this actually works)
obj.call(SeekableByteChannel(big))  # Make it so Java can work with a large piece of memory.

At least that makes it clear you are constructing a new view on an existing object.

I could also use SeekableByteChannel @ big or SeekableByteChannel.of(big). Any preferences?

(This will make it so we can handle files, memory, network, and shared memory with one common API, so long as the user is willing to use the nio abstraction to access it.)

Thrameos commented 1 year ago

At the same time I will likely make it so ByteBuffer can use the same syntax.

from java.nio import ByteBuffer
import numpy as np
small = bytearray(1024)
obj.call( ByteBuffer(small))  # instead of jpype.nio.convertToDirectBuffer

Don't worry, the old one won't go away (except in the docs).

Thrameos commented 1 year ago

Still working on this one. Unfortunately, given that FileChannel is a class rather than an interface and has many methods that are very Java-specific, it looks like the best I can do is send all of them to SeekableByteChannel. Interfaces are far easier to deal with, and the level of work needed to get a Python object to support the Java locking model is very high.

Once we have them as SeekableByteChannels we can look to seeing if there is additional functionality that can be safely exposed.