glencoesoftware / raw2ometiff

Raw format to OME-TIFF converter
GNU General Public License v2.0
45 stars 20 forks source link

Large 3D volumes hang on RandomAccessFile.setLength #49

Closed joshmoore closed 3 years ago

joshmoore commented 3 years ago

Generating an OME-TIFF from an IMS that's 1024^3 (t=512, c=2) hangs on setLength. Changing the ending (e.g., to .ome.btf) doesn't seem to change the behavior.

"main" glencoesoftware/bioformats2raw#1 prio=5 os_prio=0 tid=0x00007fb91000c000 nid=0x15 runnable [0x00007fb915852000]
   java.lang.Thread.State: RUNNABLE
        at java.io.RandomAccessFile.setLength(Native Method)
        at loci.common.NIOFileHandle.setLength(NIOFileHandle.java:229)
        at loci.common.AbstractNIOHandle.validateLength(AbstractNIOHandle.java:87)
        at loci.common.NIOFileHandle.writeSetup(NIOFileHandle.java:659)
        at loci.common.NIOFileHandle.writeLong(NIOFileHandle.java:597)
        at loci.common.RandomAccessOutputStream.writeLong(RandomAccessOutputStream.java:326)
        at loci.formats.tiff.TiffSaver.writeIntValue(TiffSaver.java:968)
        at loci.formats.tiff.TiffSaver.writeIFDValue(TiffSaver.java:606)
        at loci.formats.tiff.TiffSaver.writeIFD(TiffSaver.java:531)
        at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.writeIFD(PyramidFromDirectoryWriter.java:866)
        at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.writeIFDs(PyramidFromDirectoryWriter.java:640)
        at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.convertToPyramid(PyramidFromDirectoryWriter.java:502)
        at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.call(PyramidFromDirectoryWriter.java:211)
        at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.call(PyramidFromDirectoryWriter.java:89)
        at picocli.CommandLine.executeUserObject(CommandLine.java:1783)
        at picocli.CommandLine.access$900(CommandLine.java:145)
        at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2150)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2144)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2108)
        at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1968)
        at picocli.CommandLine.parseWithHandlers(CommandLine.java:2349)
        at picocli.CommandLine.parseWithHandler(CommandLine.java:2284)
        at picocli.CommandLine.call(CommandLine.java:2560)
        at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.main(PyramidFromDirectoryWriter.java:189)
sbesson commented 3 years ago

Trying to reproduce, early candidate is that this could be related either to the size of the data, the number of planes (IFDs) or the size of the metadata.

Trying to rule out some of these factors I did some simple tests with a sample fake file with small XY dimension:

(bf2ometiff) [sbesson@idr2-slot3 ~]$  bioformats2raw "test&sizeX=56&size&sizeY=56&sizeZ=1024&sizeT=51.fake" /uod/idr-scratch/lightsheet/zarr --file_type zarr
OpenJDK 64-Bit Server VM warning: You have loaded library /tmp/opencv_openpnp6792726713756714297/nu/pattern/opencv/linux/x86_64/libopencv_java342.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
2021-03-12 15:40:47,328 [main] INFO  c.g.bioformats2raw.Converter - Output will be incompatible with raw2ometiff (pyramidName: data.zarr, scaleFormatString: %d/%d)
2021-03-12 15:40:47,837 [main] INFO  loci.formats.ImageReader - FakeReader initializing test&sizeX=56&size&sizeY=56&sizeZ=1024&sizeT=51.fake
...
2021-03-12 15:45:18,128 [pool-1-thread-1] INFO  c.g.bioformats2raw.Converter - successfully wrote at [0, 0, 1003, 0, 50] to /0/0
2021-03-12 15:45:18,128 [pool-1-thread-1] INFO  c.g.bioformats2raw.Converter - Successfully processed tile; resolution=0 plane=52203 xx=0 yy=0 width=56 height=56
(bf2ometiff) [sbesson@idr2-slot3 ~]$ raw2ometiff /uod/idr-scratch/lightsheet/zarr/ /uod/idr-scratch/lightsheet/test.ome.tiff
2021-03-12 15:46:00,581 [main] INFO  c.g.p.PyramidFromDirectoryWriter - Creating tiled pyramid file /uod/idr-scratch/lightsheet/test.ome.tiff
2021-03-12 15:46:00,703 [main] INFO  c.g.p.PyramidFromDirectoryWriter - Number of resolution levels: 1
2021-03-12 15:46:00,706 [main] INFO  c.g.p.PyramidFromDirectoryWriter - Adding metadata for resolution: 0
...
2021-03-12 15:47:58,839 [pool-1-thread-31] INFO  org.perf4j.TimingLogger - start[1615564078839] time[0] tag[writeTile]
2021-03-12 15:47:58,841 [main] INFO  org.perf4j.TimingLogger - start[1615564078839] time[2] tag[getInputTileBytes]
2021-03-12 15:47:58,841 [pool-1-thread-26] INFO  org.perf4j.TimingLogger - start[1615564078841] time[0] tag[writeTile]

The raw2ometiff container is still running and in the meantime the TIFF file appears as corrupted

[sbesson@idr0-slot3 ~]$ tiffinfo /uod/idr/scratch/idr2-slot3/lightsheet/test.ome.tiff 
TIFFFetchDirectory: Sanity check on directory count failed, this is probably not a valid IFD offset.
TIFFReadDirectory: Failed to read directory at offset 16.

I suspect the writing code is currently in its writeIFDs phase but I might need to add additional DEBUG-level logging statements to confirm what is at stake here.

melissalinkert commented 3 years ago

I'd definitely expect tiffinfo to report the output TIFF as corrupted if raw2ometiff hasn't finished yet. The first IFD pointer is not written to the header until all pixels and IFDs have been written:

https://github.com/glencoesoftware/raw2ometiff/blob/master/src/main/java/com/glencoesoftware/pyramid/PyramidFromDirectoryWriter.java#L688

joshmoore commented 3 years ago

Yup. Went poking around after creating this and found a live JVM stuck (?) at:

"main" glencoesoftware/bioformats2raw#1 prio=5 os_prio=0 tid=0x00007fb8f800c000 nid=0x15 runnable [0x00007fb8ff563000]
   java.lang.Thread.State: RUNNABLE
        at java.io.RandomAccessFile.setLength(Native Method)
        at loci.common.NIOFileHandle.setLength(NIOFileHandle.java:229)
        at loci.common.AbstractNIOHandle.validateLength(AbstractNIOHandle.java:87)
        at loci.common.NIOFileHandle.writeSetup(NIOFileHandle.java:659)
        at loci.common.NIOFileHandle.writeShort(NIOFileHandle.java:605)
        at loci.common.RandomAccessOutputStream.writeShort(RandomAccessOutputStream.java:333)
        at loci.formats.tiff.TiffSaver.writeIFDValue(TiffSaver.java:612)
        at loci.formats.tiff.TiffSaver.writeIFD(TiffSaver.java:531)
        at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.writeIFD(PyramidFromDirectoryWriter.java:866)
        at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.writeIFDs(PyramidFromDirectoryWriter.java:629)
        at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.convertToPyramid(PyramidFromDirectoryWriter.java:502)
        at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.call(PyramidFromDirectoryWriter.java:211)
        at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.call(PyramidFromDirectoryWriter.java:89)
        at picocli.CommandLine.executeUserObject(CommandLine.java:1783)
        at picocli.CommandLine.access$900(CommandLine.java:145)
        at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2150)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2144)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2108)
        at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1968)
        at picocli.CommandLine.parseWithHandlers(CommandLine.java:2349)
        at picocli.CommandLine.parseWithHandler(CommandLine.java:2284)
        at picocli.CommandLine.call(CommandLine.java:2560)
        at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.main(PyramidFromDirectoryWriter.java:189)

from 3 weeks ago...

joshmoore commented 3 years ago
[jamoore@idr2-slot3 ~]$ sudo cat /proc/54307/stack
[<ffffffff9f312436>] futex_wait_queue_me+0xc6/0x130
[<ffffffff9f3131db>] futex_wait+0x17b/0x280
[<ffffffff9f314f46>] do_futex+0x106/0x5a0
[<ffffffff9f315460>] SyS_futex+0x80/0x190
[<ffffffff9f995226>] tracesys+0xa6/0xcc
[<ffffffffffffffff>] 0xffffffffffffffff
[jamoore@idr2-slot3 ~]$ sudo cat /proc/54307/wchan
futex_wait_queue_me[jamoore@idr2-slot3 ~]$ sudo cat /proc/54307/status
Name:   java
Umask:  0022
State:  S (sleeping)
Tgid:   54307
Ngid:   0
Pid:    54307
PPid:   54287
...
joshmoore commented 3 years ago

Looks like the current run with the following dimensions has also hung:

XY=768
Z=768
C=2
T=100
XC=128
ZC=128
joshmoore commented 3 years ago

And these:

XY=512
Z=512
C=1
XC=64
ZC=64
T=100 OR 50 or 25 or 12 or 6 or 2 or 1

The T=12 run hung after half an hour, T=6 after about 10, T=2 after about 5 minutes, T=1 after less than 3 minutes

joshmoore commented 3 years ago

Giving up for now. Something seems to be quite wrong. (May be between keyboard & chair)

cc: @sbesson @melissalinkert @dgault

sbesson commented 3 years ago

From a Zarr fileset generated using bioformats2raw "test&sizeX=56&sizeY=56&sizeZ=1024&sizeT=51.fake" --file_type zarr --compression raw zarr, I measured the time taken for writing IFDs in various situations:

joshmoore commented 3 years ago

Looks like it's just GPFS-over-NFS on the one server. 🤦🏼

chris-allan commented 3 years ago

Safe to close this as a filesystem problem now, @joshmoore?

joshmoore commented 3 years ago

@chris-allan: I think so. I should note that now a second server has hung after running this code for an external period of time. That could still point to an underlying hardware issue, but I haven't heard back yet. Happy to re-close under that assumption and I can always re-open if I see it elsewhere. I do wonder what's going on under the hood though. :wink: