Trying to reproduce; an early candidate is that this could be related either to the size of the data, the number of planes (IFDs), or the size of the metadata.
To rule out some of these factors, I ran some simple tests with a sample fake file with small XY dimensions:
(bf2ometiff) [sbesson@idr2-slot3 ~]$ bioformats2raw "test&sizeX=56&size&sizeY=56&sizeZ=1024&sizeT=51.fake" /uod/idr-scratch/lightsheet/zarr --file_type zarr
OpenJDK 64-Bit Server VM warning: You have loaded library /tmp/opencv_openpnp6792726713756714297/nu/pattern/opencv/linux/x86_64/libopencv_java342.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
2021-03-12 15:40:47,328 [main] INFO c.g.bioformats2raw.Converter - Output will be incompatible with raw2ometiff (pyramidName: data.zarr, scaleFormatString: %d/%d)
2021-03-12 15:40:47,837 [main] INFO loci.formats.ImageReader - FakeReader initializing test&sizeX=56&size&sizeY=56&sizeZ=1024&sizeT=51.fake
...
2021-03-12 15:45:18,128 [pool-1-thread-1] INFO c.g.bioformats2raw.Converter - successfully wrote at [0, 0, 1003, 0, 50] to /0/0
2021-03-12 15:45:18,128 [pool-1-thread-1] INFO c.g.bioformats2raw.Converter - Successfully processed tile; resolution=0 plane=52203 xx=0 yy=0 width=56 height=56
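For scale, a quick back-of-the-envelope calculation (a standalone sketch, not part of either tool) of the plane count and raw data size implied by those fake dimensions:

```java
// Standalone sketch: plane (IFD) count and raw pixel bytes implied by the fake
// file above, assuming the FakeReader defaults of a single channel and uint8 pixels.
public class PlaneCountEstimate {
    public static void main(String[] args) {
        long sizeX = 56, sizeY = 56, sizeZ = 1024, sizeC = 1, sizeT = 51;
        long planes = sizeZ * sizeC * sizeT;        // 52,224 planes -> 52,224 IFDs in the TIFF
        long rawBytes = planes * sizeX * sizeY;     // ~160 MB of uint8 pixels
        System.out.printf("planes=%d rawBytes=%d%n", planes, rawBytes);
    }
}
```

The plane=52203 logged just above is an index into those 52,224 planes (z=1003, t=50), so the conversion to Zarr side appears to run to completion.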
(bf2ometiff) [sbesson@idr2-slot3 ~]$ raw2ometiff /uod/idr-scratch/lightsheet/zarr/ /uod/idr-scratch/lightsheet/test.ome.tiff
2021-03-12 15:46:00,581 [main] INFO c.g.p.PyramidFromDirectoryWriter - Creating tiled pyramid file /uod/idr-scratch/lightsheet/test.ome.tiff
2021-03-12 15:46:00,703 [main] INFO c.g.p.PyramidFromDirectoryWriter - Number of resolution levels: 1
2021-03-12 15:46:00,706 [main] INFO c.g.p.PyramidFromDirectoryWriter - Adding metadata for resolution: 0
...
2021-03-12 15:47:58,839 [pool-1-thread-31] INFO org.perf4j.TimingLogger - start[1615564078839] time[0] tag[writeTile]
2021-03-12 15:47:58,841 [main] INFO org.perf4j.TimingLogger - start[1615564078839] time[2] tag[getInputTileBytes]
2021-03-12 15:47:58,841 [pool-1-thread-26] INFO org.perf4j.TimingLogger - start[1615564078841] time[0] tag[writeTile]
The raw2ometiff container is still running and in the meantime the TIFF file appears to be corrupted:
[sbesson@idr0-slot3 ~]$ tiffinfo /uod/idr/scratch/idr2-slot3/lightsheet/test.ome.tiff
TIFFFetchDirectory: Sanity check on directory count failed, this is probably not a valid IFD offset.
TIFFReadDirectory: Failed to read directory at offset 16.
I suspect the writing code is currently in its writeIFDs phase, but I might need to add additional DEBUG-level logging statements to confirm what is at stake here.
I'd definitely expect tiffinfo to report the output TIFF as corrupted if raw2ometiff hasn't finished yet. The first IFD pointer is not written to the header until all pixels and IFDs have been written.
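A minimal sketch of that write pattern (plain java.io, not the actual raw2ometiff/TiffSaver code) shows why a reader sees garbage until the very end:

```java
import java.io.RandomAccessFile;

// Minimal sketch of deferred first-IFD-offset writing (not the raw2ometiff code).
// The 4-byte offset at byte 4 of a classic little-endian TIFF header is left as a
// placeholder and only patched in after every tile and IFD has been written, so a
// half-written file has no valid pointer for tiffinfo to follow.
public class DeferredFirstIfd {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile out = new RandomAccessFile("sketch.tif", "rw")) {
            out.write(new byte[] {'I', 'I', 42, 0});   // "II" + magic 42, little-endian
            out.writeInt(0);                            // first IFD offset: placeholder

            // ... write all tile data, then the IFDs, remembering where they start ...
            long firstIfdOffset = out.getFilePointer();

            out.seek(4);                                // patch the header last
            out.writeInt(Integer.reverseBytes((int) firstIfdOffset));
        }
    }
}
```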
Yup. Went poking around after creating this and found a live JVM stuck (?) at:
"main" #1 prio=5 os_prio=0 tid=0x00007fb8f800c000 nid=0x15 runnable [0x00007fb8ff563000]
java.lang.Thread.State: RUNNABLE
at java.io.RandomAccessFile.setLength(Native Method)
at loci.common.NIOFileHandle.setLength(NIOFileHandle.java:229)
at loci.common.AbstractNIOHandle.validateLength(AbstractNIOHandle.java:87)
at loci.common.NIOFileHandle.writeSetup(NIOFileHandle.java:659)
at loci.common.NIOFileHandle.writeShort(NIOFileHandle.java:605)
at loci.common.RandomAccessOutputStream.writeShort(RandomAccessOutputStream.java:333)
at loci.formats.tiff.TiffSaver.writeIFDValue(TiffSaver.java:612)
at loci.formats.tiff.TiffSaver.writeIFD(TiffSaver.java:531)
at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.writeIFD(PyramidFromDirectoryWriter.java:866)
at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.writeIFDs(PyramidFromDirectoryWriter.java:629)
at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.convertToPyramid(PyramidFromDirectoryWriter.java:502)
at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.call(PyramidFromDirectoryWriter.java:211)
at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.call(PyramidFromDirectoryWriter.java:89)
at picocli.CommandLine.executeUserObject(CommandLine.java:1783)
at picocli.CommandLine.access$900(CommandLine.java:145)
at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2150)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2144)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2108)
at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1968)
at picocli.CommandLine.parseWithHandlers(CommandLine.java:2349)
at picocli.CommandLine.parseWithHandler(CommandLine.java:2284)
at picocli.CommandLine.call(CommandLine.java:2560)
at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.main(PyramidFromDirectoryWriter.java:189)
from 3 weeks ago...
[jamoore@idr2-slot3 ~]$ sudo cat /proc/54307/stack
[<ffffffff9f312436>] futex_wait_queue_me+0xc6/0x130
[<ffffffff9f3131db>] futex_wait+0x17b/0x280
[<ffffffff9f314f46>] do_futex+0x106/0x5a0
[<ffffffff9f315460>] SyS_futex+0x80/0x190
[<ffffffff9f995226>] tracesys+0xa6/0xcc
[<ffffffffffffffff>] 0xffffffffffffffff
[jamoore@idr2-slot3 ~]$ sudo cat /proc/54307/wchan
futex_wait_queue_me
[jamoore@idr2-slot3 ~]$ sudo cat /proc/54307/status
Name: java
Umask: 0022
State: S (sleeping)
Tgid: 54307
Ngid: 0
Pid: 54307
PPid: 54287
...
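Given where the jstack above stops (RandomAccessFile.setLength, reached from AbstractNIOHandle.validateLength whenever a write goes past the current end of file), a crude way to check whether simply extending a file is pathologically slow on this mount is to time setLength in isolation. A standalone sketch (the path is only an example):

```java
import java.io.RandomAccessFile;

// Standalone sketch: time repeated small RandomAccessFile.setLength() calls on the
// suspect filesystem, mimicking how validateLength grows the output file a little on
// each IFD write. The default path below is just an example target.
public class SetLengthBench {
    public static void main(String[] args) throws Exception {
        String path = args.length > 0 ? args[0] : "/tmp/setlength-bench.bin";
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
            long length = 0;
            for (int i = 0; i < 10_000; i++) {
                length += 4096;                           // grow in small increments
                long start = System.nanoTime();
                raf.setLength(length);
                long micros = (System.nanoTime() - start) / 1_000;
                if (micros > 10_000) {                    // report anything slower than 10 ms
                    System.out.printf("setLength(%d) took %d us%n", length, micros);
                }
            }
        }
    }
}
```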
Looks like the current run with the following dimensions has also hung:
XY=768
Z=768
C=2
T=100
XC=128
ZC=128
And these:
XY=512
Z=512
C=1
XC=64
ZC=64
T=100 or 50 or 25 or 12 or 6 or 2 or 1
The T=12 run hung after half an hour, T=6 after about 10 minutes, T=2 after about 5 minutes, and T=1 after less than 3 minutes.
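For context, the rough plane counts and uncompressed data sizes for these runs (assuming 8-bit pixels), computed with a small standalone sketch:

```java
// Standalone sketch: plane (IFD) counts and uncompressed sizes for the runs above,
// assuming 8-bit pixels.
public class RunSizes {
    static void report(String name, long xy, long z, long c, long t) {
        long planes = z * c * t;                               // one IFD per plane
        double gib = (double) xy * xy * planes / (1L << 30);
        System.out.printf("%s: %d planes, %.1f GiB uncompressed%n", name, planes, gib);
    }

    public static void main(String[] args) {
        report("XY=768 Z=768 C=2 T=100", 768, 768, 2, 100);    // 153,600 planes, ~84.4 GiB
        report("XY=512 Z=512 C=1 T=100", 512, 512, 1, 100);    //  51,200 planes, ~12.5 GiB
        report("XY=512 Z=512 C=1 T=1",   512, 512, 1, 1);      //     512 planes,  ~0.1 GiB
    }
}
```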
Giving up for now. Something seems to be quite wrong. (May be between keyboard & chair)
cc: @sbesson @melissalinkert @dgault
From a Zarr fileset generated using bioformats2raw "test&sizeX=56&sizeY=56&sizeZ=1024&sizeT=51.fake" --file_type zarr --compression raw zarr, I measured the time taken for writing IFDs in various situations:
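For reference, a minimal sketch of how such a per-IFD timing can be captured with the perf4j stopwatch API (the same one behind the writeTile/getInputTileBytes lines in the raw2ometiff log above); writeOneIfd() here is a hypothetical stand-in for the code being timed:

```java
import org.perf4j.LoggingStopWatch;
import org.perf4j.StopWatch;

// Sketch only: wrap each IFD write in a perf4j stopwatch so it emits
// "start[...] time[...] tag[writeIFD]" lines like the writeTile timings above.
// writeOneIfd() is a hypothetical placeholder for the code under test.
public class WriteIfdTiming {
    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 100; i++) {
            StopWatch sw = new LoggingStopWatch("writeIFD"); // logs to System.err by default
            writeOneIfd();
            sw.stop();
        }
    }

    private static void writeOneIfd() throws InterruptedException {
        Thread.sleep(1); // placeholder for the real IFD write
    }
}
```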
Looks like it's just GPFS-over-NFS on the one server. 🤦🏼
Safe to close this as a filesystem problem now, @joshmoore?
@chris-allan: I think so. I should note that a second server has now hung after running this code for an extended period of time. That could still point to an underlying hardware issue, but I haven't heard back yet. Happy to re-close under that assumption and I can always re-open if I see it elsewhere. I do wonder what's going on under the hood though. 😉
Generating an OME-TIFF from an IMS that's 1024^3 (t=512, c=2) hangs on setLength. Changing the output file extension (e.g., to .ome.btf) doesn't seem to change the behavior.
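To take the extension out of the equation entirely, BigTIFF can also be requested explicitly when writing with Bio-Formats' own writer rather than relying on the .ome.btf / .btf naming. A minimal sketch (assuming a plain OMETiffWriter, not the raw2ometiff path shown in the stack trace above):

```java
import loci.formats.MetadataTools;
import loci.formats.meta.IMetadata;
import loci.formats.out.OMETiffWriter;

// Minimal sketch (plain Bio-Formats OMETiffWriter, not raw2ometiff): request BigTIFF
// explicitly with setBigTiff(true) so the behaviour no longer depends on whether the
// output name ends in .ome.tiff or .ome.btf.
public class BigTiffSketch {
    public static void main(String[] args) throws Exception {
        IMetadata meta = MetadataTools.createOMEXMLMetadata();
        // One small uint8 plane: just enough metadata for the writer to accept setId().
        MetadataTools.populateMetadata(meta, 0, "test", true, "XYZCT",
            "uint8", 64, 64, 1, 1, 1, 1);
        try (OMETiffWriter writer = new OMETiffWriter()) {
            writer.setMetadataRetrieve(meta);
            writer.setBigTiff(true);                   // explicit, extension-independent
            writer.setId("bigtiff-sketch.ome.tiff");
            writer.saveBytes(0, new byte[64 * 64]);
        }
    }
}
```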