saalfeldlab / stitching-spark

Reconstruct big images from overlapping tiled images on a Spark cluster.
GNU General Public License v2.0

Run ConvertTIFFTilesToN5Spark on a Dataproc cluster #40

Open carshadi opened 2 years ago

carshadi commented 2 years ago

Hi there,

I am trying to run the ConvertTIFFTilesToN5Spark step on a Dataproc cluster, where the TIFF tiles and the JSON configuration file are both located in a Google Cloud Storage bucket.

The issue is that it fails to load the TIFF tiles from the bucket. Error:

22/05/13 19:56:10 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) (convert-tiff-tiles-to-n5-spark-cluster-m.us-west1-a.c.neural-dynamics-338018.internal executor 1): java.lang.NullPointerException
    at net.imglib2.img.ImagePlusAdapter.wrapLocal(ImagePlusAdapter.java:97)
    at net.imglib2.img.ImagePlusAdapter.wrap(ImagePlusAdapter.java:74)
    at net.imglib2.img.imageplus.ImagePlusImgs.from(ImagePlusImgs.java:210)
    at org.janelia.stitching.ConvertTIFFTilesToN5Spark.convertTileToN5(ConvertTIFFTilesToN5Spark.java:207)
    at org.janelia.stitching.ConvertTIFFTilesToN5Spark.lambda$convertTilesToN5$cbf5f68e$1(ConvertTIFFTilesToN5Spark.java:161)
    at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
    at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
    at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
    at scala.collection.AbstractIterator.to(Iterator.scala:1431)
    at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
    at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
    at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
    at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
    at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
    at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

My tile_config.json looks like this (bucket name redacted):

[{"index": 0, "file": "gs://xxxx/spark-stitching-test/Ex_488_em_525_merged_tiffs/111590/111590_088050/000000.tiff", "position": [0.0, 0.0, 0.0], "size": [2000, 1600, 420], "pixel_resolution": [1.8, 1.8, 2.0], "type": "GRAY16"}, 
{"index": 1, "file": "gs://xxxx/spark-stitching-test/Ex_488_em_525_merged_tiffs/111590/111590_088050/008400.tiff", "position": [0.0, 0.0, 420.0], "size": [2000, 1600, 420], "pixel_resolution": [1.8, 1.8, 2.0], "type": "GRAY16"}, 
{"index": 2, "file": "gs://xxxx/spark-stitching-test/Ex_488_em_525_merged_tiffs/111590/111590_088050/016800.tiff", "position": [0.0, 0.0, 840.0], "size": [2000, 1600, 420], "pixel_resolution": [1.8, 1.8, 2.0], "type": "GRAY16"}

My job looks like this:

[screenshot of the Dataproc job submission]

It appears that the GoogleCloudDataProvider simply calls IJ.openImage() on the TIFF path, without downloading the blob to a temporary directory first. Am I correct in assuming that the way I'm running this isn't supported, or do I just need to format things differently?
https://github.com/saalfeldlab/stitching-spark/blob/e118564b283fc2b375303516a78f8f3909bc5d6e/src/main/java/org/janelia/util/ImageImporter.java#L19-L27
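
For what it's worth, the stack trace looks consistent with IJ.openImage() returning null for the gs:// path, and ImagePlusAdapter then dereferencing that null. A minimal sketch of that failure mode (the bucket path below is a placeholder, not one of my actual tiles):

    import ij.IJ;
    import ij.ImagePlus;

    public class GsOpenImageRepro
    {
        public static void main( final String[] args )
        {
            // IJ.openImage() has no handler for gs:// URIs, so nothing is opened.
            final ImagePlus imp = IJ.openImage( "gs://some-bucket/some-tile.tiff" );

            // Prints "null". In the pipeline, this null is what ImagePlusAdapter.wrap()
            // eventually dereferences inside convertTileToN5, producing the
            // NullPointerException in the stack trace above.
            System.out.println( imp );
        }
    }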

Thank you!

carshadi commented 2 years ago

Update: I got it to work by adding the following to https://github.com/saalfeldlab/stitching-spark/blob/e1b6d2e6fa06c5735b2a7b1dd3583d5d5f31f734/src/main/java/org/janelia/dataaccess/googlecloud/GoogleCloudDataProvider.java#L201-L207

    @Override
    public ImagePlus loadImage( final String link ) throws IOException
    {
        if ( link.endsWith( ".tif" ) || link.endsWith( ".tiff" ) )
        {
            if ( link.startsWith( "gs:" ) )
            {
                // Download the blob to a temporary file, open it with ImageJ,
                // and delete the temporary file when done.
                Path tempPath = null;
                ImagePlus imp = null;
                try
                {
                    tempPath = Files.createTempFile( null, ".tif" );
                    final GoogleCloudStorageURI googleCloudUri = new GoogleCloudStorageURI( link );
                    final Blob blob = storage.get( BlobId.of( googleCloudUri.getBucket(), googleCloudUri.getKey() ) );
                    blob.downloadTo( tempPath );
                    imp = ImageImporter.openImage( tempPath.toString() );
                }
                finally
                {
                    if ( tempPath != null )
                        tempPath.toFile().delete();
                }
                return imp;
            }
            else
            {
                // Local (or otherwise directly readable) path: open it as before.
                return ImageImporter.openImage( link );
            }
        }
        throw new NotImplementedException( "Only TIFF images are supported at the moment" );
    }

Please let me know if there are any anticipated issues.

Thanks, Cameron