saalfeldlab / stitching-spark

Reconstruct big images from overlapping tiled images on a Spark cluster.
GNU General Public License v2.0

Run ConvertTIFFTilesToN5Spark on a Dataproc cluster #40

Open carshadi opened 2 years ago

carshadi commented 2 years ago

Hi there,

I am trying to run the ConvertTIFFTilesToN5Spark step on a Dataproc cluster, where the TIFF tiles and the JSON configuration file are both located in a Google Cloud Storage bucket.

The issue is that it fails to load the TIFF tiles from the bucket. Error:

22/05/13 19:56:10 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) (convert-tiff-tiles-to-n5-spark-cluster-m.us-west1-a.c.neural-dynamics-338018.internal executor 1): java.lang.NullPointerException
    at net.imglib2.img.ImagePlusAdapter.wrapLocal(ImagePlusAdapter.java:97)
    at net.imglib2.img.ImagePlusAdapter.wrap(ImagePlusAdapter.java:74)
    at net.imglib2.img.imageplus.ImagePlusImgs.from(ImagePlusImgs.java:210)
    at org.janelia.stitching.ConvertTIFFTilesToN5Spark.convertTileToN5(ConvertTIFFTilesToN5Spark.java:207)
    at org.janelia.stitching.ConvertTIFFTilesToN5Spark.lambda$convertTilesToN5$cbf5f68e$1(ConvertTIFFTilesToN5Spark.java:161)
    at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
    at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
    at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
    at scala.collection.AbstractIterator.to(Iterator.scala:1431)
    at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
    at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
    at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
    at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
    at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
    at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

My tile_config.json looks like this (bucket name redacted):

[{"index": 0, "file": "gs://xxxx/spark-stitching-test/Ex_488_em_525_merged_tiffs/111590/111590_088050/000000.tiff", "position": [0.0, 0.0, 0.0], "size": [2000, 1600, 420], "pixel_resolution": [1.8, 1.8, 2.0], "type": "GRAY16"}, 
{"index": 1, "file": "gs://xxxx/spark-stitching-test/Ex_488_em_525_merged_tiffs/111590/111590_088050/008400.tiff", "position": [0.0, 0.0, 420.0], "size": [2000, 1600, 420], "pixel_resolution": [1.8, 1.8, 2.0], "type": "GRAY16"}, 
{"index": 2, "file": "gs://xxxx/spark-stitching-test/Ex_488_em_525_merged_tiffs/111590/111590_088050/016800.tiff", "position": [0.0, 0.0, 840.0], "size": [2000, 1600, 420], "pixel_resolution": [1.8, 1.8, 2.0], "type": "GRAY16"}

My job looks like this:

[screenshot of the Dataproc job submission]

It appears that the GoogleCloudDataProvider simply calls IJ.openImage() on the TIFF path, without downloading the blob to a temporary directory first. Am I correct in assuming that the way I'm running this isn't supported, or do I just need to format things differently?
https://github.com/saalfeldlab/stitching-spark/blob/e118564b283fc2b375303516a78f8f3909bc5d6e/src/main/java/org/janelia/util/ImageImporter.java#L19-L27
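
For what it's worth, the stack trace looks consistent with IJ.openImage() returning null for the gs:// path, and ImagePlusAdapter then dereferencing that null. A minimal sketch of that failure mode (the bucket path below is a placeholder, not one of my actual tiles):

    import ij.IJ;
    import ij.ImagePlus;

    public class GsOpenImageRepro
    {
        public static void main( final String[] args )
        {
            // IJ.openImage() has no handler for gs:// URIs, so nothing is opened.
            final ImagePlus imp = IJ.openImage( "gs://some-bucket/some-tile.tiff" );

            // Prints "null". In the pipeline, this null is what ImagePlusAdapter.wrap()
            // eventually dereferences inside convertTileToN5, producing the
            // NullPointerException in the stack trace above.
            System.out.println( imp );
        }
    }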

Thank you!

carshadi commented 2 years ago

Update: I got it to work by adding the following to https://github.com/saalfeldlab/stitching-spark/blob/e1b6d2e6fa06c5735b2a7b1dd3583d5d5f31f734/src/main/java/org/janelia/dataaccess/googlecloud/GoogleCloudDataProvider.java#L201-L207

    @Override
    public ImagePlus loadImage( final String link ) throws IOException
    {
        if ( link.endsWith( ".tif" ) || link.endsWith( ".tiff" ) )
        {
            if ( link.startsWith( "gs:" ) )
            {
                // Download the blob to a temporary file, open it with ImageJ,
                // and delete the temporary file when done.
                Path tempPath = null;
                ImagePlus imp = null;
                try
                {
                    tempPath = Files.createTempFile( null, ".tif" );
                    final GoogleCloudStorageURI googleCloudUri = new GoogleCloudStorageURI( link );
                    final Blob blob = storage.get( BlobId.of( googleCloudUri.getBucket(), googleCloudUri.getKey() ) );
                    blob.downloadTo( tempPath );
                    imp = ImageImporter.openImage( tempPath.toString() );
                }
                finally
                {
                    if ( tempPath != null )
                        tempPath.toFile().delete();
                }
                return imp;
            }
            else
            {
                // Local (or otherwise directly readable) path: open it as before.
                return ImageImporter.openImage( link );
            }
        }
        throw new NotImplementedException( "Only TIFF images are supported at the moment" );
    }

Please let me know if there are any anticipated issues.

Thanks, Cameron