saalfeldlab / stitching-spark

Reconstruct big images from overlapping tiled images on a Spark cluster.
GNU General Public License v2.0

Flatfield Correction memory requirements #42

Open carshadi opened 2 years ago

carshadi commented 2 years ago

Hello,

My dataset is composed of 1200 tiles of shape [2000,1600,105], unsigned 16-bit ints. Each tile is ~640MB, and the total dataset is 768GB.
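As a quick check on these sizes, here is a minimal sketch of the arithmetic, assuming uncompressed storage at 2 bytes per voxel. The variable and function names are illustrative only and not part of stitching-spark. The ~640MB per-tile figure corresponds to binary units (MiB); in decimal units each tile is somewhat larger.

```python
import math

# Values taken from the description above.
tile_shape = (2000, 1600, 105)   # voxels per tile
bytes_per_voxel = 2              # unsigned 16-bit integers
n_tiles = 1200


def tile_bytes(shape, bytes_per_voxel):
    """Uncompressed size of one tile in bytes."""
    return math.prod(shape) * bytes_per_voxel


per_tile = tile_bytes(tile_shape, bytes_per_voxel)
total = per_tile * n_tiles

# Report both decimal (MB/GB) and binary (MiB/GiB) units.
print(f"per tile: {per_tile / 1e6:.0f} MB ({per_tile / 2**20:.0f} MiB)")
print(f"total:    {total / 1e9:.0f} GB ({total / 2**30:.0f} GiB)")
```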

I tried several Dataproc cluster configurations, but every run ran out of memory before the job finished. Here is the log from one such failed run:

Working interval is at [0, 0, 0] of size [2000, 1600, 105]
Working with stack of size 1120
Output directory: gs://xxxx/spark-stitching-test/tile_config-flatfield/fullsize/solution
Running flatfield correction script in 3D mode
Histogram intensity range: min=0.0, max=596.0
Background intensity value: 2.0
Binning the input stack and saving as N5 blocks...
22/05/31 09:36:11 ERROR org.apache.spark.scheduler.AsyncEventQueue: Dropping event from queue eventLog. This likely means one of the listeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
22/05/31 09:36:11 WARN org.apache.spark.scheduler.AsyncEventQueue: Dropped 1 events from eventLog since the application started.
Collected reference histogram of size 258 (first and last bins are tail bins):
[0.0, 437.7783683452381, 81.77367467857142, 46.8757134047619, 47.477869, 24.27659158333333, 20.217159428571428, 24.67996761904762, 13.484565666666667, 11.611294119047619, 14.610392464285715, 8.293473321428571, 7.379666666666667, 9.752576595238095, 5.83438805952381, 5.138283369047619, 6.672545178571428, 3.938613880952381, 3.604872380952381, 4.9050658333333335, 2.9963505, 2.8183993095238096, 3.935544, 2.454618130952381, 2.342203142857143, 3.3578584166666667, 2.1719595833333334, 2.1521068928571427, 3.1912209285714286, 2.0808182619047617, 2.057839535714286, 3.0716244761904763, 2.0486201904761905, 2.0560531785714287, 3.1066629285714287, 2.091556119047619, 2.1118542023809526, 3.2133971071428573, 2.176664880952381, 2.2079175714285713, 3.376959154761905, 2.293583488095238, 2.3201689880952383, 3.5095552261904763, 2.3501815, 2.354911095238095, 3.5369314166666665, 2.358748130952381, 2.3579966904761904, 3.5322006071428573, 2.350168214285714, 2.3450787976190477, 3.5074072857142857, 2.330146857142857, 2.3218056666666667, 3.4597665714285712, 2.2850581785714286, 2.2618299404761903, 3.343573261904762, 2.1940888095238096, 2.165262333333333, 3.1936283452380954, 2.092879726190476, 2.0641206666666667, 2.0355645, 3.0004284285714284, 1.9654259047619047, 1.9381174285714287, 2.8543376904761906, 1.866112130952381, 1.835421869047619, 2.6926035119047618, 1.7538462976190476, 1.7213971547619047, 2.52253375, 1.6434050476190476, 1.6141225714285714, 2.367681869047619, 1.544810738095238, 1.5184377976190475, 2.2299729642857145, 1.4552315119047619, 1.4303141071428571, 2.099422119047619, 1.3690689404761904, 1.3446135714285714, 1.971852369047619, 1.285029369047619, 1.2631100238095239, 1.854925011904762, 1.2108501428571428, 1.1915200595238096, 1.752440119047619, 1.1456031309523809, 1.127990380952381, 1.658482880952381, 1.0839164285714287, 1.0666084642857143, 1.567604988095238, 1.0240427142857143, 1.0078844166666667, 1.4823280357142856, 0.969960880952381, 0.9559713928571428, 1.4111897142857144, 
0.9262692261904761, 0.91561625, 1.3544762023809525, 0.890556880952381, 0.8807474166666667, 1.3031114047619048, 0.8570969404761904, 0.8480894404761905, 1.2557332261904761, 0.8268718095238096, 0.8195214285714286, 1.2149739404761906, 0.8017395833333333, 0.7954338333333333, 1.1819282976190477, 0.7809344761904762, 0.7757366904761904, 1.154225892857143, 0.7637096428571428, 0.7590340238095238, 1.1298775833333334, 0.7474632142857143, 0.7429755833333334, 0.7380995952380952, 1.0983775119047618, 0.7262675357142857, 0.7218658214285715, 1.074690619047619, 0.7112021190476191, 0.7072245, 1.0541304523809525, 0.6980834047619048, 0.6944824642857143, 1.0356507261904762, 0.6859677023809524, 0.6826705119047619, 1.0172985833333332, 0.6741086428571429, 0.6706799523809523, 0.9999159642857143, 0.6627480238095238, 0.6596391547619047, 0.9839666190476191, 0.6525576785714285, 0.6499785, 0.9700708690476191, 0.6436101428571429, 0.6411394047619048, 0.9570107976190476, 0.6348474285714286, 0.6324135357142857, 0.9440853333333333, 0.6262217380952381, 0.6238707380952381, 0.9315843095238096, 0.6182572619047619, 0.6162029523809524, 0.9202849523809524, 0.6108643452380952, 0.6087792261904762, 0.909237630952381, 0.6035397023809523, 0.6010273333333334, 0.8971875238095238, 0.5952812619047619, 0.5925989523809524, 0.8841085595238095, 0.5863143928571428, 0.5834745238095238, 0.8707290238095238, 0.5772933214285715, 0.5747278333333333, 0.8573223571428571, 0.5682029047619047, 0.5656684404761905, 0.84311375, 0.5587734285714285, 0.5557833214285715, 0.8280252023809523, 0.5484250476190476, 0.5450879761904762, 0.81207725, 0.5374864880952381, 0.5344990833333333, 0.7961647142857143, 0.5270869523809524, 0.5241008214285714, 0.5211760833333333, 0.7760217142857143, 0.5136853928571429, 0.5109437380952381, 0.7608697261904762, 0.5039475833333333, 0.5011205595238095, 0.7468406666666667, 0.49464491666666666, 0.49217063095238095, 0.7338129523809523, 0.4861535, 0.48381580952380954, 0.7214703452380953, 0.47814615476190475, 
0.476042619047619, 0.7102251071428571, 0.47076435714285714, 0.4687870238095238, 0.6995748452380952, 0.4640189880952381, 0.46221634523809524, 0.6898104880952382, 0.45763016666666667, 0.4558778095238095, 0.6806325357142857, 0.4518012619047619, 0.4501145357142857, 0.6721814761904762, 0.4463145, 0.44457219047619045, 0.6640260595238096, 0.440697380952381, 0.43915423809523807, 0.6558015952380952, 0.43531659523809524, 0.4337336785714286, 0.6476901428571429, 0.4297576904761905, 0.4282166785714286, 0.6391359642857143, 0.4241249047619048, 0.42246446428571427, 0.6305123809523809, 0.4181372857142857, 0.41641934523809526, 0.6212703928571428, 0.4119727976190476, 0.4100477261904762, 0.6116464404761904, 0.40528240476190475, 0.40338982142857144, 0.6014080714285714, 0.3984663333333333, 0.3963685476190476, 0.5907914642857143, 0.39122344047619045, 0.38909815476190474, 0.5797881190476191, 0.3837952261904762, 0.38173467857142857, 0.5684315595238095, 0.37630688095238096, 0.37413361904761905, 0.5570520952380953, 56.52632755952381]

Solving for scale 6:  size=[31, 25, 2],  model=AffineModel, regularizer=IdentityModel
Solving for scale 5:  size=[63, 50, 3],  model=AffineModel, regularizer=AffineModel
Solving for scale 4:  size=[125, 100, 7],  model=AffineModel, regularizer=AffineModel
Solving for scale 3:  size=[250, 200, 13],  model=AffineModel, regularizer=AffineModel
Solving for scale 2:  size=[500, 400, 26],  model=FixedScalingAffineModel, regularizer=AffineModel
Solving for scale 1:  size=[1000, 800, 53],  model=FixedScalingAffineModel, regularizer=AffineModel
Solving for scale 0:  size=[2000, 1600, 105],  model=FixedScalingAffineModel, regularizer=AffineModel
22/05/31 09:58:34 INFO org.sparkproject.jetty.server.AbstractConnector: Stopped Spark@16073fa8{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at net.imglib2.img.basictypeaccess.array.AbstractDoubleArray.<init>(AbstractDoubleArray.java:50)
    at net.imglib2.img.basictypeaccess.array.DoubleArray.<init>(DoubleArray.java:47)
    at net.imglib2.img.basictypeaccess.array.DoubleArray.createArray(DoubleArray.java:58)
    at net.imglib2.img.basictypeaccess.array.DoubleArray.createArray(DoubleArray.java:43)
    at net.imglib2.img.array.ArrayImgFactory.create(ArrayImgFactory.java:91)
    at net.imglib2.img.array.ArrayImgFactory.create(ArrayImgFactory.java:68)
    at net.imglib2.img.array.ArrayImgs.doubles(ArrayImgs.java:558)
    at org.janelia.flatfield.FlatfieldCorrectionSolver.unpivotSolution(FlatfieldCorrectionSolver.java:414)
    at org.janelia.flatfield.FlatfieldCorrection.run(FlatfieldCorrection.java:391)
    at org.janelia.flatfield.FlatfieldCorrection.run(FlatfieldCorrection.java:195)
    at org.janelia.flatfield.FlatfieldCorrection.main(FlatfieldCorrection.java:80)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
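
The exception is raised in thread "main" with SparkSubmit frames on the stack, which suggests the allocation failed in the driver-side JVM rather than in an executor. For a rough sense of scale, the sketch below estimates the size of a single array of doubles at each scale listed in the log. It assumes (not verified against the FlatfieldCorrectionSolver source) that the array created in unpivotSolution holds one double per voxel at the full scale-0 size; the function name is illustrative only.

```python
import math

BYTES_PER_DOUBLE = 8

# Sizes copied from the "Solving for scale N" lines in the log above.
scales = {
    6: (31, 25, 2),
    5: (63, 50, 3),
    4: (125, 100, 7),
    3: (250, 200, 13),
    2: (500, 400, 26),
    1: (1000, 800, 53),
    0: (2000, 1600, 105),
}


def double_array_bytes(shape):
    """Bytes needed for one dense array of doubles with the given shape."""
    return math.prod(shape) * BYTES_PER_DOUBLE


for scale, shape in scales.items():
    size = double_array_bytes(shape)
    print(f"scale {scale}: {shape} -> {size / 2**30:.3f} GiB per double array")
```

Under that assumption, one scale-0 array alone needs a few GiB of contiguous heap, so the heap size of the JVM running the main class matters independently of executor memory.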

The configuration that finally worked:

I did not change any yarn / spark cluster or job properties.

The job took 25.1hrs to run.

The executors page shows peak JVM on-heap memory of up to ~60GB per executor (full disclosure: I don't have a great grasp of what these metrics mean).

[Screenshot: Spark executors page]

With 8 cores per executor, that works out to a minimum of roughly 8GB per core. That gives 8 * 96 = 768GB of required memory, which is the size of my full dataset. Is this expected in the general case? Does it depend on the number of cores used?
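
The estimate above can be written out as a short sketch. It assumes that the 96 in "8 * 96" is the total number of executor cores in the cluster, which is an interpretation and not something stated explicitly; all variable names are illustrative.

```python
# Figures from the executors page and the estimate above.
peak_heap_per_executor_gb = 60   # observed peak JVM on-heap memory (approximate)
cores_per_executor = 8
total_cores = 96                 # assumption: the 96 in "8 * 96" is the total core count

# Unrounded per-core figure, which the text rounds up to ~8GB.
per_core_gb = peak_heap_per_executor_gb / cores_per_executor

# Cluster-wide totals, with and without rounding the per-core figure.
n_executors = total_cores // cores_per_executor
total_peak_gb = n_executors * peak_heap_per_executor_gb
rounded_total_gb = 8 * total_cores

print(f"per core:        {per_core_gb} GB")
print(f"executors:       {n_executors}")
print(f"total (peak):    {total_peak_gb} GB")
print(f"total (rounded): {rounded_total_gb} GB")
```

The unrounded total comes out a little below the rounded 768GB figure, so the match with the dataset size may be partly a rounding coincidence.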

Thank you, Cameron

P.S. Is this step mandatory, or can I skip straight to stitching after converting the input tiles to N5?