Hi Martin,
Here are the key parameters we typically run with at Janelia:
--conf spark.executor.cores=30
--conf spark.executor.memory=500g
--conf spark.driver.memory=14g
--maxFeatureSourceCacheGb 2
--maxFeatureCacheGb 15
At Janelia, the Spark executors run on nodes with 32 cores, so for each node we give 30 cores to the executor and leave 2 cores for the JVM. We typically run the Spark driver on just one core. We are allocated 15g of memory per core.
Each Spark executor JVM maintains one shared feature source cache containing the source image pixels for each tile and one shared feature cache containing the SIFT features extracted for each tile (or for each tile edge when montage matching). Both --maxFeatureSourceCacheGb and --maxFeatureCacheGb have a default value of 2. My guess is that this is the source of your problem since you are only allocating 4G per worker and it is important to leave some memory for non-cache elements.
I suggest you try increasing your worker memory allocation or decreasing the cache sizes.
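For example (just an illustrative sizing, not a tested recommendation), if you keep 4g executors you would want the two caches together to stay well below the heap:
# hypothetical sizing for a small 4g executor: at most 2g combined for the caches,
# leaving the rest of the heap for feature extraction and matching
--conf spark.executor.memory=4g
--maxFeatureSourceCacheGb 1
--maxFeatureCacheGb 1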
Here is a little more context to help with deciding how to size the caches ...
The feature source cache improves performance for same layer (montage) matching since tiles can be loaded once and then pulled from cache when rendering each of the potentially four clipped edges. The feature source cache is not as important for cross layer matching since tiles are not clipped - cross layer tiles get rendered once and the extracted features then get cached.
The feature cache improves performance for cross layer matching since features can be extracted once and then reused when looking for matches across different layers.
Technically, you could measure the impact of different cache sizes by comparing runs. I have not done that, but it would be interesting to see. Setting the max cache sizes to 0 will disable caching altogether.
Finally, the caches do have to be cleaned up when they fill, so it is a good idea to only make them as large as you really need. Output from the tile pair client is sorted to keep similar pairs together, with the hope of improving cache hits and allowing smaller cache sizes to be sufficient.
Hope this helps, Eric
I've been fiddling around with this for a while now, but I cannot figure out what role the Spark driver plays in the Java memory allocation.
I run Spark standalone and the jobs get assigned to workers properly. However, since the client launches spark-submit with deploy-mode 'client', Java gets its resources from the driver, which in this case is the submission node. That node has very limited resources, so no matter how much memory I assign to the workers, the driver only loads 1024M, and this is why it crashes. When I manually force the driver onto one of the worker nodes (deploy-mode 'cluster'), the driver gets assigned one full worker node with all its resources. BUT: the memory issues are gone.
How can I assign specific resources to the driver? I don't want to waste many CPUs for it.
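(For reference, spark-submit's help does list a --driver-cores option for standalone cluster deploy-mode, so I would expect a sketch like the following to cap the driver at a single core and a few GB of memory; I have not verified this behaviour on my cluster.)
# hypothetical: limit the driver's share of a worker node in cluster deploy-mode
./bin/spark-submit \
--class org.janelia.render.client.spark.SIFTPointMatchClient \
--master ${MASTER_URL} \
--deploy-mode cluster \
--driver-cores 1 \
--driver-memory 4g \
--conf spark.executor.cores=8 \
--conf spark.executor.memory=16g \
${APPLICATION_JAR} ${APP_ARGS}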
And when I run with deploy-mode cluster, this is what it shows me in terms of executors:
This doesn't look right to me: it seems to be running something on only one node. Or is it a problem with the UI?
In fact, this is the only node where things have moved further than:
20/10/19 17:07:56 INFO ResourceUtils: ==============================================================
20/10/19 17:07:56 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
20/10/19 17:07:56 INFO Executor: Starting executor ID 6 on host 10.11.13.130
20/10/19 17:07:56 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40071.
20/10/19 17:07:56 INFO NettyBlockTransferService: Server created on 10.11.13.130:40071
20/10/19 17:07:56 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/10/19 17:07:56 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(6, 10.11.13.130, 40071, None)
20/10/19 17:07:56 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(6, 10.11.13.130, 40071, None)
20/10/19 17:07:56 INFO BlockManager: Initialized BlockManager: BlockManagerId(6, 10.11.13.130, 40071, None)
But not much further; the last update came more than 15 minutes ago...
2020-10-19 17:08:57,880 [Executor task launch worker for task 1] [partition 1] INFO [org.janelia.alignment.match.cache.CanvasFeatureListLoader] load: exit
2020-10-19 17:08:57,880 [Executor task launch worker for task 1] [partition 1] INFO [org.janelia.render.client.spark.SIFTPointMatchClient] derive matches between 0000.0014.00446::TOP and 0000.0022.00446::BOTTOM
2020-10-19 17:08:57,880 [Executor task launch worker for task 1] [partition 1] INFO [org.janelia.alignment.match.CanvasFeatureMatcher] deriveMatchResult: entry, canvas1Features.size=110105, canvas2Features.size=116238
2020-10-19 17:08:58,287 [Executor task launch worker for task 0] [partition 0] INFO [org.janelia.alignment.match.CanvasFeatureExtractor] extractFeatures: exit, extracted 117342 features, elapsedTime=27966ms
2020-10-19 17:08:58,287 [Executor task launch worker for task 0] [partition 0] INFO [org.janelia.alignment.match.cache.CanvasFeatureListLoader] load: exit
2020-10-19 17:08:58,287 [Executor task launch worker for task 0] [partition 0] INFO [org.janelia.render.client.spark.SIFTPointMatchClient] derive matches between 0000.0012.00442 and 0000.0012.00443
2020-10-19 17:08:58,287 [Executor task launch worker for task 0] [partition 0] INFO [org.janelia.alignment.match.CanvasFeatureMatcher] deriveMatchResult: entry, canvas1Features.size=116677, canvas2Features.size=117342
This looks reasonable, but after that nothing more happened...
I somehow have the feeling that the whole Spark standalone setup I am running is missing something crucial to work properly.
Hi Martin,
I set up a new spark cluster that (I hope) is similar to your setup so that I could help you.
A couple of key things that you might have missed are:
The --conf spark.default.parallelism= option should be set to the number of executor cores * the number of executors. I think you only saw 2 active tasks because this parameter was not set.
If you are using the client deploy-mode, you need to use the --driver-memory option instead of the --conf spark.driver.memory= option. Alternatively, you could edit conf/spark-defaults.conf. This is mentioned here: https://spark.apache.org/docs/latest/configuration.html . The driver memory displayed in the web UI seems to show the same value no matter what the actual max memory for the driver is. I don't know why. You can see the actual max by looking at the java process created by spark-submit.
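For example, the same settings could go into the properties file, and the effective driver heap can be checked where spark-submit runs (placeholder values; assumes a Linux host):
# conf/spark-defaults.conf (placeholder values)
spark.driver.memory       14g
spark.default.parallelism 16
# on the node running spark-submit, check the actual -Xmx of the driver JVM
ps -o pid,args -C java | grep -o -- '-Xmx[0-9a-zA-Z]*'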
Here are the steps I took to successfully setup and run my test:
#---------------------------
ssh <master-node>
cd ${SPARK_TEST_DIR}
curl -o spark-3.0.1-bin-hadoop2.7.tgz "https://apache.osuosl.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz"
tar xzf spark-3.0.1-bin-hadoop2.7.tgz
cd spark-3.0.1-bin-hadoop2.7
cp conf/log4j.properties.template conf/log4j.properties
./sbin/start-master.sh
# get spark master URL from log
grep "spark://" logs/spark-*Master-1*.out
#---------------------------
ssh <worker-node-1>
cd ${SPARK_TEST_DIR}/spark-3.0.1-bin-hadoop2.7; export MASTER_URL="spark://<master-node>:7077"
./sbin/start-slave.sh --cores 8 --memory 16G ${MASTER_URL}
# check http://<master-node>:8080/ to confirm first worker is there
#---------------------------
ssh <worker-node-2>
cd ${SPARK_TEST_DIR}/spark-3.0.1-bin-hadoop2.7; export MASTER_URL="spark://<master-node>:7077"
./sbin/start-slave.sh --cores 8 --memory 16G ${MASTER_URL}
# check http://<master-node>:8080/ to confirm second worker is there
#---------------------------
ssh <driver-node>
cd ${SPARK_TEST_DIR}/spark-3.0.1-bin-hadoop2.7; export MASTER_URL="spark://<master-node>:7077"
export APPLICATION_JAR="..."; export APP_ARGS="--baseDataUrl ... --pairJson ..."
./bin/spark-submit \
--class org.janelia.render.client.spark.SIFTPointMatchClient \
--master ${MASTER_URL} \
--deploy-mode client \
--driver-memory 2g \
--conf spark.executor.cores=8 \
--conf spark.executor.memory=16g \
--conf spark.default.parallelism=16 \
${APPLICATION_JAR} \
${APP_ARGS} \
--maxFeatureSourceCacheGb 7 \
--maxFeatureCacheGb 7
Let me know if this helps, Eric
I got the general concept. My problem is that I need to assign the node resources dynamically, and I cannot get a driver to launch that way. I can start the master node and register as many workers as I want, but then there is no driver that can connect. If I simply launch another shell on some node, it cannot submit executors; the job has to be submitted from the initial submission node. There, however, I cannot assign resources to the driver, so we end up with the memory problems. When I try to submit from another node, it cannot assign resources and does not connect.
My simple problem is, how do I properly start the d*mn driver...
By now, I think I tried all the various shell juggling scripts that are around on github for getting it to work. But without success...
this is getting frustrating...
aaaaaahh
I have to submit the spark jobs from the same process that starts the master and executors...
It seems to work now.
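For anyone hitting the same wall, the working pattern is roughly the following (a simplified sketch, not my exact script; paths, resource numbers, and the worker launch details are placeholders):
#!/bin/bash
# single SLURM job script: start the master, start a worker on every allocated
# node, then call spark-submit from this same process
cd ${SPARK_DIR}
./sbin/start-master.sh
export MASTER_URL="spark://$(hostname):7077"
# run the workers in the foreground of each srun task so SLURM keeps them alive
srun --ntasks-per-node=1 ./bin/spark-class org.apache.spark.deploy.worker.Worker \
--cores 8 --memory 16G ${MASTER_URL} &
sleep 30   # give the workers time to register with the master
./bin/spark-submit \
--class org.janelia.render.client.spark.SIFTPointMatchClient \
--master ${MASTER_URL} \
--deploy-mode client \
--driver-memory 4g \
--conf spark.executor.cores=8 \
--conf spark.executor.memory=16g \
--conf spark.default.parallelism=16 \
${APPLICATION_JAR} ${APP_ARGS}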
Hi,
I just re-tried running the client on a new stack and got this error again.
I tried running the executors with 32GB each and it failed. Then I removed 4GB from each worker node for the driver (same as the driver memory spec), but still...
Hi Martin,
This line in slurm-6768634.out indicates that everything ran fine but no matches were found:
20/11/17 16:26:20 INFO SIFTPointMatchClient: generateMatchesForPairs: saved matches for 0 out of 120 pairs (0%) on 4 partitions
Which log has the error?
I have these warnings again:
20/11/17 16:26:14 WARN Master: App app-20201117162613-0000 requires more resource than any of Workers could have.
20/11/17 16:26:14 WARN Master: App app-20201117162613-0000 requires more resource than any of Workers could have.
in spark-master.../logs/spark-...
That's why I thought it was a memory issue. I re-ran that data now and still get no matches. This is a bit strange, since the match trials I did look really nice.
Also, for all the workers the stdout fills the .err file instead of the .out file, so I figured that something went wrong during execution. In some of the worker log directories, it also puts the jar file, while in others it doesn't.
Looking at the worker's logs, it seems that it indeed does not identify matches:
2020-11-18 10:06:46,536 [Executor task launch worker for task 0] [partition 0] INFO [org.janelia.alignment.match.CanvasFeatureMatcher] deriveMatchResult: entry, canvas1Features.size=13, canvas2Features.size=6
2020-11-18 10:06:46,536 [Executor task launch worker for task 0] [partition 0] INFO [org.janelia.alignment.match.MatchFilter] filterMatches: filtered 0 inliers from 0 candidates
2020-11-18 10:06:46,537 [Executor task launch worker for task 0] [partition 0] INFO [org.janelia.alignment.match.CanvasFeatureMatcher] deriveMatchResult: exit, result={'consensusSetSizes' : [0], 'inlierRatio' : 0.0}, elapsedTime=0s
2020-11-18 10:06:46,537 [Executor task launch worker for task 0] [partition 0] INFO [org.janelia.render.client.spark.SIFTPointMatchClient] derived matches for 0 out of 30 pairs, cache stats are CacheStats{hitCount=0, missCount=236, loadSuccessCount=236, loadExceptionCount=0, totalLoadTime=9073800455, evictionCount=0}
These are the parameters that I use for launching:
--SIFTfdSize 6 --SIFTminScale 0.75 --SIFTmaxScale 1 --SIFTsteps 4 --matchRod 0.5 --matchMaxEpsilon 3.0 --matchMinInlierRatio 0.0 --matchMinNumInliers 9 --matchMaxNumInliers 200 --maxFeatureCacheGb 6 --maxFeatureSourceCacheGb 6 --renderScale 0.5 --matchIterations 400 --matchMaxTrust 4
And this is the match trial that works for many random tile pairs that I checked:
"siftFeatureParameters" : {
"fdSize" : 6,
"minScale" : 0.75,
"maxScale" : 1.0,
"steps" : 4
},
"matchDerivationParameters" : {
"matchRod" : 0.5,
"matchModelType" : "SIMILARITY",
"matchIterations" : 400,
"matchMaxEpsilon" : 3.0,
"matchMinInlierRatio" : 0.0,
"matchMinNumInliers" : 9,
"matchMaxTrust" : 4.0,
"matchFilter" : "SINGLE_SET",
"matchFullScaleCoverageRadius" : 150.0
}
with
"stats" : {
"pFeatureCount" : 3122,
"pFeatureDerivationMilliseconds" : 962,
"qFeatureCount" : 3210,
"qFeatureDerivationMilliseconds" : 857,
"consensusSetSizes" : [ 22 ],
"matchDerivationMilliseconds" : 3707,
"aggregateDeltaXStandardDeviation" : 0.9397684112672783,
"aggregateDeltaYStandardDeviation" : 0.8933347787834419,
"consensusSetDeltaXStandardDeviations" : [ 0.9397684112672783 ],
"consensusSetDeltaYStandardDeviations" : [ 0.8933347787834419 ],
"overlappingImagePixels" : 397932,
"overlappingCoveragePixels" : 302708,
"matchQualityMilliseconds" : 9
}
Do you spot any discrepancy?
It seems I am missing a few parameters...
I now added
--matchModelType SIMILARITY --matchFullScaleCoverageRadius 150.0
Still no matches...
Is it the filters? How can I control these in the parameters?
By default, the built-in filter is enabled for the SIFTPointMatchClient (equivalent to including filter=true in the match trial render parameters URL). The filter is intended as a simplistic intensity correction measure. You can disable it in the client by including --renderWithFilter false.
I'm happy to take a closer look to see if I can identify what's different between your trial run and your client run.
Minimally, please send me:
That should be enough to spot the issue. If not, I might ultimately need the source tile images for the pair.
Oh gosh, my stupid...
Thanks for that simple suggestion of putting the calling command into the log. That led me to my mistake: I was pulling the parameters from a wrong JSON file that still referred to the wrong pixel scaling...
Indeed, I did not have any clipping parameter in there. Does it make sense to use one? What is the default overlap zone it searches in for in-plane tiles (TOP-BOTTOM, ...)?
Clipping greatly improves performance for montage (in-plane) tile matching. To see how much, just run 2 trials - one with clipping and one without. There is no default overlap zone (the default is no clipping). You specify how much to clip through the --clipWidth and --clipHeight parameters supported by the various match clients.
How much to clip simply depends upon how much overlap there was during acquisition. Ideally, you want to clip your largest possible overlap but no more.
In general, same layer match sources should be clipped and rendered at higher scales while cross layer match sources should not be clipped and can be rendered at lower scales.
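As a rough sketch (the clip sizes and scales here are placeholders; pick them from your actual acquisition overlap):
# montage (same layer) pairs: clip to the overlap zone, render at a higher scale
--clipWidth 500 --clipHeight 500 --renderScale 0.5
# cross layer pairs: no clipping, render at a lower scale
--renderScale 0.25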
Hi,
I am running into
java.lang.OutOfMemoryError: Java heap space
when trying to run the SIFT client on Spark. Even though every worker has 4GB of memory assigned, it only initializes them with 1024M and shows them with much less (366M) when running. Any idea what could cause this?