Closed ZeweiChen11 closed 4 years ago
It looks like the executor loss is caused by an OOM (Java heap space); please check your executor's log.
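As a rough aid for checking the executor log, here is a minimal Python sketch that scans log text for the memory-failure messages discussed in this thread. The function name `classify_executor_log` and the signature labels are hypothetical; the matched strings are the standard JVM messages for heap exhaustion and for a failed native allocation.

```python
import re

# Signatures of the memory failures mentioned in this thread.
# The dictionary keys are illustrative labels, not Spark or JVM terms.
OOM_PATTERNS = {
    "java_heap": re.compile(r"java\.lang\.OutOfMemoryError: Java heap space"),
    "native_memory": re.compile(
        r"There is insufficient memory for the Java Runtime Environment"
    ),
    "mmap_failed": re.compile(
        r"Native memory allocation \(mmap\) failed to map (\d+) bytes"
    ),
}


def classify_executor_log(text):
    """Return the set of memory-failure signatures found in the log text."""
    found = set()
    for line in text.splitlines():
        for name, pattern in OOM_PATTERNS.items():
            if pattern.search(line):
                found.add(name)
    return found
```

A `java_heap` hit means the JVM heap was exhausted, while `native_memory` or `mmap_failed` means the JVM could not obtain memory from the operating system, so the two cases may call for different tuning.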
The same scenario with ImageNet also reports an error in the executor:
“# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 8388608 bytes for committing reserved memory.”
The executor is lost because of the error above. But why isn't the same error trace reported regardless of the dataset?
When training the resnet-50 example on Place365 with the mklblas engine, the Spark executor exits with an error like:
Driver log:
Script:
There are 2 executors per node and 4 worker nodes in the cluster in total. Memory configuration per node: 192 GB DRAM + 512 GB DCPMM. We did not hit this error when using the mkldnn engine.
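For reference, a back-of-the-envelope sketch of how much heap each executor could be given under this layout. It rests on several assumptions that are not from the report: only DRAM is counted (DCPMM is ignored), 16 GiB per node is reserved for the OS and other processes, and the off-heap overhead follows Spark's documented default of 10% of executor memory with a 384 MiB floor. The function name and parameters are hypothetical.

```python
def max_executor_heap_gib(node_dram_gib, executors_per_node,
                          os_reserve_gib=16.0,       # assumed reserve, not from the report
                          overhead_factor=0.10,      # Spark's default overhead factor
                          min_overhead_gib=0.375):   # 384 MiB floor
    """Largest heap (GiB) so that heap + overhead fits in one executor's DRAM share."""
    share = (node_dram_gib - os_reserve_gib) / executors_per_node
    # Solve heap + overhead_factor * heap <= share for heap.
    heap = share / (1.0 + overhead_factor)
    # If the proportional overhead falls below the floor, the floor applies instead.
    if heap * overhead_factor < min_overhead_gib:
        heap = share - min_overhead_gib
    return heap


# 192 GiB DRAM per node, 2 executors per node, as described above.
heap = max_executor_heap_gib(192, 2)
print(f"approximate max heap per executor: {heap:.1f} GiB")
```

If the configured executor memory plus overhead exceeds this kind of budget, native allocations can fail even when the Java heap itself is not full.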