When some resource requests have not yet been satisfied and a dependency-timeout check is configured, the check throws an exception like this:
2021-12-09 06:18:04 INFO ApplicationMaster:1199 - Successfully started container container_e03_1582553233674_1290236_01_000033
2021-12-09 06:18:04 ERROR TFRuntime:149 - Failed to check dependency timeout.
java.lang.NullPointerException
    at com.linkedin.tony.runtime.MLGenericRuntime.lambda$groupDependencyTimeout$1(MLGenericRuntime.java:211)
    at java.util.stream.ReferencePipeline$5$1.accept(ReferencePipeline.java:227)
    at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.LongPipeline.reduce(LongPipeline.java:443)
    at java.util.stream.LongPipeline.max(LongPipeline.java:406)
    at com.linkedin.tony.runtime.MLGenericRuntime.groupDependencyTimeout(MLGenericRuntime.java:212)
    at com.linkedin.tony.runtime.MLGenericRuntime.isHealthy(MLGenericRuntime.java:147)
    at com.linkedin.tony.ApplicationMaster.monitor(ApplicationMaster.java:749)
    at com.linkedin.tony.ApplicationMaster.run(ApplicationMaster.java:422)
    at com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:356)
Solution
To get accurate information about the running tasks, the dependency-group timeout check should be skipped until all tasks have been scheduled.
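The idea can be illustrated with a minimal, self-contained sketch. It is not the actual TonY code: the `Task` class, the `startTimeMs` field, and the `latestStartTime` method are hypothetical names chosen for illustration, and the sketch assumes that a task that has not been scheduled yet simply has no start time recorded. Under that assumption, taking the maximum over a stream of start times would unbox a `null` and fail, so the guard returns early instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalLong;

public class DependencyTimeoutGuard {
    /** Hypothetical stand-in for a task; startTimeMs stays null until the task is scheduled. */
    static final class Task {
        final String name;
        final Long startTimeMs;

        Task(String name, Long startTimeMs) {
            this.name = name;
            this.startTimeMs = startTimeMs;
        }
    }

    /** True only when every task in the group has a recorded start time. */
    static boolean allScheduled(List<Task> tasks) {
        return tasks.stream().allMatch(t -> t.startTimeMs != null);
    }

    /**
     * Latest start time in the group, or empty when the check should be skipped
     * because at least one task is still waiting for resources.
     */
    static OptionalLong latestStartTime(List<Task> tasks) {
        if (!allScheduled(tasks)) {
            // Without this guard, unboxing a null start time below would throw an NPE.
            return OptionalLong.empty();
        }
        return tasks.stream().mapToLong(t -> t.startTimeMs).max();
    }

    public static void main(String[] args) {
        List<Task> partial = Arrays.asList(new Task("chief", 1000L), new Task("worker", null));
        List<Task> full = Arrays.asList(new Task("chief", 1000L), new Task("worker", 2500L));

        System.out.println(latestStartTime(partial).isPresent()); // prints "false"
        System.out.println(latestStartTime(full).getAsLong());    // prints "2500"
    }
}
```

Returning an empty `OptionalLong` makes the "skip for now" case explicit to the caller, which can treat it as healthy and retry on the next monitoring pass.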
Tips:
I added a test case for this problem. If you revert the change in MLGenericRuntime.groupDependencyTimeout and rerun the testPartialTaskScheduledShouldPass test case, you can reproduce the problem.